CN106202259A - A web page information extraction method based on ontology - Google Patents

Info

Publication number
CN106202259A
Authority
CN
China
Prior art keywords
web page
url
subject
degree
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201610499614.8A
Other languages
Chinese (zh)
Inventor
董雄飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Minzhongyixing Software Development Co Ltd
Original Assignee
Hefei Minzhongyixing Software Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Minzhongyixing Software Development Co Ltd
Priority to CN201610499614.8A
Publication of CN106202259A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next calculates the feature weights, then analyzes the topic relevance of the web page using the ontology, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces the omission of web page information, and improves the quality of information retrieval.

Description

A web page information extraction method based on ontology
Technical field
The invention belongs to the field of network technology, and more specifically relates to a web page information extraction method based on ontology.
Background art
With the rapid development of the Internet, the number of web pages on the Web is growing exponentially. Faced with such a huge resource, retrieving and discovering valuable information on the Web has become an important task. Web-based research covers information retrieval, information filtering, information extraction, search engines, web page classification, and so on, and the main object all of them process is web page information. Besides the main content that expresses its topic, a web page also contains noise unrelated to that topic, such as navigation bars, advertisements, copyright notices, and partner links.
Summary of the invention
The problem to be solved by the invention is to provide a web page information extraction method based on ontology.
To achieve this objective, the technical solution adopted by the invention is:
A web page information extraction method based on ontology, comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a focused crawler performs structured analysis, in the form of a tag tree, of the page's anchor text, page title, body headings, and body text, and processes them into web page text;
(2) Ontological classification
The interface of the word segmentation system FreeICTCLAS is used to segment the text, the words are classified according to the ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula, the formula being Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system.
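As a minimal sketch of the tag-tree analysis in step (1), the page can be walked with Python's standard `html.parser`, routing each piece of text to one of the four fields the step names. The class name `PageFields`, the function `extract_fields`, and the rule that text inside `<a>` counts as anchor text and text inside `<h1>` to `<h6>` counts as a body heading are assumptions of this illustration, not details given in the source.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title, headings, anchor text and remaining body text from the tag tree."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.fields = {"title": [], "headings": [], "anchors": [], "body": []}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self.stack):
            return
        if "title" in self.stack:
            self.fields["title"].append(text)
        elif "a" in self.stack:
            self.fields["anchors"].append(text)
        elif any(t in self.HEADINGS for t in self.stack):
            self.fields["headings"].append(text)
        else:
            self.fields["body"].append(text)


def extract_fields(html):
    """Return the four text fields of a page as plain strings."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}
```

Keeping the four fields separate leaves room for later steps to weight them differently, which the source suggests but does not spell out.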
Preferably, in step (1), the focused crawler is implemented in the following steps:
1. selecting the training set;
2. obtaining the transition probabilities between the topic categories, and the topic classifier, from the training set;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block.
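The source does not state how the transition probabilities between topic categories are obtained from the training set. One plausible reading, offered here only as an assumption, is to count the links in the training set that lead from a page of one category to a page of another and to normalize the counts per source category. The function name `transition_probabilities` and the `(source, target)` input format are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(labeled_links):
    """Estimate P(target category | source category) from (source, target) label pairs.

    labeled_links is an assumed input format: one pair per hyperlink in the
    training set, giving the category of the linking page and of the linked page.
    """
    counts = defaultdict(Counter)
    for source, target in labeled_links:
        counts[source][target] += 1
    probs = {}
    for source, row in counts.items():
        total = sum(row.values())
        probs[source] = {target: n / total for target, n in row.items()}
    return probs
```

With such a table, the probability that a page of topic q leads toward the target topic can be looked up directly when URL weights are assigned.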
Preferably, in step (1), the focused crawler determines the access priority of the URLs in a web page as follows:
1. the information contained in the URL string itself is used to determine the weight of the URL;
2. the anchor text of the URL is used to determine the weight of the URL;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of each of these URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue of URLs to be crawled in order of their weights, and URLs with higher weights are crawled first.
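The weighting in rule 3 and the ordering in rule 4 can be sketched with a priority queue. The sketch below uses Python's `heapq`; the names `CrawlQueue` and `ordinary_url_weight` are hypothetical, and breaking ties by insertion order is an assumption, since the source does not say how equal weights are ordered.

```python
import heapq
import itertools


class CrawlQueue:
    """Frontier that always yields the URL with the highest weight first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: earlier insertions first

    def push(self, url, weight):
        # heapq is a min-heap, so the weight is stored negated.
        heapq.heappush(self._heap, (-weight, next(self._order), url))

    def pop(self):
        neg_weight, _, url = heapq.heappop(self._heap)
        return url, -neg_weight

    def __len__(self):
        return len(self._heap)


def ordinary_url_weight(block_similarity, transition_to_target):
    """Rule 3: similarity(block, q) multiplied by P(q -> target topic)."""
    return block_similarity * transition_to_target
```

For example, a block with similarity 0.9 to its topic q and a transition probability of 0.8 from q to the target topic gives its ordinary URLs a weight of about 0.72, so they are crawled before URLs weighted 0.3.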
Preferably, constructing the topic classifier in step 2 comprises the following steps:
1) selecting the feature words;
2) computing the weight of each component of the class center vector;
3) determining the topic category to which a page block to be classified belongs;
4) computing the similarity between a downloaded web page and the topic category.
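The four steps describe what is commonly called a class center (centroid) classifier. A minimal sketch follows, assuming that each class center is the average of the term vectors of its training documents and that similarity is the cosine of the angle between vectors; the source names neither choice, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centers(training_set):
    """Average the term vectors of each category to get its class center vector."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for category, vector in training_set:
        sizes[category] += 1
        for term, weight in vector.items():
            sums[category][term] += weight
    return {c: {t: w / sizes[c] for t, w in terms.items()} for c, terms in sums.items()}


def classify(vector, centers):
    """Return (best category, similarity) for a page block or page vector."""
    scored = ((c, cosine(vector, center)) for c, center in centers.items())
    return max(scored, key=lambda pair: pair[1])
```

The similarity returned by `classify` is the quantity that rule 3 of the URL weighting multiplies by the transition probability.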
Preferably, the weight calculation in step (3) should also take into account the term frequency, the inverse document frequency, and a normalization factor.
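A weight that combines these three quantities is usually called TF-IDF. The sketch below uses one common variant, tf × log(N/n + 0.01) divided by the Euclidean length of the document vector; this particular variant, including the 0.01 smoothing constant, is an assumption and is not the formula given in the source.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one L2-normalized {term: weight} dict each."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        raw = {t: tf * math.log(n_docs / doc_freq[t] + 0.01) for t, tf in term_freq.items()}
        # Normalization factor: the Euclidean length of the raw weight vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors
```

Normalization keeps long pages from dominating short ones purely because they contain more words.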
Preferably, in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
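The comparison in step (5) can be sketched as a simple filter over pages whose topic relevance has already been calculated. The function name `filter_pages` is hypothetical, and because the source only covers the "greater than" and "less than" cases, discarding a page whose relevance equals the threshold is an assumption of this sketch.

```python
def filter_pages(scored_pages, threshold):
    """Split (url, relevance) pairs into retained and discarded URL lists."""
    kept, discarded = [], []
    for url, relevance in scored_pages:
        # Equality is treated as a discard; the source leaves this case open.
        (kept if relevance > threshold else discarded).append(url)
    return kept, discarded
```

For instance, with a threshold of 0.5, a page scored 0.8 is retained while pages scored 0.2 and 0.5 are discarded.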
Beneficial effects: the invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next calculates the feature weights, then analyzes the topic relevance of the web page using the ontology, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces the omission of web page information, and improves the quality of information extraction.
Detailed description of the invention
Fig. 1 is a flow chart of a web page information extraction method based on ontology;
A web page information extraction method based on ontology, characterized by comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a focused crawler performs structured analysis, in the form of a tag tree, of the page's anchor text, page title, body headings, and body text, and processes them into web page text. The focused crawler is implemented in the following steps:
1. selecting the training set;
2. obtaining the transition probabilities between the topic categories, and the topic classifier, from the training set, where constructing the topic classifier comprises the following steps:
1) selecting the feature words;
2) computing the weight of each component of the class center vector;
3) determining the topic category to which a page block to be classified belongs;
4) computing the similarity between a downloaded web page and the topic category;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block;
The focused crawler determines the access priority of the URLs in a web page as follows:
1. the information contained in the URL string itself is used to determine the weight of the URL;
2. the anchor text of the URL is used to determine the weight of the URL;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of each of these URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue of URLs to be crawled in order of their weights, and URLs with higher weights are crawled first;
(2) Ontological classification
The interface of the word segmentation system FreeICTCLAS is used to segment the text, the words are classified according to the ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula; the weight calculation should also take into account the term frequency, the inverse document frequency, and a normalization factor. The formula is Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system. If the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
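Step (2) can be sketched as segmentation followed by a lookup that maps each word to an ontology concept and counts the result. The source names FreeICTCLAS as the segmenter but does not show its interface, so the sketch takes the segmenter as a caller-supplied function; the function name `concept_frequencies` and the `{word: concept}` form of the ontology are hypothetical.

```python
from collections import Counter


def concept_frequencies(text, segment, ontology):
    """Segment text, map each word to an ontology concept, and count occurrences.

    segment:  callable returning a list of words; a wrapper around a real
              segmenter would be passed here (its API is not assumed).
    ontology: hypothetical {word: concept} lookup; words missing from it are
              treated as non-feature words and skipped.
    """
    words = segment(text)
    concepts = (ontology[w] for w in words if w in ontology)
    return Counter(concepts)
```

For space-delimited text, `str.split` can stand in for the segmenter; the resulting concept counts are the frequencies that the weight calculation in step (3) starts from.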
The invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next calculates the feature weights, then analyzes the topic relevance of the web page using the ontology, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces the omission of web page information, and improves the quality of information retrieval.
The foregoing is merely an embodiment of the invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the description of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.

Claims (6)

1. A web page information extraction method based on ontology, characterized by comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a focused crawler performs structured analysis, in the form of a tag tree, of the page's anchor text, page title, body headings, and body text, and processes them into web page text;
(2) Ontological classification
The interface of the word segmentation system FreeICTCLAS is used to segment the text, the words are classified according to the ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula, the formula being Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system.
2. The web page information extraction method based on ontology according to claim 1, characterized in that in step (1) the focused crawler is implemented in the following steps:
1. selecting the training set;
2. obtaining the transition probabilities between the topic categories, and the topic classifier, from the training set;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block.
3. The web page information extraction method based on ontology according to claim 1, characterized in that in step (1) the focused crawler determines the access priority of the URLs in a web page as follows:
1. the information contained in the URL string itself is used to determine the weight of the URL;
2. the anchor text of the URL is used to determine the weight of the URL;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of each of these URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue of URLs to be crawled in order of their weights, and URLs with higher weights are crawled first.
4. The web page information extraction method based on ontology according to claim 2, characterized in that constructing the topic classifier in step 2 comprises the following steps:
1) selecting the feature words;
2) computing the weight of each component of the class center vector;
3) determining the topic category to which a page block to be classified belongs;
4) computing the similarity between a downloaded web page and the topic category.
5. The web page information extraction method based on ontology according to claim 1, characterized in that the weight calculation in step (3) should also take into account the term frequency, the inverse document frequency, and a normalization factor.
6. The web page information extraction method based on ontology according to claim 1, characterized in that in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
CN201610499614.8A 2016-06-29 2016-06-29 A web page information extraction method based on ontology Withdrawn CN106202259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610499614.8A CN106202259A (en) 2016-06-29 2016-06-29 A web page information extraction method based on ontology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610499614.8A CN106202259A (en) 2016-06-29 2016-06-29 A web page information extraction method based on ontology

Publications (1)

Publication Number Publication Date
CN106202259A true CN106202259A (en) 2016-12-07

Family

ID=57463372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610499614.8A Withdrawn CN106202259A (en) 2016-06-29 2016-06-29 A web page information extraction method based on ontology

Country Status (1)

Country Link
CN (1) CN106202259A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590219A (en) * 2017-09-04 2018-01-16 电子科技大学 Webpage personage subject correlation message extracting method
CN109063076A (en) * 2018-07-24 2018-12-21 维沃移动通信有限公司 A kind of Picture Generation Method and mobile terminal
CN109063076B (en) * 2018-07-24 2021-07-13 维沃移动通信有限公司 Picture generation method and mobile terminal
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN112597369A (en) * 2020-12-22 2021-04-02 荆门汇易佳信息科技有限公司 Webpage spider theme type search system based on improved cloud platform

Similar Documents

Publication Publication Date Title
CN107818105B (en) Recommendation method of application program and server
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN107992764B (en) Sensitive webpage identification and detection method and device
CN111104510B (en) Text classification training sample expansion method based on word embedding
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
CN105975491A (en) Enterprise news analysis method and system
CN106202259A (en) A web page information extraction method based on ontology
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN110909531B (en) Information security screening method, device, equipment and storage medium
CN103309862A (en) Webpage type recognition method and system
CN104361037B (en) Microblogging sorting technique and device
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN109145180B (en) Enterprise hot event mining method based on incremental clustering
CN105528422A (en) Focused crawler processing method and apparatus
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN109299277A (en) The analysis of public opinion method, server and computer readable storage medium
CN109558587A (en) A kind of classification method for the unbalanced public opinion orientation identification of category distribution
CN106294786A (en) A kind of code search method and system
CN109992703A (en) A kind of credibility evaluation method of the differentiation feature mining based on multi-task learning
CN106096055A (en) A kind of info web extracting method based on body thought

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20161207