CN107357801A - Enterprise related webpage theme measuring method and system - Google Patents
Enterprise related webpage theme measuring method and system
- Publication number
- CN107357801A (application number CN201710354041.4A)
- Authority
- CN
- China
- Prior art keywords
- web page
- webpage
- link
- words
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9574—Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention proposes an enterprise-related webpage theme measuring method and system. The method comprises: obtaining sample webpage information, extracting the webpage theme from the webpage information, and counting the characters in the webpage theme; counting the words in the webpage that satisfy the specified conditions; searching the URLs of the friend links (reciprocal links) and checking whether the friend links on each linked page include the domain name of the source webpage; counting the links in the webpage whose URLs do not belong to the source webpage's domain name and the links whose URLs do; counting the images in the webpage; extracting multiple words, in the order in which they appear in the HTML, as a word sequence, and calculating the probability that each word in the sequence also appears in the sample; calculating the above parameters for a given webpage and for the sample webpages, calculating the variance between the given webpage and the sample webpages, and determining the webpage theme. The invention applies the same measurements and score comparison to the webpages fetched by a crawler, classifies them qualitatively, and obtains the webpage theme.
Description
Technical field
The present invention relates to the technical field of computer networks, and more particularly to an enterprise-related webpage theme measuring method and system.
Background
Existing general-purpose enterprise information websites mostly just list enterprise information, and mainly collect and analyze information about a single enterprise. The shortcoming of the prior art is that it lacks a way of analyzing the correlation between enterprises. How to use the basic information of each enterprise so that a computer can automatically determine the theme of that enterprise is the technical problem that currently needs to be solved.
Summary of the invention
The present invention aims to solve at least one of the technical defects described above.
To this end, an object of the present invention is to propose an enterprise-related webpage theme measuring method and system.
To achieve this object, an embodiment of the present invention provides an enterprise-related webpage theme measuring method, comprising the following steps:
Step S1, obtain sample webpage information, extract the webpage theme from the webpage information, and count the number of characters P1 in the webpage theme;
Step S2, count the number P2 of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word;
Step S3, search the URLs of the friend links (reciprocal links), check whether the friend links on the page at each linked URL include the domain name of the source webpage, and count the friend links that link back, P3;
Step S4, count the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, and the links whose URLs do belong to it, P5;
Step S5, count the number of images P6 in the webpage;
Step S6, extract the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words; take multiple such words, in the order in which they appear in the HTML, as a word sequence; and calculate the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages, wherein the four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm;
Step S7, calculate the above parameters P1 to P7 for the given webpage and for the sample webpages, and calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme.
Further, the webpage information includes: the webpage title, webpage menu, friend links, internal and external links, image count, and menu text.
Further, in step S7, the F-test is used to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages.
Further, in step S7, different weights are set for P1 to P7 and tuned.
An embodiment of the present invention also proposes an enterprise-related webpage theme measuring system, comprising:
a webpage acquisition module, for obtaining sample webpage information, extracting the webpage theme from the webpage information, and counting the number of characters P1 in the webpage theme;
a word count calculation module, for counting the number P2 of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word;
a friend link search module, for searching the URLs of the friend links, checking whether the friend links on the page at each linked URL include the domain name of the source webpage, and counting the friend links that link back, P3;
a quantity calculation module, for counting the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, the links whose URLs do belong to it, P5, and the number of images P6 in the webpage;
a probability statistics module, for extracting the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words, taking multiple such words, in the order in which they appear in the HTML, as a word sequence, and calculating the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages, wherein the four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm;
a webpage theme determining module, for calculating the above parameters P1 to P7 for the given webpage and for the sample webpages, and calculating the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme.
Further, the webpage information includes: the webpage title, webpage menu, friend links, internal and external links, image count, and menu text.
Further, the webpage theme determining module uses the F-test to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages.
Further, the webpage theme determining module sets different weights for P1 to P7 and tunes them.
According to the enterprise-related webpage theme measuring method and system of the embodiments of the present invention, a score is calculated for each of several indicators, such as the webpage title, webpage menu, friend links, internal and external links, image count, and menu text. A certain number of webpages are first collected as samples so that the computer can calculate the mean score of each indicator for these webpages; the same measurements and score comparison are then applied to the webpages fetched by a crawler, so as to classify them qualitatively and obtain the webpage theme. By processing the extracted words into four-character words, the present invention can raise measurement accuracy from 60% to about 85%, and substantially improves processing efficiency.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of the enterprise-related webpage theme measuring method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the enterprise-related webpage theme measuring system according to an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and should not be construed as limiting it.
The present invention proposes an enterprise-related webpage theme measuring method and system which analyze the information of a given webpage to judge whether the given webpage belongs to a preset theme category.
As shown in Fig. 1, the enterprise-related webpage theme measuring method of the embodiment of the present invention comprises the following steps:
Step S1, obtain sample webpage information, extract the webpage theme from the webpage information, and count the number of characters P1 in the webpage theme.
In one embodiment of the present invention, the webpage information includes: the webpage title, webpage menu, friend links (reciprocal links), internal and external links, image count, and menu text.
It should be noted that the above is only an example of the types of webpage information and does not limit the present invention. The webpage information in the present invention may also include other content, which is not described further here.
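As a rough illustration of step S1, the sketch below computes P1 under the assumption that the webpage theme is taken from the text of the `<title>` element. The source does not say how the theme is extracted, so this choice, and the names `TitleParser` and `theme_length`, are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def theme_length(html: str) -> int:
    """Return P1, assuming the theme is the whitespace-stripped title text."""
    parser = TitleParser()
    parser.feed(html)
    return len(parser.title.strip())
```

Counting with `len` gives the number of characters, which matches a character count for Chinese titles.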
Step S2, count the number P2 of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
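One possible reading of the three conditions in step S2 is sketched below: a word counts if it is the entire text of an `<a href=...>` element and consists of exactly four Chinese characters. The Unicode range used for "Chinese character" and the names `MenuWordParser` and `count_p2` are assumptions made for illustration, not details given in the source.

```python
import re
from html.parser import HTMLParser

# Assumption: a "four-character word" is exactly four CJK unified ideographs.
FOUR_CHARS = re.compile(r"^[\u4e00-\u9fff]{4}$")


class MenuWordParser(HTMLParser):
    """Collects link texts that are exactly four Chinese characters long."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that actually carry an href count as hyperlinks.
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._buffer).strip()
            if FOUR_CHARS.match(text):
                self.words.append(text)
            self._in_link = False


def four_char_link_words(html: str) -> list:
    """Return the qualifying words in document order."""
    parser = MenuWordParser()
    parser.feed(html)
    return parser.words


def count_p2(html: str) -> int:
    """Return P2, the number of qualifying words."""
    return len(four_char_link_words(html))
```

Plain text that is not inside a hyperlink, and link texts of other lengths, are ignored by this sketch.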
Step S3, search the URLs of the friend links, check whether the friend links on the page at each linked URL include the domain name of the source webpage, and count the friend links that link back, P3.
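The link-back check in step S3 could be sketched as follows. The sketch assumes that fetching and parsing a friend page is delegated to a caller-supplied function, here called `fetch_links` (a hypothetical name), which returns the URLs linked from that page; a domain match is taken to mean an exact host-name match, which is also an assumption.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host name of a URL ('' if it has none)."""
    return (urlparse(url).hostname or "").lower()


def count_reciprocal_links(
    source_url: str,
    friend_links: Iterable[str],
    fetch_links: Callable[[str], List[str]],
) -> int:
    """Return P3: how many friend pages link back to the source domain."""
    source_domain = domain_of(source_url)
    count = 0
    for friend_url in friend_links:
        # fetch_links is supplied by the caller (for example, a crawler).
        outgoing = fetch_links(friend_url)
        if any(domain_of(link) == source_domain for link in outgoing):
            count += 1
    return count
```

Passing the fetcher in as a parameter keeps the counting logic testable without network access.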
Step S4, count the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, and the links whose URLs do belong to it, P5.
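A minimal sketch of the external and internal link counts in step S4 is shown below. It assumes that relative links are resolved against the page URL, that only `http` and `https` links are counted, and that "own domain" means an exact host-name match; none of these details is stated in the source, and the function name is hypothetical.

```python
from typing import Iterable, Tuple
from urllib.parse import urljoin, urlparse


def count_external_internal(page_url: str, hrefs: Iterable[str]) -> Tuple[int, int]:
    """Return (P4, P5): counts of external and internal links on a page."""
    own_host = (urlparse(page_url).hostname or "").lower()
    external = internal = 0
    for href in hrefs:
        # Resolve relative links such as "/about" against the page URL.
        parsed = urlparse(urljoin(page_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        host = (parsed.hostname or "").lower()
        if host == own_host:
            internal += 1
        else:
            external += 1
    return external, internal
```

Under the exact-match assumption a subdomain such as `shop.a.example` would count as external to `a.example`; a real system might compare registered domains instead.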
Step S5, count the number of images P6 in the webpage.
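For step S5, one simple approach is to count `<img>` elements, as in the sketch below. Treating only `<img>` tags as images (and ignoring, for example, CSS background images) is an assumption, and the names are hypothetical.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> elements, including self-closing ones."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes <img ... /> here via handle_startendtag.
        if tag == "img":
            self.count += 1


def count_images(html: str) -> int:
    """Return P6, assuming each <img> element is one image."""
    counter = ImageCounter()
    counter.feed(html)
    return counter.count
```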
Step S6, extract the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words; take multiple such words, in the order in which they appear in the HTML, as a word sequence; and calculate the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages. The four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the present invention, the number of words to extract can be determined by optimization in engineering practice.
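The source does not give a formula for P7. One plausible interpretation, sketched below, is the fraction of words in the page's word sequence that also occur in the sample menu vocabulary. This interpretation and the function name `menu_overlap_probability` are assumptions.

```python
from typing import Sequence


def menu_overlap_probability(
    page_words: Sequence[str], sample_menu_words: Sequence[str]
) -> float:
    """Return P7 as the share of page words that also appear in the sample menus."""
    if not page_words:
        return 0.0
    sample_vocabulary = set(sample_menu_words)
    hits = sum(1 for word in page_words if word in sample_vocabulary)
    return hits / len(page_words)
```

For example, if two of four extracted words appear in the sample menu vocabulary, this sketch yields 0.5.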
Step S7, calculate the above parameters P1 to P7 for the given webpage and for the sample webpages, and calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme. The webpage theme may be a type such as online media website, industry portal, enterprise official website, or e-commerce website. It should be noted that the types of webpage theme are not limited to the above and may be other types, which are not described further here.
In one embodiment of the present invention, in practical engineering calculation, the F-test is used to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages. It should be noted that different weights are set for P1 to P7 and tuned.
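The comparison in step S7 can be illustrated with a simplified stand-in. The sketch below does not implement the F-test mentioned in the text; instead it uses a weighted mean squared deviation between the given page's P1 to P7 and the per-theme sample means, and picks the theme with the smallest deviation. The function names, the weights, and the numbers in the usage note are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def mean_vector(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average each parameter (P1..P7) over the sample pages."""
    return [mean(column) for column in zip(*samples)]


def weighted_deviation(
    page: Sequence[float], sample_mean: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted mean squared deviation of a page from the sample means."""
    total_weight = sum(weights)
    squared = sum(w * (p - m) ** 2 for p, m, w in zip(page, sample_mean, weights))
    return squared / total_weight


def classify(
    page: Sequence[float],
    samples_by_theme: Dict[str, Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> str:
    """Return the theme whose sample means are closest to the page."""
    scores = {
        theme: weighted_deviation(page, mean_vector(samples), weights)
        for theme, samples in samples_by_theme.items()
    }
    return min(scores, key=scores.get)
```

With made-up sample vectors for two themes, a call such as `classify(page, samples_by_theme, [1.0] * 7)` returns the label of the closer theme. Adjusting the weights corresponds to the tuning of P1 to P7 described above.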
As shown in Fig. 2, the enterprise-related webpage theme measuring system of the embodiment of the present invention comprises: a webpage acquisition module 1, a word count calculation module 2, a friend link search module 3, a quantity calculation module 4, a probability statistics module 5, and a webpage theme determining module 6.
Specifically, the webpage acquisition module 1 is used to obtain sample webpage information, extract the webpage theme from the webpage information, and count the number of characters P1 in the webpage theme.
In one embodiment of the present invention, the webpage information includes: the webpage title, webpage menu, friend links (reciprocal links), internal and external links, image count, and menu text.
It should be noted that the above is only an example of the types of webpage information and does not limit the present invention. The webpage information in the present invention may also include other content, which is not described further here.
The word count calculation module 2 is used to count the number P2 of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
The friend link search module 3 is used to search the URLs of the friend links, check whether the friend links on the page at each linked URL include the domain name of the source webpage, and count the friend links that link back, P3.
The quantity calculation module 4 is used to count the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, the links whose URLs do belong to it, P5, and the number of images P6 in the webpage.
The probability statistics module 5 is used to extract the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words; take multiple such words, in the order in which they appear in the HTML, as a word sequence; and calculate the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages. The four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the present invention, the number of words to extract can be determined by optimization in engineering practice.
The webpage theme determining module 6 is used to calculate the above parameters P1 to P7 for the given webpage and for the sample webpages, and to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme.
In one embodiment of the present invention, the webpage theme determining module 6 uses the F-test to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages. It should be noted that the webpage theme determining module 6 sets different weights for P1 to P7 and tunes them.
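To show how the outputs of modules 1 to 5 might be handed to module 6, the sketch below groups the seven parameters into one record. The class name `PageFeatures` and its field names are hypothetical; the source only names the parameters P1 to P7.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass(frozen=True)
class PageFeatures:
    """The seven measurements produced for one webpage (P1..P7)."""

    p1_theme_length: int       # characters in the webpage theme
    p2_menu_word_count: int    # four-character hyperlinked words
    p3_reciprocal_links: int   # friend links that link back
    p4_external_links: int     # links outside the page's own domain
    p5_internal_links: int     # links inside the page's own domain
    p6_image_count: int        # images in the webpage
    p7_menu_overlap: float     # probability of menu-word overlap with samples

    def as_vector(self) -> List[float]:
        """Return the parameters in P1..P7 order for numeric comparison."""
        return [float(value) for value in astuple(self)]
```

A fixed field order keeps the vectors of the given webpage and the sample webpages aligned when they are compared.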
According to the enterprise-related webpage theme measuring method and system of the embodiments of the present invention, a score is calculated for each of several indicators, such as the webpage title, webpage menu, friend links, internal and external links, image count, and menu text. A certain number of webpages are first collected as samples so that the computer can calculate the mean score of each indicator for these webpages; the same measurements and score comparison are then applied to the webpages fetched by a crawler, so as to classify them qualitatively and obtain the webpage theme. By processing the extracted words into four-character words, the present invention can raise measurement accuracy from 60% to about 85%, and substantially improves processing efficiency.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and cannot be construed as limiting the present invention. A person of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention without departing from its principle and purpose. The scope of the present invention is defined by the appended claims and their equivalents.
Claims (8)
1. An enterprise-related webpage theme measuring method, characterized by comprising the following steps:
Step S1, obtain sample enterprise webpage information, extract the webpage theme from the webpage information, and count the number of characters P1 in the webpage theme;
Step S2, count the number of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word;
Step S3, search the URLs of the friend links, check whether the friend links on the page at each linked URL include the domain name of the source webpage, and count the friend links that link back, P3;
Step S4, count the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, and the links whose URLs do belong to it, P5;
Step S5, count the number of images P6 in the webpage;
Step S6, extract the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words; take multiple such words, in the order in which they appear in the HTML, as a word sequence; and calculate the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages, wherein the four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm;
Step S7, calculate the above parameters P1 to P7 for the given webpage and for the sample webpages, and calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme.
2. The enterprise-related webpage theme measuring method according to claim 1, characterized in that the webpage information includes: the webpage title, webpage menu, friend links, internal and external links, image count, and menu text.
3. The enterprise-related webpage theme measuring method according to claim 1, characterized in that, in step S7, the F-test is used to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages.
4. The enterprise-related webpage theme measuring method according to claim 1, characterized in that, in step S7, different weights are set for P1 to P7 and tuned.
5. An enterprise-related webpage theme measuring system, characterized by comprising:
a webpage acquisition module, for obtaining sample webpage information, extracting the webpage theme from the webpage information, and counting the number of characters P1 in the webpage theme;
a word count calculation module, for counting the number of words in the webpage that satisfy all of the following conditions: the word is enclosed on its own by an HTML tag, it carries a hyperlink, and it is a four-character word;
a friend link search module, for searching the URLs of the friend links, checking whether the friend links on the page at each linked URL include the domain name of the source webpage, and counting the friend links that link back, P3;
a quantity calculation module, for counting the links in the webpage whose URLs do not belong to the source webpage's own domain name, P4, the links whose URLs do belong to it, P5, and the number of images P6 in the webpage;
a probability statistics module, for extracting the words in the webpage that are enclosed on their own by an HTML tag, carry a hyperlink, and are four-character words, taking multiple such words, in the order in which they appear in the HTML, as a word sequence, and calculating the probability P7 that each word in the word sequence also appears in the menu word sequence of the sample webpages, wherein the four-character words are formed by piecing together the words extracted from the webpage according to phonetic rhythm;
a webpage theme determining module, for calculating the above parameters P1 to P7 for the given webpage and for the sample webpages, and calculating the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages, so as to obtain the similarity between the given webpage and the sample webpages and determine the webpage theme.
6. The enterprise-related webpage theme measuring system according to claim 5, characterized in that the webpage information includes: the webpage title, webpage menu, friend links, internal and external links, image count, and menu text.
7. The enterprise-related webpage theme measuring system according to claim 5, characterized in that the webpage theme determining module uses the F-test to calculate the variance between the P1 to P7 of the given webpage and the P1 to P7 of the sample webpages.
8. The enterprise-related webpage theme measuring system according to claim 5, characterized in that the webpage theme determining module sets different weights for P1 to P7 and tunes them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710354041.4A CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710354041.4A CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357801A true CN107357801A (en) | 2017-11-17 |
CN107357801B CN107357801B (en) | 2021-05-28 |
Family
ID=60271916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710354041.4A Active CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357801B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090177628A1 (en) * | 2003-06-27 | 2009-07-09 | Hiroyuki Yanagisawa | System, apparatus, and method for providing illegal use research service for image data, and system, apparatus, and method for providing proper use research service for image data |
CN102117339A (en) * | 2011-03-30 | 2011-07-06 | 曹晓晶 | Filter supervision method specific to unsecure web page texts |
CN103838792A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Method for determining webpage theme |
CN104331449A (en) * | 2014-10-29 | 2015-02-04 | 百度在线网络技术(北京)有限公司 | Method and device for determining similarity between inquiry sentence and webpage, terminal and server |
CN105589892A (en) * | 2014-11-12 | 2016-05-18 | 中国银联股份有限公司 | Webpage theme analysis method based on anchor text backtracking chain |
- 2017-05-18: CN application CN201710354041.4A, patent CN107357801B, status Active
Non-Patent Citations (1)
Title |
---|
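Peng Min et al.: "Clustering and topic extraction of massive short texts based on frequent itemsets", Journal of Computer Research and Development (《计算机研究与发展》) * |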
Also Published As
Publication number | Publication date |
---|---|
CN107357801B (en) | 2021-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9092789B2 (en) | Method and system for semantic analysis of unstructured data | |
CN107135092B (en) | A kind of Web service clustering method towards global social interaction server net | |
CN110602045B (en) | Malicious webpage identification method based on feature fusion and machine learning | |
US20110302486A1 (en) | Method and apparatus for obtaining the effective contents of web page | |
CN102693279B (en) | Method, device and system for fast calculating comment similarity | |
US20120259859A1 (en) | Method for recommending best information in real time by appropriately obtaining gist of web page and user's preference | |
CN102890702A (en) | Internet forum-oriented opinion leader mining method | |
CN103210387B (en) | Conjunctive word calling mechanism, information processor, conjunctive word register method and conjunctive word register system | |
CN102054015A (en) | System and method of organizing community intelligent information by using organic matter data model | |
CN106095979A (en) | URL merging treatment method and apparatus | |
CN104331438B (en) | To novel web page contents selectivity abstracting method and device | |
CN103838862B (en) | Video searching method, device and terminal | |
CN102654861B (en) | Webpage extraction accuracy computational methods and system | |
CN108694325A (en) | The condition discriminating apparatus of the discriminating conduct and specified type website of specified type website | |
US20090204889A1 (en) | Adaptive sampling of web pages for extraction | |
CN104537080B (en) | Information recommends method and system | |
CN103605744B (en) | The analysis method and device of site search engine data on flows | |
KR101532252B1 (en) | The system for collecting and analyzing of information of social network | |
CN104156458B (en) | The extracting method and device of a kind of information | |
CN106202312A (en) | A kind of interest point search method for mobile Internet and system | |
CN108052507A (en) | A kind of city management information the analysis of public opinion system and method | |
CN107357801A (en) | A kind of enterprise's related web page theme measuring method and system | |
CN109582846A (en) | Method, apparatus, electronic equipment and the storage medium scanned for by article | |
JP5180894B2 (en) | Attribute expression acquisition method, apparatus and program | |
Othman et al. | Customer opinion summarization based on twitter conversations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||