CN107357801A - Enterprise related webpage theme measuring method and system - Google Patents

Enterprise related webpage theme measuring method and system

Info

Publication number
CN107357801A
Authority
CN
China
Prior art keywords
web page
webpage
link
words
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710354041.4A
Other languages
Chinese (zh)
Other versions
CN107357801B (en)
Inventor
辛柯俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710354041.4A
Publication of CN107357801A
Application granted
Publication of CN107357801B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention proposes an enterprise related webpage theme measuring method and system, including: obtaining sample webpage information, extracting the webpage theme from the webpage information, and calculating the character count of the webpage theme; calculating the number of words in the webpage that meet preset conditions; searching the URLs of the friendly links and checking whether the friendly links of each linked webpage include the domain name of the source webpage; calculating the number of links in the webpage whose URLs do not belong to the domain name of the source webpage and the number of links whose URLs do; calculating the number of pictures in the webpage; taking multiple words, in the order in which they appear in the HTML, as a vocabulary sequence, and calculating the probability that each word in the vocabulary sequence also appears in the sample webpage; and calculating the above parameters for a given webpage and the sample webpage, calculating the variance between the given webpage and the sample webpage, and determining the webpage theme. The present invention applies the same measurement calculation and score comparison to webpages fetched by a crawler, classifies them qualitatively, and obtains the webpage theme.

Description

Enterprise related webpage theme measuring method and system
Technical field
The present invention relates to the technical field of computer networks, and in particular to an enterprise related webpage theme measuring method and system.
Background
Existing general-purpose enterprise information websites mostly just list enterprise information, and mainly collect and analyze the information of a single enterprise. The shortcoming of the prior art is that it lacks a way to analyze the correlation between enterprises. In particular, how to use the basic information of each enterprise so that a computer can automatically determine the theme of that enterprise is a technical problem that currently needs to be solved.
Summary of the invention
The present invention aims to solve at least one of the technical deficiencies described above.
To this end, an object of the present invention is to propose an enterprise related webpage theme measuring method and system.
To achieve the above object, embodiments of the present invention provide an enterprise related webpage theme measuring method, comprising the following steps:
Step S1: obtain sample webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme;
Step S2: calculate the number of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word;
Step S3: search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back;
Step S4: calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, and the number P5 of links whose URLs belong to the domain name of the source webpage;
Step S5: calculate the number of pictures P6 in the webpage;
Step S6: extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
Step S7: calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
Further, in step S7, the F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, in step S7, different weights are set for P1 to P7 and tuned.
Embodiments of the present invention also propose an enterprise related webpage theme measuring system, comprising:
a webpage acquisition module, configured to obtain sample webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme;
a word count calculation module, configured to calculate the number of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word;
a friendly link search module, configured to search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back;
a quantity calculation module, configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs belong to the domain name of the source webpage, and the number of pictures P6 in the webpage;
a probability statistics module, configured to extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
a webpage theme determination module, configured to calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
Further, the webpage theme determination module uses the F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, the webpage theme determination module sets different weights and tunes them.
According to the enterprise related webpage theme measuring method and system of the embodiments of the present invention, scores are calculated separately for several indicators, such as webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text. A certain number of webpages are first collected as samples so that the computer can calculate the average score of each indicator for these webpages; the same measurement calculation and score comparison are then applied to webpages fetched by a crawler, so as to classify them qualitatively and obtain the webpage theme. By processing the extracted words into four-character words, the present invention can raise measurement accuracy from about 60% to about 85%, which greatly improves processing efficiency.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of the enterprise related webpage theme measuring method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the enterprise related webpage theme measuring system according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The present invention proposes an enterprise related webpage theme measuring method and system that analyzes the information of a given webpage to judge whether the webpage belongs to a preset theme category.
As shown in Fig. 1, the enterprise related webpage theme measuring method of the embodiment of the present invention comprises the following steps:
Step S1: obtain sample webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme.
In one embodiment of the present invention, the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
It should be noted that the above is only an example of the types of webpage information and is not intended to limit the present invention. The webpage information in the present invention may also include other content, which is not described in detail here.
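Step S1 can be illustrated with a minimal sketch. It assumes that the theme text is taken from the page's `<title>` element and that P1 is the count of non-whitespace characters; both are assumptions made for illustration, and the names `TitleParser` and `title_length` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (assumed theme source)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_length(html: str) -> int:
    """P1 (assumed reading): non-whitespace characters in the page title."""
    parser = TitleParser()
    parser.feed(html)
    return sum(1 for ch in parser.title if not ch.isspace())
```

For a page whose title is "某某科技 有限公司", this sketch counts eight characters.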
Step S2: calculate the number P2 of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
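One possible reading of step S2 is sketched below. It assumes that a qualifying word is the complete text of an `<a>` element with a non-empty `href`, and that "four-character word" means exactly four characters in the basic CJK range U+4E00 to U+9FFF. These conditions, and the names `MenuWordParser` and `count_menu_words`, are illustrative assumptions.

```python
from html.parser import HTMLParser


def is_four_char_word(text: str) -> bool:
    """True if text is exactly four CJK ideographs (assumed definition)."""
    return len(text) == 4 and all("\u4e00" <= ch <= "\u9fff" for ch in text)


class MenuWordParser(HTMLParser):
    """Collects four-character words that form the whole text of a hyperlink."""

    def __init__(self):
        super().__init__()
        self._buffer = None  # None while outside an <a href=...> element
        self.words = []      # qualifying words, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if is_four_char_word(text):
                self.words.append(text)
            self._buffer = None


def count_menu_words(html: str) -> int:
    """P2 (assumed reading): number of hyperlinked four-character words."""
    parser = MenuWordParser()
    parser.feed(html)
    return len(parser.words)
```

Because `words` keeps document order, the same parser could also supply the vocabulary sequence used later in step S6.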
Step S3: search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back.
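The link-back check in step S3 might be sketched as follows. The sketch assumes that the friendly link URLs have already been identified, and it takes a caller-supplied `fetch` function (hypothetical; any callable mapping a URL to its HTML text) so that no particular download mechanism is implied. Comparing exact host names is also an assumption; a real system might compare registered domains instead.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable
from urllib.parse import urlparse


class HrefParser(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_back(page_html: str, source_domain: str) -> bool:
    """True if any link on the page points at the source domain."""
    parser = HrefParser()
    parser.feed(page_html)
    return any(urlparse(h).hostname == source_domain for h in parser.hrefs)


def count_reciprocal_links(
    friendly_urls: Iterable[str],
    source_domain: str,
    fetch: Callable[[str], str],  # hypothetical downloader supplied by caller
) -> int:
    """P3 (assumed reading): friendly links whose target page links back."""
    return sum(
        1 for url in friendly_urls if links_back(fetch(url), source_domain)
    )
```

Passing `fetch` as a parameter keeps the counting logic testable with canned pages, for example a dictionary lookup in place of a network request.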
Step S4: calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, and the number P5 of links whose URLs belong to the domain name of the source webpage.
Step S5: calculate the number of pictures P6 in the webpage.
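Steps S4 and S5 can be sketched together with one pass over the HTML. The sketch assumes that relative links count as internal, that links without a host (such as `mailto:`) are skipped, and that pictures are counted as `<img>` elements; the name `count_links_and_images` is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkImageParser(HTMLParser):
    """Collects link targets and counts <img> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "img":
            self.image_count += 1


def count_links_and_images(html: str, page_url: str):
    """Return (P4, P5, P6): external links, internal links, pictures."""
    source_host = urlparse(page_url).hostname
    parser = LinkImageParser()
    parser.feed(html)
    external = internal = 0
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(page_url, href)).hostname
        if host is None:
            continue  # no host, e.g. mailto: or javascript: links
        if host == source_host:
            internal += 1
        else:
            external += 1
    return external, internal, parser.image_count
```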
Step S6: extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the present invention, the number of words extracted can be determined through optimization in engineering practice.
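The text does not give a formula for P7. One simple reading, sketched here as an assumption, treats P7 as the fraction of words in the page's vocabulary sequence that also occur in the sample webpage's menu vocabulary. The function name `menu_overlap_probability` is hypothetical, and the assembly of four-character words by phonetic rhythm is not modeled.

```python
from typing import Sequence


def menu_overlap_probability(
    page_words: Sequence[str], sample_menu_words: Sequence[str]
) -> float:
    """P7 (assumed reading): share of page words found in the sample menu."""
    if not page_words:
        return 0.0
    sample = set(sample_menu_words)
    hits = sum(1 for word in page_words if word in sample)
    return hits / len(page_words)
```

With page words 关于我们, 产品中心, 联系我们, 人才招聘 and a sample menu containing the first three, this reading gives 3/4 = 0.75.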
Step S7: calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme. The webpage theme may be a type such as an online media site, an industry portal, an official enterprise website, or an e-commerce website. It should be noted that the types of webpage theme are not limited to the above and may be other types, which are not described in detail here.
In one embodiment of the present invention, in practical engineering calculation, the F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that different weights are set for P1 to P7 and tuned.
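The comparison in step S7 can be approximated by the sketch below. It is not the F-test named in the text: it substitutes a simpler weighted squared deviation of the page's P1 to P7 from the per-parameter sample means, with each term scaled by the sample variance, and picks the theme whose samples give the smallest value. The scaling, the weights, and the names `weighted_deviation` and `classify` are all assumptions.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def weighted_deviation(
    page: Sequence[float],
    samples: List[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted mean of squared deviations from the sample means.

    Each squared deviation is divided by that parameter's sample variance
    so that parameters on different scales stay comparable.
    """
    total = 0.0
    for i, weight in enumerate(weights):
        column = [sample[i] for sample in samples]
        mu = mean(column)
        var = pvariance(column, mu)
        scale = var if var > 0 else 1.0  # avoid division by zero
        total += weight * (page[i] - mu) ** 2 / scale
    return total / sum(weights)


def classify(
    page: Sequence[float],
    samples_by_theme: Dict[str, List[Sequence[float]]],
    weights: Sequence[float],
) -> str:
    """Return the theme whose sample pages are closest to the given page."""
    return min(
        samples_by_theme,
        key=lambda theme: weighted_deviation(
            page, samples_by_theme[theme], weights
        ),
    )
```

Tuning then amounts to adjusting `weights` until labeled pages are assigned to the expected themes.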
As shown in Fig. 2, the enterprise related webpage theme measuring system of the embodiment of the present invention includes: a webpage acquisition module 1, a word count calculation module 2, a friendly link search module 3, a quantity calculation module 4, a probability statistics module 5, and a webpage theme determination module 6.
Specifically, the webpage acquisition module 1 is configured to obtain sample webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme.
In one embodiment of the present invention, the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
It should be noted that the above is only an example of the types of webpage information and is not intended to limit the present invention. The webpage information in the present invention may also include other content, which is not described in detail here.
The word count calculation module 2 is configured to calculate the number of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
The friendly link search module 3 is configured to search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back.
The quantity calculation module 4 is configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs belong to the domain name of the source webpage, and the number of pictures P6 in the webpage.
The probability statistics module 5 is configured to extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the present invention, the number of words extracted can be determined through optimization in engineering practice.
The webpage theme determination module 6 is configured to calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
In one embodiment of the present invention, the webpage theme determination module 6 uses the F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that the webpage theme determination module 6 sets different weights and tunes them.
According to the enterprise related webpage theme measuring method and system of the embodiments of the present invention, scores are calculated separately for several indicators, such as webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text. A certain number of webpages are first collected as samples so that the computer can calculate the average score of each indicator for these webpages; the same measurement calculation and score comparison are then applied to webpages fetched by a crawler, so as to classify them qualitatively and obtain the webpage theme. By processing the extracted words into four-character words, the present invention can raise measurement accuracy from about 60% to about 85%, which greatly improves processing efficiency.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention without departing from its principle and purpose. The scope of the present invention is defined by the appended claims and their equivalents.

Claims (8)

1. An enterprise related webpage theme measuring method, characterized by comprising the following steps:
Step S1: obtain sample enterprise webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme;
Step S2: calculate the number of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word;
Step S3: search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back;
Step S4: calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, and the number P5 of links whose URLs belong to the domain name of the source webpage;
Step S5: calculate the number of pictures P6 in the webpage;
Step S6: extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
Step S7: calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
2. The enterprise related webpage theme measuring method according to claim 1, characterized in that the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
3. The enterprise related webpage theme measuring method according to claim 1, characterized in that, in step S7, the F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
4. The enterprise related webpage theme measuring method according to claim 1, characterized in that, in step S7, different weights are set for P1 to P7 and tuned.
5. An enterprise related webpage theme measuring system, characterized by comprising:
a webpage acquisition module, configured to obtain sample webpage information, extract the webpage theme from the webpage information, and calculate the character count P1 of the webpage theme;
a word count calculation module, configured to calculate the number of words in the webpage that meet the following conditions: the word is independently enclosed by an HTML tag, has a hyperlink, and is a four-character word;
a friendly link search module, configured to search the URLs of the friendly links, check whether the friendly links of each linked webpage include the domain name of the source webpage, and calculate the number P3 of friendly links that link back;
a quantity calculation module, configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs belong to the domain name of the source webpage, and the number of pictures P6 in the webpage;
a probability statistics module, configured to extract the words in the webpage that are independently enclosed by an HTML tag, have a hyperlink, and are four-character words; take multiple words, in the order in which they appear in the HTML, as a vocabulary sequence; and calculate the probability P7 that each word in the vocabulary sequence also appears in the menu vocabulary sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
a webpage theme determination module, configured to calculate the above parameters P1 to P7 for a given webpage and for the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
6. The enterprise related webpage theme measuring system according to claim 5, characterized in that the webpage information includes: webpage title, webpage menu, friendly links, internal and external links, number of pictures, and menu text.
7. The enterprise related webpage theme measuring system according to claim 5, characterized in that the webpage theme determination module uses the F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
8. The enterprise related webpage theme measuring system according to claim 5, characterized in that the webpage theme determination module sets different weights and tunes them.
CN201710354041.4A 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system Active CN107357801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710354041.4A CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710354041.4A CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Publications (2)

Publication Number Publication Date
CN107357801A (en) 2017-11-17
CN107357801B (en) 2021-05-28

Family

ID=60271916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710354041.4A Active CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Country Status (1)

Country Link
CN (1) CN107357801B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090177628A1 (en) * 2003-06-27 2009-07-09 Hiroyuki Yanagisawa System, apparatus, and method for providing illegal use research service for image data, and system, apparatus, and method for providing proper use research service for image data
CN102117339A (en) * 2011-03-30 2011-07-06 曹晓晶 Filter supervision method specific to unsecure web page texts
CN103838792A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Method for determining webpage theme
CN104331449A (en) * 2014-10-29 2015-02-04 百度在线网络技术(北京)有限公司 Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN105589892A (en) * 2014-11-12 2016-05-18 中国银联股份有限公司 Webpage theme analysis method based on anchor text backtracking chain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
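彭敏等 (Peng Min et al.): "基于频繁项集的海量短文本聚类与主题抽取" (Clustering and topic extraction of massive short texts based on frequent itemsets), 《计算机研究与发展》 (Journal of Computer Research and Development) *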

Also Published As

Publication number Publication date
CN107357801B (en) 2021-05-28

Similar Documents

Publication Publication Date Title
US9092789B2 (en) Method and system for semantic analysis of unstructured data
CN102254038B (en) System and method for analyzing network comment relevance
CN102054015B (en) System and method of organizing community intelligent information by using organic matter data model
CN102054016B (en) For capturing and manage the system and method for community intelligent information
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN110602045B (en) Malicious webpage identification method based on feature fusion and machine learning
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN103210387B (en) Conjunctive word calling mechanism, information processor, conjunctive word register method and conjunctive word register system
CN106095979A (en) URL merging treatment method and apparatus
CN109857956A (en) The automatic abstracting method of news web page key message based on label and blocking characteristic
CN103559234A (en) System and method for automated semantic annotation of RESTful Web services
CN106446113A (en) Mobile big data analysis method and device
CN102654861B (en) Webpage extraction accuracy computational methods and system
CN110134845A (en) Project public sentiment monitoring method, device, computer equipment and storage medium
US20090204889A1 (en) Adaptive sampling of web pages for extraction
CN104537080B (en) Information recommends method and system
CN103605744B (en) The analysis method and device of site search engine data on flows
KR101532252B1 (en) The system for collecting and analyzing of information of social network
CN104156458B (en) The extracting method and device of a kind of information
CN108694325A (en) The condition discriminating apparatus of the discriminating conduct and specified type website of specified type website
CN106970962A (en) A kind of method and apparatus for obtaining search engine search results
CN106202312A (en) A kind of interest point search method for mobile Internet and system
CN108052507A (en) A kind of city management information the analysis of public opinion system and method
CN107357801A (en) A kind of enterprise's related web page theme measuring method and system
Henriques et al. Scraping news sites and social networks for prejudice term analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant