CN107357801B - Enterprise related webpage theme measuring method and system - Google Patents

Info

Publication number
CN107357801B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710354041.4A
Other languages
Chinese (zh)
Other versions
CN107357801A (en)
Inventor
辛柯俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a system for measuring the theme of enterprise-related webpages. The method comprises the following steps: acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count of the webpage theme; calculating the number of words in the webpage that satisfy specified conditions; looking up the URL of each friend link and checking whether the friend links of the webpage at that URL contain the domain name of the source webpage; calculating the number of links in the webpage whose URLs do not belong to the domain name of the source webpage and the number of links whose URLs do belong to it; calculating the number of pictures in the webpage; extracting a number of words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability that each word in the word list sequence also appears in the sample's word list sequence; and calculating these parameters for both the given webpage and the sample webpage, calculating the variance between the two, and determining the webpage theme. The method applies the same measurements and score comparison to webpages fetched by a crawler and classifies them to obtain their webpage themes.

Description

Enterprise related webpage theme measuring method and system
Technical Field
The invention relates to the technical field of computer networks, in particular to a method and a system for measuring enterprise related webpage topics.
Background
The existing enterprise information comprehensive websites are mostly simple lists of enterprise information and mainly aim at information summarization and analysis of a single enterprise. The prior art has the disadvantage of lacking a way to analyze the interrelationships between enterprises. How to determine the theme of each enterprise automatically by a computer through basic information of the enterprise is a technical problem to be solved at present.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned.
Therefore, the invention aims to provide a method and a system for measuring the theme of the enterprise-related webpage.
In order to achieve the above object, an embodiment of the present invention provides a method for measuring the theme of an enterprise-related webpage, including the following steps:
step S1, acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme;
step S2, calculating the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
step S3, looking up the URL of each friend link, checking whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
step S4, calculating the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage and the number P5 of links whose URLs do belong to it;
step S5, calculating the number P6 of pictures in the webpage;
step S6, extracting the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, taking a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
Further, in step S7, an F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, in step S7, different weights are set for P1 to P7 and adjusted through tuning.
The embodiment of the present invention further provides an enterprise-related webpage theme measurement system, which includes:
a webpage acquisition module, configured to acquire sample webpage information, extract the webpage theme from the webpage information, and calculate the word count P1 of the webpage theme;
a vocabulary quantity calculation module, configured to calculate the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
a friend link search module, configured to look up the URL of each friend link, check whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculate the number P3 of friend links that link back;
a quantity calculation module, configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs do belong to it, and the number P6 of pictures in the webpage;
a probability statistics module, configured to extract the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, take a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculate the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
and a webpage theme determination module, configured to calculate the parameters P1 to P7 for both the given webpage and the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
Further, the webpage theme determination module uses an F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, the webpage theme determination module sets different weights for P1 to P7 and adjusts them through tuning.
According to the method and the system for measuring enterprise-related webpage themes of the embodiments of the invention, scores are calculated separately for several indexes covering the webpage title, webpage menu, friend links, internal and external links, number of pictures, menu characters, and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurements and score comparison are then applied to webpages fetched by a crawler, which are thereby classified to obtain their webpage themes. Because the extracted words are assembled into four-character words for processing, the measurement precision can be improved from 60% to about 85%, and the processing efficiency is greatly improved.
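The averaging of sample index scores described above can be sketched as follows. This is a minimal illustration, assuming each sample webpage has already been reduced to a list of index scores in a fixed order; the function name `average_profile` and the example values are hypothetical and are not taken from the patent.

```python
from statistics import mean


def average_profile(samples):
    """Average each index column-wise across the sample webpages.

    `samples` is a list of equal-length score lists, one list per sample
    webpage (hypothetical layout: [P1, P2, ..., P7]).
    """
    if not samples:
        raise ValueError("at least one sample webpage is required")
    # zip(*samples) regroups the scores by index instead of by webpage.
    return [mean(column) for column in zip(*samples)]


if __name__ == "__main__":
    sample_scores = [
        [6, 8, 2, 10, 30, 12, 0.75],
        [8, 6, 4, 14, 26, 8, 0.25],
    ]
    print(average_profile(sample_scores))
```

The resulting list can serve as the reference profile against which a crawled webpage is compared.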
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for measuring a topic of an enterprise-related web page according to an embodiment of the invention;
FIG. 2 is a block diagram of a system for measuring a theme of an enterprise-related web page according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The invention provides a method and a system for measuring enterprise-related webpage themes, which determine whether a given webpage belongs to a preset theme category by analyzing the information of the given webpage.
As shown in FIG. 1, the method for measuring the theme of the enterprise-related web page in the embodiment of the present invention includes the following steps:
Step S1, acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme.
In one embodiment of the invention, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
It should be noted that the above is only an example of the type of the web page information, and is not intended to limit the present invention. The web page information in the present invention may also include other contents, which are not described herein again.
Step S2, calculating the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
Step S3, looking up the URL of each friend link, checking whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculating the number P3 of friend links that link back.
Step S4, calculating the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage and the number P5 of links whose URLs do belong to it.
Step S5, calculating the number P6 of pictures in the webpage.
Step S6, extracting the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, taking a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the invention, the number of extracted words may be determined through optimization in engineering practice.
Step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme. The webpage theme may be a media website, an industry portal, an enterprise official website, an e-commerce website, and the like. It should be noted that the types of webpage theme are not limited to the above and may also be other types, which are not described herein again.
In one embodiment of the present invention, in actual engineering calculations, an F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that different weights are set for P1 to P7 and adjusted through tuning.
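The text does not spell out how the F-test is applied to P1 to P7. One possible reading, shown here only as a hedged sketch, treats the seven parameters of each webpage as a sample and computes the classic F statistic, the ratio of the two sample variances. The function name `f_statistic` and the example values are hypothetical; in practice the parameters would probably need to be scaled or weighted first, since P7 is a probability while the others are counts.

```python
from statistics import variance


def f_statistic(given, sample):
    """Ratio of the sample variance of `given` to that of `sample`.

    A value close to 1 indicates that the two parameter vectors have a
    similar spread; this is an assumed interpretation, not the patent's
    stated formula.
    """
    var_given = variance(given)
    var_sample = variance(sample)
    if var_sample == 0:
        raise ValueError("sample parameters have zero variance")
    return var_given / var_sample


if __name__ == "__main__":
    given_page = [7, 7, 2, 12, 28, 10, 0.7]    # hypothetical P1..P7
    sample_page = [6, 8, 3, 10, 30, 12, 0.8]   # hypothetical P1..P7
    print(f_statistic(given_page, sample_page))
```

Converting the statistic into a significance level would additionally require the F distribution with the appropriate degrees of freedom, which is outside the scope of this sketch.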
As shown in FIG. 2, the system for measuring the theme of the enterprise-related web page in the embodiment of the present invention includes: a webpage acquisition module 1, a vocabulary quantity calculation module 2, a friend link search module 3, a quantity calculation module 4, a probability statistics module 5, and a webpage theme determination module 6.
Specifically, the web page obtaining module 1 is configured to obtain sample web page information, extract a web page theme from the web page information, and calculate a word number P1 of the web page theme.
In one embodiment of the invention, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
It should be noted that the above is only an example of the type of the web page information, and is not intended to limit the present invention. The web page information in the present invention may also include other contents, which are not described herein again.
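As an illustration of how P1 might be obtained, the sketch below extracts the text of the `<title>` element with Python's standard `html.parser` module and counts its non-whitespace characters. Taking the title as the source of the theme follows claim 1, which refers to the word number of the webpage title; counting characters rather than segmented words is an assumption, and the names `TitleParser` and `title_length` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_length(html):
    """P1 (assumed reading): number of non-whitespace title characters."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # str.split() with no arguments drops every run of whitespace.
    return len("".join(parser.title.split()))


if __name__ == "__main__":
    page = "<html><head><title>某某科技 官网</title></head><body></body></html>"
    print(title_length(page))
```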
The vocabulary quantity calculation module 2 is configured to calculate the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
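The counting of P2 can be sketched as below. The sketch assumes that "individually enclosed by an HTML tag and carrying a hyperlink" means the complete text of an `<a href=...>` element, and that a four-character word is anchor text whose length is exactly four characters. The names `AnchorTextParser` and `count_four_char_links` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects the text of every <a href=...> element in document order."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that actually carry a hyperlink are considered.
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.anchor_texts.append("".join(self._buffer).strip())
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


def count_four_char_links(html):
    """P2 (assumed reading): hyperlinked texts that are four characters long."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return sum(1 for text in parser.anchor_texts if len(text) == 4)


if __name__ == "__main__":
    menu = (
        '<ul><li><a href="/about">关于我们</a></li>'
        '<li><a href="/p">产品</a></li>'
        '<li><a href="/c">联系方式</a></li>'
        "<li><a>新闻中心</a></li></ul>"
    )
    print(count_four_char_links(menu))
```

In the example, the entry without an `href` attribute is ignored because it carries no hyperlink.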
The friend link search module 3 is configured to look up the URL of each friend link, check whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculate the number P3 of friend links that link back.
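A possible way to count the friend links that link back (P3) is sketched below. To keep the example self-contained, it assumes the outgoing links of each friend webpage have already been fetched and are supplied as a dictionary; comparing host names with a leading `www.` removed is also an assumption. The names `domain_of` and `count_reciprocal_links` are hypothetical.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_reciprocal_links(source_url, friend_links, outgoing_links):
    """P3 (assumed reading): friend links whose webpage links back.

    `outgoing_links` maps each friend URL to the list of URLs found on
    that webpage (assumed to have been collected by a crawler beforehand).
    """
    source_domain = domain_of(source_url)
    count = 0
    for friend_url in friend_links:
        targets = outgoing_links.get(friend_url, [])
        if any(domain_of(target) == source_domain for target in targets):
            count += 1
    return count


if __name__ == "__main__":
    friends = ["https://a.example.org/", "https://b.example.net/"]
    crawled = {
        "https://a.example.org/": ["https://example.com/index.html"],
        "https://b.example.net/": ["https://d.example.org/"],
    }
    print(count_reciprocal_links("https://www.example.com/", friends, crawled))
```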
The quantity calculation module 4 is configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs do belong to it, and the number P6 of pictures in the webpage.
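The three counts P4, P5, and P6 can be gathered in a single pass over the HTML, as in the following sketch. It assumes that a link belongs to the source webpage when its resolved host name equals the host name of the webpage URL, that links without a host (such as `mailto:` links) are skipped, and that pictures are `<img>` elements. The names `LinkImageParser` and `link_and_image_counts` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkImageParser(HTMLParser):
    """Collects hyperlink targets and counts <img> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "img":
            self.image_count += 1


def link_and_image_counts(html, page_url):
    """Return (P4 external links, P5 internal links, P6 pictures)."""
    parser = LinkImageParser()
    parser.feed(html)
    parser.close()
    own_host = urlparse(page_url).hostname
    external = internal = 0
    for href in parser.hrefs:
        # Resolve relative links against the webpage's own URL.
        host = urlparse(urljoin(page_url, href)).hostname
        if host is None:
            continue  # e.g. mailto: or javascript: links
        if host == own_host:
            internal += 1
        else:
            external += 1
    return external, internal, parser.image_count


if __name__ == "__main__":
    page = (
        '<a href="/about">A</a>'
        '<a href="https://partner.example.org/">B</a>'
        '<a href="mailto:x@example.com">C</a>'
        '<img src="a.png"><img src="b.png"/>'
    )
    print(link_and_image_counts(page, "https://example.com/index.html"))
```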
The probability statistics module 5 is configured to extract the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, take a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculate the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.
In one embodiment of the invention, the number of extracted words may be determined through optimization in engineering practice.
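The exact definition of P7 is not given in the text. One simple reading, sketched below, takes the first few menu words of the given webpage in document order and reports the fraction of them that also occur in the sample webpage's menu word list. The function name `menu_overlap_probability`, the `limit` parameter, and its default value are hypothetical; the assembly of four-character words by phonetic rhythm is not modelled here.

```python
def menu_overlap_probability(page_words, sample_words, limit=8):
    """P7 (assumed reading): share of the page's first `limit` menu words
    that also appear in the sample webpage's menu word list."""
    sequence = page_words[:limit]
    if not sequence:
        return 0.0
    sample_set = set(sample_words)
    hits = sum(1 for word in sequence if word in sample_set)
    return hits / len(sequence)


if __name__ == "__main__":
    page_menu = ["公司简介", "产品中心", "新闻动态", "联系我们"]
    sample_menu = ["公司简介", "产品中心", "人才招聘", "联系我们"]
    print(menu_overlap_probability(page_menu, sample_menu))
```

The `limit` argument corresponds to the number of extracted words, which the text leaves to be chosen in practice.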
The webpage theme determination module 6 is configured to calculate the parameters P1 to P7 for both the given webpage and the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
In one embodiment of the invention, the webpage theme determination module 6 uses an F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that the webpage theme determination module 6 sets different weights for P1 to P7 and adjusts them through tuning.
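The weighting of P1 to P7 and the final theme decision can be illustrated with the sketch below. It swaps in a plain weighted mean of squared differences as the deviation measure, which is an assumption and not the F-test named in the text, and picks the sample profile with the smallest deviation. The names `weighted_deviation` and `closest_theme`, the profile values, and the weights are hypothetical.

```python
def weighted_deviation(given, sample, weights):
    """Weighted mean of squared differences between two parameter lists."""
    if not (len(given) == len(sample) == len(weights)):
        raise ValueError("parameter and weight lists must have equal length")
    total = sum(weights)
    squared = (w * (g - s) ** 2 for g, s, w in zip(given, sample, weights))
    return sum(squared) / total


def closest_theme(given, profiles, weights):
    """Name of the sample profile that deviates least from `given`."""
    return min(
        profiles,
        key=lambda name: weighted_deviation(given, profiles[name], weights),
    )


if __name__ == "__main__":
    profiles = {
        "enterprise official website": [6, 8, 3, 10, 30, 12, 0.8],
        "media website": [20, 2, 0, 60, 200, 80, 0.1],
    }
    given_page = [7, 7, 2, 12, 28, 10, 0.7]
    equal_weights = [1, 1, 1, 1, 1, 1, 1]
    print(closest_theme(given_page, profiles, equal_weights))
```

Raising the weight of a parameter makes differences in that parameter count more heavily, which is one way the tuning of weights mentioned above could be carried out.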
According to the method and the system for measuring enterprise-related webpage themes of the embodiments of the invention, scores are calculated separately for several indexes covering the webpage title, webpage menu, friend links, internal and external links, number of pictures, menu characters, and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurements and score comparison are then applied to webpages fetched by a crawler, which are thereby classified to obtain their webpage themes. Because the extracted words are assembled into four-character words for processing, the measurement precision can be improved from 60% to about 85%, and the processing efficiency is greatly improved.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A method for measuring the theme of an enterprise-related webpage, characterized by comprising the following steps:
step S1, acquiring sample enterprise webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage title;
step S2, calculating the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
step S3, looking up the URL of each friend link, checking whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
step S4, calculating the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage and the number P5 of links whose URLs do belong to it;
step S5, calculating the number P6 of pictures in the webpage;
step S6, extracting the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, taking a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
step S7, calculating the parameters P1 to P7 for both the given webpage and the sample enterprise webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage, so as to obtain the similarity between the given webpage and the sample enterprise webpage and determine the webpage theme.
2. The method of claim 1, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
3. The method for measuring the theme of an enterprise-related webpage of claim 1, wherein in step S7, an F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage.
4. The method for measuring the theme of an enterprise-related webpage of claim 1, wherein in step S7, different weights are set for P1 to P7 and adjusted through tuning.
5. An enterprise-related web page theme measurement system, comprising:
a webpage acquisition module, configured to acquire sample enterprise webpage information, extract the webpage theme from the webpage information, and calculate the word count P1 of the webpage title;
a vocabulary quantity calculation module, configured to calculate the number P2 of words in the webpage that satisfy the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
a friend link search module, configured to look up the URL of each friend link, check whether the friend links of the webpage at that URL contain the domain name of the source webpage, and calculate the number P3 of friend links that link back;
a quantity calculation module, configured to calculate the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs do belong to it, and the number P6 of pictures in the webpage;
a probability statistics module, configured to extract the words in the webpage that are individually enclosed by HTML tags, carry hyperlinks, and are four-character words, take a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculate the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;
and a webpage theme determination module, configured to calculate the parameters P1 to P7 for both the given webpage and the sample enterprise webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage, so as to obtain the similarity between the given webpage and the sample enterprise webpage and determine the webpage theme.
6. The enterprise-related web page theme measurement system of claim 5, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
7. The enterprise-related web page theme measurement system of claim 5, wherein the webpage theme determination module uses an F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage.
8. The enterprise-related web page theme measurement system of claim 5, wherein the webpage theme determination module sets different weights for P1 to P7 and adjusts them through tuning.
CN201710354041.4A 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system Active CN107357801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710354041.4A CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710354041.4A CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Publications (2)

Publication Number Publication Date
CN107357801A (en) 2017-11-17
CN107357801B (en) 2021-05-28

Family

ID=60271916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710354041.4A Active CN107357801B (en) 2017-05-18 2017-05-18 Enterprise related webpage theme measuring method and system

Country Status (1)

Country Link
CN (1) CN107357801B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117339A (en) * 2011-03-30 2011-07-06 曹晓晶 Filter supervision method specific to unsecure web page texts
CN103838792A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Method for determining webpage theme
CN104331449A (en) * 2014-10-29 2015-02-04 百度在线网络技术(北京)有限公司 Method and device for determining similarity between inquiry sentence and webpage, terminal and server
CN105589892A (en) * 2014-11-12 2016-05-18 中国银联股份有限公司 Webpage theme analysis method based on anchor text backtracking chain

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454061B2 (en) * 2003-06-27 2008-11-18 Ricoh Company, Ltd. System, apparatus, and method for providing illegal use research service for image data, and system, apparatus, and method for providing proper use research service for image data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
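Clustering of massive short texts and topic extraction based on frequent itemsets; Peng Min et al.; Journal of Computer Research and Development (《计算机研究与发展》); 20151231; Vol. 52 (No. 9); pp. 1941-1953 *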

Also Published As

Publication number Publication date
CN107357801A (en) 2017-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant