CN107357801B - Enterprise related webpage theme measuring method and system

Info

Publication number: CN107357801B
Application number: CN201710354041.4A
Authority: CN
Inventors: 辛柯俊
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-05-18
Filing date: 2017-05-18
Publication date: 2021-05-28
Anticipated expiration: 2037-05-18
Also published as: CN107357801A

Abstract

The invention provides a method and a system for measuring the theme of an enterprise-related webpage. The method comprises: acquiring sample webpage information, extracting a webpage theme from the webpage information, and calculating the word count of the webpage theme; counting the words in the webpage that are each individually enclosed by an HTML tag, carry a hyperlink, and are four characters long; looking up the URL of each friend link and checking whether the friend links of the linked webpage contain the domain name of the source webpage; counting the links in the webpage whose URLs do not belong to the domain name of the source webpage and the links whose URLs do belong to it; counting the pictures in the webpage; extracting a number of words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability that each word in the sequence also appears in the menu word list sequence of the sample webpage; and calculating these parameters for both the given webpage and the sample webpage, calculating the variance between the two, and determining the webpage theme. The same measurements and score comparison are applied to webpages fetched by a crawler, which are then classified to obtain their webpage themes.

Description

Enterprise related webpage theme measuring method and system

Technical Field

The invention relates to the technical field of computer networks, in particular to a method and a system for measuring enterprise related webpage topics.

Background

Existing comprehensive enterprise information websites are mostly simple listings of enterprise information, aimed mainly at summarizing and analyzing information about a single enterprise. The prior art therefore lacks a way to analyze the interrelationships between enterprises. How to have a computer automatically determine the theme of each enterprise from the enterprise's basic information is a technical problem that remains to be solved.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical drawbacks mentioned.

Therefore, the invention aims to provide a method and a system for measuring the theme of the enterprise-related webpage.

In order to achieve the above object, an embodiment of the present invention provides a method for measuring the theme of an enterprise-related webpage, including the following steps:

step S1, acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme;

step S2, counting the number P2 of words in the webpage that meet all of the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;

step S3, looking up the URL of each friend link, checking whether the friend links of the webpage at that URL contain the domain name of the source webpage, and counting the number P3 of friend links that link back;

step S4, counting the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, and the number P5 of links whose URLs do belong to the domain name of the source webpage;

step S5, calculating the number P6 of pictures in the webpage;

step S6, extracting the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a number of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;

step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.

Further, the webpage information includes: the webpage title, the webpage menu, friend links, internal and external links, the picture count, and the menu text.

Further, in step S7, an F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.

Further, in step S7, different weights are set for P1 to P7 and adjusted during debugging.

The embodiment of the present invention further provides an enterprise-related webpage theme measurement system, which includes:

a webpage acquisition module, configured to acquire sample webpage information, extract the webpage theme from the webpage information, and calculate the word count P1 of the webpage theme;

a vocabulary quantity calculation module, configured to count the number P2 of words in the webpage that meet all of the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;

a friend link search module, configured to look up the URL of each friend link, check whether the friend links of the webpage at that URL contain the domain name of the source webpage, and count the number P3 of friend links that link back;

a quantity calculation module, configured to count the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs do belong to the domain name of the source webpage, and the number P6 of pictures in the webpage;

a probability statistics module, configured to extract the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words, take a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculate the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;

and a webpage theme determination module, configured to calculate the parameters P1 to P7 for both the given webpage and the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.

Further, the webpage theme determination module uses an F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.

Further, the webpage theme determination module sets different weights for P1 to P7 and adjusts them during debugging.

According to the method and system for measuring the theme of an enterprise-related webpage of the embodiments of the invention, scores are calculated for a plurality of indexes covering the webpage title, webpage menu, friend links, internal and external links, picture count, menu text, and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurements and score comparison are then applied to webpages fetched by a crawler, which are classified and characterized accordingly to obtain their webpage themes. By generating four-character words from the extracted words for processing, the invention can improve the measurement precision from 60% to about 85% and greatly improve the processing efficiency.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a method for measuring a topic of an enterprise-related web page according to an embodiment of the invention;

FIG. 2 is a block diagram of a system for measuring the theme of an enterprise-related webpage according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.

The invention provides a method and a system for measuring the theme of an enterprise-related webpage, which determine whether a given webpage belongs to a preset theme category by analyzing the information of the given webpage.

As shown in FIG. 1, the method for measuring the theme of an enterprise-related webpage in the embodiment of the present invention includes the following steps:

Step S1, acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme.

In one embodiment of the invention, the webpage information includes: the webpage title, the webpage menu, friend links, internal and external links, the picture count, and the menu text.

It should be noted that the above is only an example of the type of the web page information, and is not intended to limit the present invention. The web page information in the present invention may also include other contents, which are not described herein again.
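By way of non-limiting illustration, step S1 might be sketched as follows. The sketch assumes that the webpage theme is taken from the HTML `<title>` element and that each non-whitespace character counts as one word, which is a natural reading for Chinese text but is not stated in this description. The names `TitleExtractor` and `title_word_count` are hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_word_count(html: str) -> int:
    """P1, assuming every non-whitespace title character is one word."""
    parser = TitleExtractor()
    parser.feed(html)
    return sum(1 for ch in parser.title if not ch.isspace())
```

A different tokenization, for example a word segmenter, could be substituted without changing the rest of the method.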

Step S2, counting the number P2 of words in the webpage that meet all of the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
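By way of non-limiting illustration, the count P2 of step S2 might be obtained as sketched below. The sketch assumes that a qualifying word is the complete text of an `<a>` element with a non-empty `href`, and that "four-character" means exactly four CJK ideographs; both are interpretations rather than statements of this description. The names `MenuWordCollector` and `four_char_menu_words` are hypothetical.

```python
from html.parser import HTMLParser


def is_four_char_word(text: str) -> bool:
    """Assumed test: exactly four CJK ideographs and nothing else."""
    return len(text) == 4 and all("\u4e00" <= ch <= "\u9fff" for ch in text)


class MenuWordCollector(HTMLParser):
    """Collects four-character anchor texts in document order."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that actually carry a hyperlink are considered.
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            text = "".join(self._buffer).strip()
            if is_four_char_word(text):
                self.words.append(text)


def four_char_menu_words(html: str) -> list:
    """Returns the qualifying words; P2 is the length of this list."""
    collector = MenuWordCollector()
    collector.feed(html)
    return collector.words
```

Under these assumptions P2 is simply `len(four_char_menu_words(html))`.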

Step S3, looking up the URL of each friend link, checking whether the friend links of the webpage at that URL contain the domain name of the source webpage, and counting the number P3 of friend links that link back.
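By way of non-limiting illustration, the back-link count P3 of step S3 might be computed as sketched below. The sketch assumes that the friend-link URLs have already been identified, that pages are retrieved through a caller-supplied `fetch` function (hypothetical, so that the example runs without network access), and that two URLs share a domain name when their host names are equal. It also checks every link on the linked page rather than only that page's friend-link section, which is a simplification.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_back(page_html: str, source_domain: str) -> bool:
    """True if any link on the page points at the source domain."""
    collector = HrefCollector()
    collector.feed(page_html)
    return any(urlparse(h).hostname == source_domain for h in collector.hrefs)


def count_backlinks(source_url: str,
                    friend_urls: Iterable[str],
                    fetch: Callable[[str], str]) -> int:
    """P3: number of friend-link pages that link back to the source page."""
    source_domain = urlparse(source_url).hostname
    return sum(1 for url in friend_urls
               if links_back(fetch(url), source_domain))
```

Host-name equality treats `www.a.example` and `a.example` as different domains; a production system would likely normalize them.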

Step S4, counting the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, and the number P5 of links whose URLs do belong to the domain name of the source webpage.

Step S5, counting the number P6 of pictures in the webpage.
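By way of non-limiting illustration, the counts P4, P5, and P6 of steps S4 and S5 might be gathered in a single pass over the HTML, as sketched below. The sketch assumes that links are `<a href>` elements, that pictures are `<img>` elements, that relative URLs are resolved against the page URL, and that links without a host (such as `mailto:`) are ignored. The names `LinkImageCounter` and `count_links_and_images` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkImageCounter(HTMLParser):
    """Counts external links (P4), internal links (P5), and images (P6)."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.domain = urlparse(page_url).hostname
        self.external = 0  # P4
        self.internal = 0  # P5
        self.images = 0    # P6

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links so they count as internal.
            host = urlparse(urljoin(self.page_url, attrs["href"])).hostname
            if host == self.domain:
                self.internal += 1
            elif host is not None:
                self.external += 1
        elif tag == "img":
            self.images += 1


def count_links_and_images(page_url: str, html: str) -> tuple:
    """Returns (P4, P5, P6) for the page."""
    counter = LinkImageCounter(page_url)
    counter.feed(html)
    return counter.external, counter.internal, counter.images
```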

Step S6, extracting the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a number of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.

In one embodiment of the invention, the number of words extracted may be tuned in engineering practice.
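By way of non-limiting illustration, one possible reading of P7 is the fraction of the given webpage's menu words that also occur in the menu word list of the sample webpage. The sketch below follows that reading; the function name `menu_overlap_probability` and the default of ten words are assumptions, since this description leaves the number of extracted words to engineering tuning.

```python
def menu_overlap_probability(page_words, sample_words, limit=10):
    """P7 as the share of the first `limit` page words found in the sample.

    page_words and sample_words are lists of four-character words in the
    order in which they appear in the HTML.
    """
    sequence = page_words[:limit]
    if not sequence:
        return 0.0  # no menu words: treat the overlap as zero
    sample = set(sample_words)
    hits = sum(1 for word in sequence if word in sample)
    return hits / len(sequence)
```

For example, if three of four menu words on the given webpage also occur in the sample menu, this reading gives P7 = 0.75.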

Step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme. The webpage theme may be, for example, a media website, an industry portal, an official enterprise website, or an e-commerce website. It should be noted that the types of webpage theme are not limited to these and may include other types, which are not described in detail here.

In one embodiment of the present invention, in actual engineering calculations, an F-test is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that different weights are set for P1 to P7 and adjusted during debugging.
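By way of non-limiting illustration, the comparison of step S7 might be sketched as follows. This description does not give the exact statistic or decision threshold, so the sketch makes several assumptions: each parameter is multiplied by its weight, the F statistic is taken as the ratio of the larger to the smaller sample variance of the two weighted parameter vectors, and the theme whose sample vector yields the ratio closest to 1 is selected. The names `f_statistic` and `closest_theme` are hypothetical.

```python
from statistics import variance


def f_statistic(given, sample, weights=None):
    """Ratio of the larger to the smaller variance of two weighted vectors."""
    if weights is None:
        weights = [1.0] * len(given)
    if not (len(given) == len(sample) == len(weights)):
        raise ValueError("given, sample and weights must have equal length")
    weighted_given = [w * x for w, x in zip(weights, given)]
    weighted_sample = [w * x for w, x in zip(weights, sample)]
    var_g = variance(weighted_given)
    var_s = variance(weighted_sample)
    larger, smaller = max(var_g, var_s), min(var_g, var_s)
    if smaller == 0:
        # Both constant: identical spread. Otherwise: maximally dissimilar.
        return 1.0 if larger == 0 else float("inf")
    return larger / smaller


def closest_theme(given, samples, weights=None):
    """Picks the theme whose sample vector gives the F statistic nearest 1.

    samples maps a theme name to the averaged P1..P7 of its sample webpages.
    """
    return min(samples, key=lambda name: f_statistic(given, samples[name], weights))
```

A fuller implementation would compare the statistic against a critical value of the F distribution; that step is omitted here because the description does not specify a significance level.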

As shown in FIG. 2, the system for measuring the theme of an enterprise-related webpage in the embodiment of the present invention includes: a webpage acquisition module 1, a vocabulary quantity calculation module 2, a friend link search module 3, a quantity calculation module 4, a probability statistics module 5, and a webpage theme determination module 6.

Specifically, the webpage acquisition module 1 is configured to acquire sample webpage information, extract the webpage theme from the webpage information, and calculate the word count P1 of the webpage theme.

The vocabulary quantity calculation module 2 is configured to count the number P2 of words in the webpage that meet all of the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.

The friend link search module 3 is configured to look up the URL of each friend link, check whether the friend links of the webpage at that URL contain the domain name of the source webpage, and count the number P3 of friend links that link back.

The quantity calculation module 4 is configured to count the number P4 of links in the webpage whose URLs do not belong to the domain name of the source webpage, the number P5 of links whose URLs do belong to the domain name of the source webpage, and the number P6 of pictures in the webpage.

The probability statistics module 5 is configured to extract the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words, take a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculate the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm.

The webpage theme determination module 6 is configured to calculate the parameters P1 to P7 for both the given webpage and the sample webpage, and calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.

In one embodiment of the invention, the webpage theme determination module 6 uses an F-test to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that the webpage theme determination module 6 sets different weights for P1 to P7 and adjusts them during debugging.

According to the method and system for measuring the theme of an enterprise-related webpage of the embodiments of the invention, scores are calculated for a plurality of indexes covering the webpage title, webpage menu, friend links, internal and external links, picture count, menu text, and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurements and score comparison are then applied to webpages fetched by a crawler, which are classified and characterized accordingly to obtain their webpage themes. By generating four-character words from the extracted words for processing, the invention can improve the measurement precision from 60% to about 85% and greatly improve the processing efficiency.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A method for measuring the theme of an enterprise related webpage is characterized by comprising the following steps:

step S1, acquiring sample enterprise webpage information, extracting a webpage theme from the webpage information, and calculating the word number P1 of the webpage title;

step S2, counting the number P2 of words in the webpage that meet all of the following conditions: the word is individually enclosed by an HTML tag, carries a hyperlink, and is a four-character word;

step S5, calculating the number P6 of pictures in the webpage;

step S6, extracting the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a number of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;

step S7, calculating the above parameters P1 to P7 for the given web page and the sample enterprise web page, and calculating the variances of P1 to P7 of the given web page and P1 to P7 of the sample enterprise web page, so as to obtain the similarity between the given web page and the sample enterprise web page, and determine the web page theme.

2. The method of claim 1, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.

3. The method for measuring theme of business-related web pages of claim 1, wherein in the step S7, the F-test method is used to calculate the variance between P1 to P7 of a given web page and P1 to P7 of a sample business web page.

4. The method for measuring theme of enterprise-related web pages as claimed in claim 1, wherein in the step S7, different weights are set for P1 to P7 and adjusted during debugging.

5. An enterprise-related web page theme measurement system, comprising:

the webpage obtaining module is used for obtaining sample enterprise webpage information, extracting a webpage theme from the webpage information and calculating the word number P1 of the webpage title;

the probability statistics module is used for extracting the words in the webpage that are individually enclosed by an HTML tag, carry a hyperlink, and are four-character words, taking a number of these words, in the order in which they appear in the HTML, as a word list sequence, and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are assembled from the words extracted from the webpage according to phonetic rhythm;

and the webpage theme determining module is used for calculating the parameters P1 to P7 for the given webpage and the sample enterprise webpage, calculating the variances of the P1 to P7 of the given webpage and the P1 to P7 of the sample enterprise webpage so as to obtain the similarity of the given webpage and the sample enterprise webpage and determine the webpage theme.

6. The enterprise-related web page theme measurement system of claim 5, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.

7. The system of claim 5, wherein the web topic determination module calculates the variance of P1 through P7 for a given web page from P1 through P7 for a sample business web page using an F-test method.

8. The enterprise-related web page theme measurement system of claim 5, wherein the web page theme determination module sets different weights, which are adjusted during debugging.