CN107357801B - Enterprise related webpage theme measuring method and system - Google Patents
Enterprise related webpage theme measuring method and system
- Publication number
- CN107357801B (application CN201710354041.4A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- web page
- calculating
- theme
- link
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9574—Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method and a system for measuring the theme of enterprise-related webpages. The method comprises the following steps: acquiring sample webpage information, extracting a webpage theme from the webpage information, and calculating the word count of the webpage theme; calculating the number of words in the webpage that meet the specified conditions; following the URL address of each friend link and checking whether the friend links of the linked webpage contain the domain name of the source webpage; calculating the number of links in the webpage whose URL addresses do not belong to the domain name of the source webpage and the number of links whose URL addresses do; calculating the number of pictures in the webpage; extracting a plurality of words as a word list sequence in the order in which they appear in the HTML, and calculating the probability that each word in the word list sequence also appears in the corresponding sequence of the sample webpage; and calculating these parameters for both a given webpage and the sample webpage, calculating the variance between the two, and determining the webpage theme. The method applies the same measurement and score comparison to webpages crawled by a crawler and classifies them to obtain their webpage themes.
Description
Technical Field
The invention relates to the technical field of computer networks, in particular to a method and a system for measuring enterprise related webpage topics.
Background
Existing comprehensive enterprise-information websites are mostly simple lists of enterprise information, and they mainly summarize and analyze information about a single enterprise. The prior art therefore lacks a way to analyze the relationships between enterprises. How a computer can automatically determine the theme of each enterprise from the enterprise's basic information is a technical problem that remains to be solved.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned.
Therefore, the invention aims to provide a method and a system for measuring the theme of the enterprise-related webpage.
In order to achieve the above object, an embodiment of the present invention provides a method for measuring the theme of an enterprise-related webpage, including the following steps:
step S1, acquiring sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme;
step S2, calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
step S3, following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
step S4, calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, and the number P5 of links whose URL addresses do;
step S5, calculating the number P6 of pictures in the webpage;
step S6, extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage;
step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
Further, in step S7, the F-test method is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, in step S7, P1 to P7 are assigned different weights, which are adjusted during debugging.
The embodiment of the present invention further provides an enterprise-related webpage theme measurement system, which includes:
the webpage obtaining module is used for obtaining sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme;
the vocabulary quantity calculating module is used for calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
the friend link searching module is used for following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
the number calculation module is used for calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, the number P5 of links whose URL addresses do, and the number P6 of pictures in the webpage;
the probability statistics module is used for extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage;
and the webpage theme determining module is used for calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
Further, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
Further, the webpage theme determining module uses the F-test method to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage.
Further, the webpage theme determining module assigns different weights to P1 to P7 and adjusts them during debugging.
According to the method and system for measuring enterprise-related webpage themes of the embodiments of the invention, scores are calculated for several indexes covering the webpage title, webpage menu, friend links, internal and external links, number of pictures, menu text and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurement and score comparison are then applied to webpages crawled by a crawler, which are classified accordingly to obtain their webpage themes. Because the extracted words are assembled into four-character words for processing, the measurement precision can be improved from 60% to about 85%, and the processing efficiency is greatly improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for measuring a topic of an enterprise-related web page according to an embodiment of the invention;
FIG. 2 is a block diagram of a system for measuring a theme of an enterprise-related web page according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The invention provides a method and a system for measuring enterprise-related webpage themes, which judge whether a given webpage belongs to a preset theme classification by analyzing the information of the given webpage.
As shown in FIG. 1, the method for measuring the theme of an enterprise-related webpage in the embodiment of the present invention includes the following steps:
Step S1, obtaining sample webpage information, extracting the webpage theme from the webpage information, and calculating the word count P1 of the webpage theme.
In one embodiment of the invention, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
It should be noted that the above is only an example of the type of the web page information, and is not intended to limit the present invention. The web page information in the present invention may also include other contents, which are not described herein again.
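The patent does not spell out how the word count P1 is obtained. The following is a minimal sketch, assuming that the theme text is taken from the page's `<title>` element (the claims refer to the webpage title) and that the count is the number of non-whitespace characters, since Chinese text has no spaces between words. The names `TitleParser` and `title_length` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_length(html: str) -> int:
    """P1 under the stated assumption: non-whitespace characters in the title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return sum(1 for ch in parser.title if not ch.isspace())
```

A real system would probably segment the title into words with a Chinese tokenizer; the character count is used here only to keep the sketch dependency-free.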
Step S2, calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
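One possible reading of the step S2 conditions is: the whole text of an `<a>` element that has an `href` attribute is exactly four characters long. The sketch below counts such elements under that assumption; `MenuWordParser` and `count_four_char_links` are hypothetical names, and the patent itself does not prescribe a parser.

```python
from html.parser import HTMLParser


class MenuWordParser(HTMLParser):
    """Collects the text of every <a href=...> element, in document order."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that actually carry a hyperlink are considered.
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.link_texts.append("".join(self._buffer).strip())
            self._in_link = False


def count_four_char_links(html: str) -> int:
    """P2 under the stated assumption: hyperlinked texts of exactly four characters."""
    parser = MenuWordParser()
    parser.feed(html)
    parser.close()
    return sum(1 for text in parser.link_texts if len(text) == 4)
```

For a menu containing the links 关于我们, 新闻中心 and 联系, the function counts the first two and skips the third.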
Step S3, following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back.
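The link-back check in step S3 can be sketched as follows. This is an illustration only: it assumes that a friend link "links back" when any absolute link on the friend page points to the source page's host, and it takes a caller-supplied `fetch` function (hypothetical) that returns the HTML of a URL or `None`, so that no particular HTTP library is implied.

```python
import re
from urllib.parse import urlparse

# Matches absolute http(s) URLs in href attributes (a simplification).
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def linked_domains(html: str) -> set:
    """Hosts of all absolute http(s) links found in the HTML."""
    return {urlparse(url).netloc.lower() for url in HREF_RE.findall(html)}


def count_reciprocal_links(source_url: str, friend_urls, fetch) -> int:
    """P3 under the stated assumption: friend pages that link back to the source host."""
    source_domain = urlparse(source_url).netloc.lower()
    count = 0
    for url in friend_urls:
        html = fetch(url)  # hypothetical fetcher; None means the page was unavailable
        if html is not None and source_domain in linked_domains(html):
            count += 1
    return count
```

Passing the fetcher in as a parameter keeps the counting logic testable with an in-memory dictionary of pages, for example `count_reciprocal_links(url, friends, pages.get)`.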
Step S4, calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, and the number P5 of links whose URL addresses do.
Step S5, calculating the number P6 of pictures in the webpage.
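The counts in steps S4 and S5 can be sketched together. The code below assumes that "belonging to the domain name of the source webpage" means an exact host match after relative links are resolved, and that a picture is any `<img>` element; both are simplifying assumptions, and the names `LinkImageParser` and `link_and_image_counts` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkImageParser(HTMLParser):
    """Collects href values of <a> tags and counts <img> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "img":
            self.images += 1


def link_and_image_counts(page_url: str, html: str):
    """Return (P4, P5, P6): external links, internal links, pictures."""
    parser = LinkImageParser()
    parser.feed(html)
    parser.close()
    own_domain = urlparse(page_url).netloc.lower()
    external = internal = 0
    for href in parser.hrefs:
        target = urlparse(urljoin(page_url, href))  # resolve relative links
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if target.netloc.lower() == own_domain:
            internal += 1
        else:
            external += 1
    return external, internal, parser.images
```

An exact host comparison treats `www.a.example` and `a.example` as different domains; comparing registered domains instead would need a public-suffix list, which is outside this sketch.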
Step S6, extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage.
In one embodiment of the invention, the number of extracted words may be tuned in practice to an optimal value.
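The text leaves the exact definition of P7 open. One simple interpretation, used in the sketch below, is the share of the given page's menu words that also occur in the sample page's menu word list. The function name `menu_overlap_probability` and the `limit` parameter (standing in for the tunable number of extracted words) are assumptions, not taken from the patent.

```python
def menu_overlap_probability(page_words, sample_words, limit=10) -> float:
    """P7 under the stated assumption: fraction of the page's first `limit`
    menu words that also appear among the sample page's menu words."""
    words = list(page_words)[:limit]  # keep document order, cap the sequence length
    if not words:
        return 0.0
    sample_set = set(sample_words)
    hits = sum(1 for word in words if word in sample_set)
    return hits / len(words)
```

For example, if two of a page's four menu words also occur in the sample menu, the function returns 0.5.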
Step S7, calculating the parameters P1 to P7 for both the given webpage and the sample webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme. The webpage theme may be, for example, a media website, an industry portal, an official enterprise website, or an e-commerce website. It should be noted that the types of webpage theme are not limited to the above and may also be other types, which are not described herein again.
In one embodiment of the present invention, the F-test method is used in actual engineering calculations to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that P1 to P7 are assigned different weights, which are adjusted during debugging.
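The patent names the F-test but gives no formula. As a hedged illustration, the sketch below treats the weighted values P1 to P7 of each page as one sample and computes the classical variance-ratio statistic, with the larger sample variance in the numerator; a value close to 1 would then suggest similar pages. The function name `f_statistic`, the way the weights are applied, and the handling of zero variance are all assumptions.

```python
from statistics import variance


def f_statistic(page_params, sample_params, weights=None) -> float:
    """Variance-ratio (F) statistic of two weighted parameter vectors.

    Returns the larger sample variance divided by the smaller one, so the
    result is always >= 1.0; identical constant vectors give 1.0.
    """
    if weights is None:
        weights = [1.0] * len(page_params)
    a = [w * p for w, p in zip(weights, page_params)]
    b = [w * p for w, p in zip(weights, sample_params)]
    low, high = sorted((variance(a), variance(b)))
    if low == 0:
        # Avoid division by zero when one vector has no spread at all.
        return float("inf") if high > 0 else 1.0
    return high / low
```

Turning the statistic into an accept or reject decision would additionally require a critical value from the F distribution for the chosen significance level, which this sketch does not compute.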
As shown in FIG. 2, the system for measuring the theme of an enterprise-related webpage in the embodiment of the present invention includes a webpage obtaining module 1, a vocabulary quantity calculating module 2, a friend link searching module 3, a number calculating module 4, a probability statistics module 5 and a webpage theme determining module 6.
Specifically, the web page obtaining module 1 is configured to obtain sample web page information, extract a web page theme from the web page information, and calculate a word number P1 of the web page theme.
In one embodiment of the invention, the web page information includes: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
It should be noted that the above is only an example of the type of the web page information, and is not intended to limit the present invention. The web page information in the present invention may also include other contents, which are not described herein again.
The vocabulary quantity calculating module 2 is used for calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word. It should be noted that webpage menus mostly consist of four-character words.
The friend link searching module 3 is used for following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back.
The number calculating module 4 is used for calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, the number P5 of links whose URL addresses do, and the number P6 of pictures in the webpage.
The probability statistics module 5 is used for extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage.
In one embodiment of the invention, the number of extracted words may be tuned in practice to an optimal value.
The webpage theme determining module 6 is configured to calculate the parameters P1 to P7 for both the given webpage and the sample webpage, and to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage, so as to obtain the similarity between the given webpage and the sample webpage and determine the webpage theme.
In one embodiment of the invention, the webpage theme determining module 6 uses the F-test method to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample webpage. It should be noted that the webpage theme determining module 6 assigns different weights to P1 to P7 and adjusts them during debugging.
According to the method and system for measuring enterprise-related webpage themes of the embodiments of the invention, scores are calculated for several indexes covering the webpage title, webpage menu, friend links, internal and external links, number of pictures, menu text and the like. A certain number of webpages are collected as samples, and a computer calculates the average score of each index over these webpages. The same measurement and score comparison are then applied to webpages crawled by a crawler, which are classified accordingly to obtain their webpage themes. Because the extracted words are assembled into four-character words for processing, the measurement precision can be improved from 60% to about 85%, and the processing efficiency is greatly improved.
In this description, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (8)
1. A method for measuring the theme of an enterprise-related webpage, characterized by comprising the following steps:
step S1, acquiring sample enterprise webpage information, extracting a webpage theme from the webpage information, and calculating the word count P1 of the webpage title;
step S2, calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
step S3, following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
step S4, calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, and the number P5 of links whose URL addresses do;
step S5, calculating the number P6 of pictures in the webpage;
step S6, extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage;
step S7, calculating the parameters P1 to P7 for both the given webpage and the sample enterprise webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage, so as to obtain the similarity between the given webpage and the sample enterprise webpage and determine the webpage theme.
2. The method of claim 1, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
3. The method for measuring the theme of an enterprise-related webpage of claim 1, wherein in step S7, the F-test method is used to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage.
4. The method for measuring the theme of an enterprise-related webpage of claim 1, wherein in step S7, P1 to P7 are assigned different weights, which are adjusted during debugging.
5. An enterprise-related web page theme measurement system, comprising:
the webpage obtaining module is used for obtaining sample enterprise webpage information, extracting a webpage theme from the webpage information, and calculating the word count P1 of the webpage title;
the vocabulary quantity calculating module is used for calculating the number P2 of words in the webpage that meet all of the following conditions: the word is independently enclosed by an HTML tag, carries a hyperlink, and is a four-character word;
the friend link searching module is used for following the URL address of each friend link, checking whether the friend links of the linked webpage contain the domain name of the source webpage, and calculating the number P3 of friend links that link back;
the number calculation module is used for calculating the number P4 of links in the webpage whose URL addresses do not belong to the domain name of the source webpage, the number P5 of links whose URL addresses do, and the number P6 of pictures in the webpage;
the probability statistics module is used for extracting the words in the webpage that are independently enclosed by an HTML tag, carry a hyperlink, and are four-character words; taking a plurality of these words, in the order in which they appear in the HTML, as a word list sequence; and calculating the probability P7 that each word in the word list sequence also appears in the menu word list sequence of the sample enterprise webpage, wherein the four-character words are pieced together, according to phonetic rhythm, from the words extracted from the webpage;
and the webpage theme determining module is used for calculating the parameters P1 to P7 for both the given webpage and the sample enterprise webpage, and calculating the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage, so as to obtain the similarity between the given webpage and the sample enterprise webpage and determine the webpage theme.
6. The enterprise-related web page theme measurement system of claim 5, wherein the web page information comprises: webpage title, webpage menu, friend link, internal and external link, picture quantity and menu characters.
7. The enterprise-related web page theme measurement system of claim 5, wherein the webpage theme determining module uses the F-test method to calculate the variance between P1 to P7 of the given webpage and P1 to P7 of the sample enterprise webpage.
8. The enterprise-related web page theme measurement system of claim 5, wherein the webpage theme determining module assigns different weights to P1 to P7 and adjusts them during debugging.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710354041.4A CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710354041.4A CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357801A (en) | 2017-11-17 |
CN107357801B (en) | 2021-05-28 |
Family
ID=60271916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710354041.4A Active CN107357801B (en) | 2017-05-18 | 2017-05-18 | Enterprise related webpage theme measuring method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357801B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102117339A (en) * | 2011-03-30 | 2011-07-06 | 曹晓晶 | Filter supervision method specific to unsecure web page texts |
CN103838792A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Method for determining webpage theme |
CN104331449A (en) * | 2014-10-29 | 2015-02-04 | 百度在线网络技术(北京)有限公司 | Method and device for determining similarity between inquiry sentence and webpage, terminal and server |
CN105589892A (en) * | 2014-11-12 | 2016-05-18 | 中国银联股份有限公司 | Webpage theme analysis method based on anchor text backtracking chain |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7454061B2 (en) * | 2003-06-27 | 2008-11-18 | Ricoh Company, Ltd. | System, apparatus, and method for providing illegal use research service for image data, and system, apparatus, and method for providing proper use research service for image data |
Non-Patent Citations (1)
Title |
---|
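Clustering and topic extraction of massive short texts based on frequent itemsets (基于频繁项集的海量短文本聚类与主题抽取); Peng Min (彭敏) et al.; Journal of Computer Research and Development (《计算机研究与发展》); 2015-12-31; Vol. 52, No. 9; pp. 1941-1953 * |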
Also Published As
Publication number | Publication date |
---|---|
CN107357801A (en) | 2017-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102054016B (en) | For capturing and manage the system and method for community intelligent information | |
CN102054015B (en) | System and method of organizing community intelligent information by using organic matter data model | |
CN102982153B (en) | A kind of information retrieval method and device thereof | |
US8630972B2 (en) | Providing context for web articles | |
CN102411587B (en) | Webpage classification method and device | |
US7962523B2 (en) | System and method for detecting templates of a website using hyperlink analysis | |
CN110602045B (en) | Malicious webpage identification method based on feature fusion and machine learning | |
CA2460538A1 (en) | Information analyzing method and apparatus | |
US8560518B2 (en) | Method and apparatus for building sales tools by mining data from websites | |
CN107451120B (en) | Content conflict detection method and system for open text information | |
CN102156746A (en) | Method for evaluating performance of search engine | |
US20100235342A1 (en) | Tagging system using internet search engine | |
CN103729354B (en) | web information processing method and device | |
CN107357801B (en) | Enterprise related webpage theme measuring method and system | |
CN109948015B (en) | Meta search list result extraction method and system | |
KR20120090131A (en) | Method, system and computer readable recording medium for providing search results | |
CN110955845A (en) | User interest identification method and device, and search result processing method and device | |
CN104978431B (en) | Web data fusion method and device | |
CN114706948A (en) | News processing method and device, storage medium and electronic equipment | |
CN107545020A (en) | A kind of determination method and device of Web page classifying | |
EP3040932A1 (en) | A method for tracking discussion in social media | |
KR101402339B1 (en) | System and method of managing document | |
CN108153817B (en) | Intelligent web page data acquisition method | |
CN111666749A (en) | Hot article identification method | |
Barua et al. | Removing noise content from online news articles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||