CN107357781B - System and method for identifying relevance between webpage title and text - Google Patents

Publication number
CN107357781B
Authority
CN
China
Prior art keywords
title
information
text
keywords
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710516064.0A
Other languages
Chinese (zh)
Other versions
CN107357781A (en)
Inventor
胡玥莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Caitu Information Technology Co., Ltd
Original Assignee
Shanghai Caitu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Caitu Information Technology Co Ltd
Priority to CN201710516064.0A
Publication of CN107357781A
Application granted
Publication of CN107357781B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Abstract

The invention provides a system and a method for identifying the degree of association between a webpage title and its text, and relates to the technical field of network communication. The system comprises: a link extraction unit for extracting link information; a keyword unit for extracting the title information of the title pointed to by the cursor, extracting a plurality of keywords from the title information, and opening the text in the background according to the link information to extract text information; a judging unit for judging the sentence pattern of the title information, the sentence patterns comprising statement sentences and question sentences; a relevance calculation unit for calculating the frequency with which the keywords occur in the text information and calculating the weight of the keywords in the text information according to the sentence pattern of the title information; and a display unit for displaying the frequency and the weight beside the title. The system starts after the cursor has stayed on the title for longer than a preset time. By moving the cursor onto a title, a user obtains the degree of association between the title and the text it points to, so the system screens out invalid junk information on the user's behalf and avoids wasted reading time.

Description

System and method for identifying relevance between webpage title and text
Technical Field
The invention relates to the technical field of network communication, in particular to a system and a method for identifying the association degree of a webpage title and a text.
Background
Today, information networks reach into every aspect of daily life, and people read large amounts of online information every day to follow the news, pick up general knowledge, or pass the time. However, this information always includes a large number of articles or posts whose titles do not match their content: the authors deliberately use exaggerated, sensational titles to attract users to click and read. Reading articles whose content is seriously inconsistent with the title wastes a great deal of users' time and leaves them feeling deceived.
Disclosure of Invention
It is an object of the present invention to provide a system and method for identifying the relevance between the title and the text of a web page, so that a user who sees an attractive title can obtain the relevance between the title and the text it points to by moving the cursor onto the title. The invention thereby screens out invalid and junk information for users and avoids wasted reading time.
In particular, the present invention provides a system for identifying the degree of association between a webpage title and its text, wherein a mouse cursor points to a title that includes link information for opening the text associated with the title, the system comprising:
a link extraction unit for extracting the link information;
a keyword unit for extracting the title information of the title pointed to by the cursor, extracting a plurality of keywords from the title information, and opening the text in the background according to the link information to extract text information;
a judging unit for judging the sentence pattern of the title information, wherein the sentence patterns comprise statement sentences and question sentences;
a relevance calculation unit for calculating the frequency with which the keywords occur in the text information and calculating the weight of the keywords in the text information according to the sentence pattern of the title information; and
a display unit for displaying the frequency and the weight beside the title;
wherein the system starts after the cursor has stayed on the title for longer than a predetermined time.
Further, in the case where the sentence pattern is a statement sentence, the relevance calculation unit calculates the weight according to the position of the keywords in the text information.
Further, in the case where the sentence pattern is a question sentence, the relevance calculation unit calculates the weight according to how the question is answered in the text information.
Further, the keyword unit extracts nouns, verbs and/or adjectives in the title information as the keywords.
Further, the system comprises a part-of-speech analysis unit for analyzing whether the nouns among the keywords have multiple meanings.
According to another aspect, the present invention also provides a method for identifying the degree of association between a webpage title and its text, wherein a mouse cursor points to a title that includes link information for opening the text associated with the title, the method comprising the steps of:
a link extraction step: extracting the link information;
a keyword step: extracting the title information of the title pointed to by the cursor, extracting a plurality of keywords from the title information, and opening the text in the background according to the link information to extract text information;
a judging step: judging the sentence pattern of the title information, wherein the sentence patterns comprise statement sentences and question sentences;
a relevance calculation step: calculating the frequency with which the keywords occur in the text information, and calculating the weight of the keywords in the text information according to the sentence pattern of the title information;
a display step: displaying the frequency and the weight beside the title;
wherein the link extraction step starts after the cursor has stayed on the title for longer than a predetermined time.
Further, in the case where the sentence pattern is a statement sentence, the relevance calculation step calculates the weight according to the position of the keywords in the text information.
Further, in the case where the sentence pattern is a question sentence, the relevance calculation step calculates the weight according to how the question is answered in the text information.
Further, the keywords are nouns, verbs and/or adjectives in the title information.
Further, the method comprises a part-of-speech analysis step of analyzing whether the nouns among the keywords have multiple meanings.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the invention will be described in detail hereinafter, by way of illustration and not limitation, with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic diagram of a system for identifying a degree of association of a web page title with a body text according to one embodiment of the present invention;
FIG. 2 is a flow chart of a method for identifying a degree of association of a web page title with a body text according to another aspect of the present invention.
The symbols in the drawings represent the following meanings:
1. title; 2. link information; 3. text; 4. link extraction unit; 5. keyword unit; 6. judgment unit; 7. relevance calculation unit; 8. display unit.
Detailed Description
The invention provides a system for identifying the relevance between a webpage title and its text, wherein a mouse cursor points to a title, and the title includes link information for opening the text associated with the title. A user who sees an attractive title moves the cursor onto it to obtain the relevance between the title and the text it points to. The invention thereby screens out invalid and junk information for users and avoids wasted reading time.
As shown in FIG. 1, the system includes a link extraction unit 4, a keyword unit 5, a judgment unit 6, a relevance calculation unit 7, and a display unit 8. Generally, when the cursor changes from an arrow shape to a hand shape, the system determines that the cursor is resting on the title 1. While a webpage is being browsed, the system starts after the cursor has stayed on the title 1 for longer than a predetermined time. The predetermined time is set to 3 seconds or another suitable period.
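The start condition described above can be sketched as a small state holder. This is only an illustration under stated assumptions: the class name `HoverDwell`, its methods, and the use of plain timestamps in seconds are hypothetical and do not come from the patent, which only specifies that the system starts once the dwell time exceeds the predetermined time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoverDwell:
    """Tracks how long the cursor has rested on one title (hypothetical helper)."""
    threshold_s: float = 3.0              # the predetermined time; 3 s is the example in the text
    _entered_at: Optional[float] = None   # timestamp at which the cursor reached the title
    _fired: bool = False                  # whether the analysis was already started for this hover

    def enter(self, now: float) -> None:
        # Cursor moved onto the title (e.g. it changed from an arrow to a hand).
        self._entered_at = now
        self._fired = False

    def leave(self) -> None:
        # Cursor left the title; forget the current hover.
        self._entered_at = None
        self._fired = False

    def should_start(self, now: float) -> bool:
        # True exactly once per hover, the first time the dwell exceeds the threshold.
        if self._entered_at is None or self._fired:
            return False
        if now - self._entered_at > self.threshold_s:
            self._fired = True
            return True
        return False
```

Firing only once per hover keeps the background fetch of the text from being repeated while the cursor stays on the same title.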
After the system starts, the link extraction unit 4 first extracts the link information 2. The keyword unit 5 opens the text 3 in the background according to the link information 2 to extract the text information. The keyword unit 5 then extracts the title information of the title 1 pointed to by the cursor and extracts a plurality of keywords from it. The keyword unit 5 extracts nouns, verbs and/or adjectives in the title information as the keywords.
Further, the system comprises a part-of-speech analysis unit for analyzing whether the nouns among the keywords have multiple meanings. For example, "hackle" is both the name of a public figure and a kind of tree. As another example, a "walkthrough graph" differs in meaning from (all people) having walked through (having gone away). The part-of-speech analysis unit may be one of the intelligent analysis systems known in the prior art, which can automatically expand and learn their word banks.
Then, the judgment unit 6 judges the sentence pattern of the title information. The sentence patterns comprise statement sentences and question sentences. Once the keywords and the text information have been obtained, the relevance calculation unit 7 starts. The relevance calculation unit 7 calculates the frequency with which the keywords occur in the text information and calculates the weight of the keywords in the text information according to the sentence pattern of the title information. In the case where the sentence pattern is a statement sentence, the relevance calculation unit 7 calculates the weight according to the position of the keywords in the text information. In the case where the sentence pattern is a question sentence, the relevance calculation unit 7 calculates the weight according to how the question is answered in the text information.
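As a rough illustration of how the link information and the title information might be read from a page, the following sketch uses Python's standard `html.parser` module. The class `TitleLinkExtractor` and the function `extract_title_links` are hypothetical names, and treating every `<a href>` element as a title is an assumption; fetching the linked text in the background is left out.

```python
from html.parser import HTMLParser


class TitleLinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (link information, title information)
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text pieces seen inside that <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside a linked <a> element counts as title text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_title_links(html: str):
    """Return every (href, title text) pair found in the fragment."""
    parser = TitleLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_title_links('<h2><a href="/news/42">Ten <b>facts</b> about tea</a></h2>')` yields one pair whose first element is the link and whose second element is the title text with inline markup removed.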
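The patent leaves the multiple-meaning analysis to existing analysis systems. Purely as an illustration, a lookup against a sense inventory could look like the sketch below; the `SENSES` table, its entries, and the function `ambiguous_nouns` are hypothetical and stand in for whatever lexicon such a system maintains.

```python
# Hypothetical sense inventory; a real system would use a learned or curated lexicon.
SENSES = {
    "apple": {"fruit", "company"},
    "python": {"snake", "programming language"},
    "tea": {"beverage"},
}


def ambiguous_nouns(nouns, senses=SENSES):
    """Return the nouns that have more than one recorded sense, in input order."""
    return [n for n in nouns if len(senses.get(n.lower(), ())) > 1]
```

Nouns missing from the inventory are treated as unambiguous here, which is a simplification.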
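The patent does not give formulas for the sentence-pattern judgment, the frequency, or the position-based weight, so the following is a minimal sketch under assumptions: a title counts as a question if it ends with a question mark or begins with a common question word, frequency is a whole-word count, and the position-based weight is the share of keywords that appear in the first fifth of the text. All function names and the `lead_fraction` parameter are hypothetical.

```python
import re

# Assumed cue list for English titles; not taken from the patent.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def sentence_pattern(title: str) -> str:
    """Classify a title as 'question' or 'statement' using surface cues only."""
    t = title.strip().lower()
    first = t.split()[0] if t else ""
    if t.endswith(("?", "？")) or first in QUESTION_WORDS:
        return "question"
    return "statement"


def keyword_frequency(keywords, body: str) -> dict:
    """Count whole-word, case-insensitive occurrences of each keyword in the body."""
    words = re.findall(r"\w+", body.lower())
    return {k: words.count(k.lower()) for k in keywords}


def position_weight(keywords, body: str, lead_fraction: float = 0.2) -> float:
    """Share of keywords that appear in the leading part of the body (0.0 to 1.0)."""
    if not keywords:
        return 0.0
    words = re.findall(r"\w+", body.lower())
    lead = set(words[: max(1, int(len(words) * lead_fraction))])
    return sum(k.lower() in lead for k in keywords) / len(keywords)
```

Weighting the opening of the text more heavily reflects the idea that an article matching its title tends to mention the title's keywords early; other position schemes would fit the description equally well.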
Further, to make the result easy to read, the weight has multiple levels. In one embodiment, the weight can be divided into: completely unrelated, one-star related, two-star related, subject matter consistent, etc.
Once the frequency and the weight have been obtained, the display unit 8 displays them next to the title 1, for example in a small box beside the title 1. Thus, with the system enabled while browsing a webpage, a user only needs to rest the cursor on a title 1 of interest for a few seconds to see the frequency and the weight, which indicate how the title 1 relates to the text 3, and can decide on that basis whether to read the text 3.
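One way to turn a numeric score into the levels named above is a simple threshold table. The thresholds, the assumption that the score lies between 0 and 1, and the names `LEVELS` and `weight_level` are illustrative only; the patent names the levels but does not say how they are assigned.

```python
# (minimum score, label) pairs, highest first; the thresholds are illustrative assumptions.
LEVELS = [
    (0.75, "subject matter consistent"),
    (0.50, "two-star related"),
    (0.25, "one-star related"),
    (0.00, "completely unrelated"),
]


def weight_level(score: float) -> str:
    """Map a score to the first level whose minimum it reaches, after clamping to [0, 1]."""
    clamped = min(1.0, max(0.0, score))
    for minimum, label in LEVELS:
        if clamped >= minimum:
            return label
    return LEVELS[-1][1]
```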
According to another aspect of the present invention, as shown in fig. 2, the present invention further provides a method for identifying the association degree between the title and the body of a web page, wherein a cursor of a mouse points to a title, the title includes link information for opening the body associated with the title, and the link extraction step starts after the cursor stays on the title for more than a predetermined time. The method comprises the following steps:
S11: a link extraction step: extracting the link information;
S13: a keyword step: extracting the title information of the title pointed to by the cursor, extracting a plurality of keywords from the title information, and opening the text in the background according to the link information to extract text information;
S15: a judging step: judging the sentence pattern of the title information, wherein the sentence patterns comprise statement sentences and question sentences;
S17: a relevance calculation step for statement sentences: calculating the frequency with which the keywords occur in the text information and, in the case where the sentence pattern is a statement sentence, calculating the weight according to the positions of the keywords in the text information;
S19: a relevance calculation step for question sentences: calculating the frequency with which the keywords occur in the text information and, in the case where the sentence pattern is a question sentence, calculating the weight according to how the question is answered in the text information;
S21: a display step: displaying the frequency and the weight next to the title.
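Steps S15 to S21 can be strung together as in the sketch below, which takes an already extracted title, keyword list and text as input. The heuristics are assumptions rather than the patent's method: a trailing question mark marks a question, a question counts as answered when some non-question sentence of the text mentions every keyword, and a statement is weighted by the share of keywords found in the first fifth of the text. The names `words_of` and `analyze` are hypothetical.

```python
import re


def words_of(text):
    """Lower-cased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def analyze(title, keywords, body):
    """Run S15 to S21 on already extracted inputs and return a display string."""
    body_words = words_of(body)
    # S15: judge the sentence pattern (assumed cue: a trailing question mark).
    is_question = title.strip().endswith(("?", "？"))
    # S17/S19, shared part: occurrence frequency of each keyword.
    freq = {k: body_words.count(k.lower()) for k in keywords}
    if is_question:
        # S19 (assumed heuristic): answered if one declarative sentence
        # of the body mentions every keyword.
        sentences = re.split(r"(?<=[.!?])\s+", body)
        answered = bool(keywords) and any(
            not s.strip().endswith("?")
            and all(k.lower() in words_of(s) for k in keywords)
            for s in sentences
        )
        weight = 1.0 if answered else 0.0
    else:
        # S17 (assumed heuristic): share of keywords in the first fifth of the body.
        lead = set(body_words[: max(1, len(body_words) // 5)])
        weight = sum(k.lower() in lead for k in keywords) / len(keywords) if keywords else 0.0
    # S21: text for the small box shown next to the title.
    counts = ", ".join(f"{k}: {n}" for k, n in freq.items())
    return f"frequency [{counts}] | weight {weight:.2f}"
```

The returned string is only a stand-in for the display step; a browser extension would render the same two values in a tooltip or box beside the title.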
In step S21, to make the result easy to read, the weight has multiple levels. In one embodiment, the weight can be divided into: completely unrelated, one-star related, two-star related, subject matter consistent, etc. Thus, with the system enabled while browsing a webpage, a user only needs to rest the cursor on a title of interest for a few seconds to see the frequency and the weight, which indicate how the title relates to the text, and can decide on that basis whether to read the text.
Further, the keywords are nouns, verbs and/or adjectives in the title information.
Further, the method comprises a part-of-speech analysis step of analyzing whether the nouns among the keywords have multiple meanings. For example, "hackle" is both the name of a public figure and a kind of tree. As another example, a "walkthrough graph" differs in meaning from (all people) having walked through (having gone away). The part-of-speech analysis step may rely on one of the intelligent analysis systems known in the prior art, which can automatically expand and learn their word banks.
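The keyword selection can be sketched as a filter over part-of-speech-tagged title words. This assumes an upstream tagger, not shown here, that emits coarse tags such as `NOUN`, `VERB` and `ADJ`; the tag names and the function `select_keywords` are hypothetical.

```python
# Coarse tags to keep; assumed to come from an upstream part-of-speech tagger.
KEEP = {"NOUN", "VERB", "ADJ"}


def select_keywords(tagged_title):
    """Keep nouns, verbs and adjectives from (word, tag) pairs, dropping duplicates."""
    seen, out = set(), []
    for word, tag in tagged_title:
        key = word.lower()
        if tag in KEEP and key not in seen:
            seen.add(key)
            out.append(key)
    return out
```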
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (2)

1. A system for identifying a degree of association of a title of a web page with a body, wherein a cursor of a mouse points to a title including link information for opening the body associated with the title, the system comprising:
a link extraction unit for extracting the link information;
a keyword unit for extracting the title information of the title pointed to by the cursor, extracting a plurality of keywords from the title information, and opening the text in the background according to the link information to extract text information;
a judging unit for judging the sentence pattern of the title information, wherein the sentence patterns comprise statement sentences and question sentences;
a relevance calculation unit for calculating the frequency with which the keywords occur in the text information and calculating the weight of the keywords in the text information according to the sentence pattern of the title information;
a display unit for displaying the frequency and the weight beside the title;
wherein the system starts after the cursor has stayed on the title for longer than a preset time;
wherein, in the case where the sentence pattern is a statement sentence, the relevance calculation unit calculates the weight according to the position of the keywords in the text information;
wherein, in the case where the sentence pattern is a question sentence, the relevance calculation unit calculates the weight according to how the question is answered in the text information;
wherein the keyword unit extracts nouns, verbs and adjectives in the title information as the keywords;
the system further comprises a part-of-speech analysis unit for analyzing whether the nouns among the keywords have multiple meanings;
wherein a cursor of a mouse points to a title, the title comprising link information for opening a text associated with the title;
the method of operation of the system comprises the steps of:
a link extraction step: extracting the link information;
a keyword step: extracting the title information of the title pointed by the cursor, extracting a plurality of keywords in the title information, and opening the text at the background according to the link information to extract text information;
a judging step: judging sentence patterns of the title information, wherein the sentence patterns comprise statement sentences and question sentences;
a relevance calculation step: calculating the frequency with which the keywords occur in the text information, and calculating the weight of the keywords in the text information according to the sentence pattern of the title information;
a display step: displaying the frequency and the weight beside the title;
wherein the link extraction step starts after the cursor has stayed on the title for longer than a preset time;
wherein, in the case where the sentence pattern is a statement sentence, the relevance calculation step calculates the weight according to the position of the keywords in the text information;
wherein, in the case where the sentence pattern is a question sentence, the relevance calculation step calculates the weight according to how the question is answered in the text information.
2. The system for identifying the association degree of a title and a body of a web page as claimed in claim 1, wherein the method for operating the system further comprises a part-of-speech analysis step of analyzing whether the noun in the keyword has multiple meanings.
CN201710516064.0A 2017-06-29 2017-06-29 System and method for identifying relevance between webpage title and text Active CN107357781B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710516064.0A CN107357781B (en) 2017-06-29 2017-06-29 System and method for identifying relevance between webpage title and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710516064.0A CN107357781B (en) 2017-06-29 2017-06-29 System and method for identifying relevance between webpage title and text

Publications (2)

Publication Number Publication Date
CN107357781A 2017-11-17
CN107357781B 2020-12-29

Family

ID=60274081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710516064.0A Active CN107357781B (en) 2017-06-29 2017-06-29 System and method for identifying relevance between webpage title and text

Country Status (1)

Country Link
CN (1) CN107357781B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614625B (en) * 2018-12-17 2022-06-17 北京百度网讯科技有限公司 Method, device and equipment for determining title text relevancy and storage medium
CN111753553B (en) * 2020-07-06 2022-07-05 北京世纪好未来教育科技有限公司 Statement type identification method and device, electronic equipment and storage medium
CN114282092A (en) * 2021-12-07 2022-04-05 咪咕音乐有限公司 Information processing method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102132273A (en) * 2008-06-27 2011-07-20 谷歌公司 Annotating webpage content
CN102541527A (en) * 2010-12-17 2012-07-04 深圳市金蝶中间件有限公司 Hovering prompting system and method
CN103617213A (en) * 2013-11-19 2014-03-05 北京奇虎科技有限公司 Method and system for identifying newspage attributive characters

Also Published As

Publication number Publication date
CN107357781A (en) 2017-11-17

Similar Documents

Publication Publication Date Title
CN109800352B (en) Method, system and terminal device for pushing information based on clipboard
US8051080B2 (en) Contextual ranking of keywords using click data
Lu et al. Opinion integration through semi-supervised topic modeling
US8849725B2 (en) Automatic classification of segmented portions of web pages
CN102708174B (en) Method and device for displaying rich media information in browser
CN102760172B (en) Network searching method and network searching system
US9411886B2 (en) Ranking advertisements with pseudo-relevance feedback and translation models
CN110888990B (en) Text recommendation method, device, equipment and medium
CN107357781B (en) System and method for identifying relevance between webpage title and text
CN105718585B (en) Document and label word justice correlating method and its device
JP2011154668A (en) Method for recommending the most appropriate information in real time by properly recognizing main idea of web page and preference of user
EP2307951A1 (en) Method and apparatus for relating datasets by using semantic vectors and keyword analyses
CN106951530A (en) A kind of event type abstracting method and device
CN105843796A (en) Microblog emotional tendency analysis method and device
CN104881428B (en) A kind of hum pattern extraction, search method and the device of hum pattern webpage
CN109871433B (en) Method, device, equipment and medium for calculating relevance between document and topic
CN103186556A (en) Method for obtaining and searching structural semantic knowledge and corresponding device
Gasparetti et al. Exploiting web browsing activities for user needs identification
CN112989208A (en) Information recommendation method and device, electronic equipment and storage medium
Li et al. Improving relevance judgment of web search results with image excerpts
CN101004753B (en) Method and system for recognizing conception type files
CN103942233B (en) The lobby page recognition methods of directory type web and device
US20170293683A1 (en) Method and system for providing contextual information
CN111737607A (en) Data processing method, data processing device, electronic equipment and storage medium
CN107622125B (en) Information crawling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20201023
    Address after: Room 2036, 4 / F, building 3, Xiushan Road, Chengqiao Town, Chongming District, Shanghai 202150 (Chongming Industrial Park, Shanghai)
    Applicant after: Shanghai Caitu Information Technology Co., Ltd
    Address before: 528100 No. 6 Yundonghai Avenue, Yundonghai Street, Sanshui District, Foshan City, Guangdong Province
    Applicant before: Hu Yueying
GR01 Patent grant