CN102737017B - Method and apparatus for extracting page theme - Google Patents

Publication number
CN102737017B
Authority
CN
China
Prior art keywords
word
paragraph
confidence
page
segmentation processing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110080852.2A
Other languages
Chinese (zh)
Other versions
CN102737017A (en)
Inventor
刘海浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd


Abstract

The invention provides a method and an apparatus for extracting a page theme. The method comprises: A. acquiring candidate paragraphs that express the page theme; B. if a candidate paragraph that can be re-segmented exists, segmenting that candidate paragraph; otherwise performing step C; C. calculating the confidence of each paragraph obtained after step B; and D. taking a paragraph whose confidence meets a preset confidence requirement as the page theme paragraph. With the method and the apparatus, the page theme can be determined more accurately, and the deviation between the extracted page theme and the actual page theme can be reduced.

Description

A method and apparatus for extracting a page theme
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for extracting a page theme.
[Background]
Ranking in page search, determining page theme words, and other tasks all involve obtaining the page theme. For example, in page search ranking, pages whose theme is more relevant to the query are ranked higher, and page theme words are usually extracted from the page theme.
At present, the entire title paragraph (title) of a page is usually taken as the page theme. However, the title of a page may contain multiple paragraphs, some of which are unrelated to the page theme, and this causes the extracted theme to drift. When such a theme is used for page search ranking, the results may not accurately meet the user's needs; when it is used to determine page theme words, the resulting words may not accurately reflect the page theme.
[Summary of the Invention]
The invention provides a method and apparatus for extracting a page theme, so as to reduce the deviation between the extracted page theme and the actual page theme.
The specific technical solution is as follows:
A method for extracting a page theme, the method comprising:
A. acquiring the candidate paragraphs in the page that express the page theme;
B. if there is a candidate paragraph that can be re-segmented, segmenting that candidate paragraph; otherwise performing step C;
C. calculating the confidence of each paragraph obtained after step B;
D. taking a paragraph whose confidence meets a preset confidence requirement as a page theme paragraph.
The candidate paragraphs acquired in step A include at least one of the following:
the page title paragraph tagged title, the page title line tagged realtitle, the navigation paragraph tagged mypos, and the front link tagged preanchor.
Specifically, in step B, if there is a candidate paragraph that contains a symbol of a preset type, that candidate paragraph is determined to be re-segmentable and is segmented using the symbols of the preset type as separators.
The symbols of the preset type include: punctuation marks, spaces, underscores, slashes, or brackets.
In addition, step C specifically comprises:
C1. performing word segmentation on each paragraph obtained after step B;
C2. calculating the confidence of each word obtained by word segmentation according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained by segmenting the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
C3. obtaining the confidence of each paragraph from the confidences of the words it contains.
In step C3, the confidence $D_i$ of the i-th paragraph may be the average of its word confidences, $D_i = \frac{1}{n}\sum_{j=1}^{n} D_{ij}$, where $n$ is the number of words obtained by segmenting the i-th paragraph.
Preferably, before step C or step D, the method further comprises:
filtering out, according to a preset website dictionary, any paragraph in which the content found in the website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
Specifically, the confidence requirement in step D includes: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks in the top N among the paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs; where N is a preset positive integer.
Further, the method comprises performing the following steps on each page theme paragraph:
E. performing word segmentation on the page theme paragraph;
F. performing part-of-speech tagging on each word obtained by word segmentation;
G. performing at least one of the following filtering operations on the words obtained by word segmentation:
filtering out the words contained in a preset stop-word list;
filtering out the words that cannot express meaning independently;
if words with a hypernym-hyponym relationship exist among the words, filtering out the hypernym; and
filtering out page type attribute words;
H. determining the words that remain after step G as the theme words of the page.
Here, filtering out page type attribute words comprises:
if the page is of a preset page type, filtering out the type attribute words of the page from the words obtained by word segmentation; the preset page types include: video, novel, audio, game, or forum.
An apparatus for extracting a page theme, the apparatus comprising: a paragraph acquiring unit, a segmentation processing unit, a confidence computation unit, and a theme paragraph determining unit;
the paragraph acquiring unit is configured to acquire the candidate paragraphs in the page that express the page theme and provide them to the segmentation processing unit;
the segmentation processing unit is configured to send candidate paragraphs that cannot be re-segmented to the confidence computation unit, and to segment candidate paragraphs that can be re-segmented and send the resulting paragraphs to the confidence computation unit;
the confidence computation unit is configured to calculate the confidence of each paragraph sent by the segmentation processing unit;
the theme paragraph determining unit is configured to take, according to the calculation result of the confidence computation unit, a paragraph whose confidence meets a preset confidence requirement as a page theme paragraph.
The candidate paragraphs acquired by the paragraph acquiring unit include at least one of the following:
the page title paragraph tagged title, the page title line tagged realtitle, the navigation paragraph tagged mypos, and the front link tagged preanchor.
Specifically, if the segmentation processing unit determines that there is a candidate paragraph containing a symbol of a preset type, it determines that this candidate paragraph can be re-segmented and segments it using the symbols of the preset type as separators.
The symbols of the preset type include: punctuation marks, spaces, underscores, slashes, or brackets.
Specifically, the confidence computation unit may comprise: a first word segmentation subunit, a first computation subunit, and a second computation subunit;
the first word segmentation subunit is configured to perform word segmentation on each paragraph sent by the segmentation processing unit;
the first computation subunit is configured to calculate the confidence of each word produced by the first word segmentation subunit according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained by segmenting the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
the second computation subunit is configured to obtain the confidence of each paragraph from the confidences of the words it contains.
The second computation subunit calculates the confidence $D_i$ of the i-th paragraph according to $D_i = \frac{1}{n}\sum_{j=1}^{n} D_{ij}$, where $n$ is the number of words obtained by segmenting the i-th paragraph.
Preferably, the apparatus further comprises a first filtering unit, configured to filter out, according to a preset website dictionary, any paragraph sent by the segmentation processing unit in which the content found in the website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
The confidence requirement includes: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks in the top N among the paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs; where N is a preset positive integer.
Further, the apparatus also comprises a theme word extraction unit;
the theme word extraction unit specifically comprises: a second word segmentation subunit, a part-of-speech tagging subunit, a filtering subunit, and a theme word determining subunit;
the second word segmentation subunit is configured to perform word segmentation on the page theme paragraph;
the part-of-speech tagging subunit is configured to perform part-of-speech tagging on each word obtained by word segmentation and send the result to the filtering subunit;
the filtering subunit is configured to perform at least one of the following filtering operations on the words obtained by word segmentation:
filtering out the words contained in a preset stop-word list;
filtering out the words that cannot express meaning independently;
if words with a hypernym-hyponym relationship exist among the words, filtering out the hypernym; and
filtering out page type attribute words;
the theme word determining subunit is configured to determine the words that remain after filtering by the filtering subunit as the theme words of the page.
If the filtering subunit determines that the page is of a preset page type, it filters out the type attribute words of the page from the words obtained by word segmentation; the preset page types include: video, novel, audio, game, or forum.
As can be seen from the above technical solution, after acquiring the candidate paragraphs, the present invention segments any candidate paragraph that can be re-segmented, then calculates the confidence of each resulting paragraph and selects the paragraphs that meet the confidence requirement as page theme paragraphs. Splitting the candidate paragraphs further and selecting page theme paragraphs by confidence determines the page theme paragraph more accurately, that is, it reduces the deviation between the extracted page theme and the actual page theme. When the extracted page theme paragraph is used in page search ranking, the user's needs can be met more accurately; when it is used to determine page theme words, the theme words reflect the page theme more accurately.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the method for extracting a page theme provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of the method for calculating the confidence of each paragraph provided by Embodiment Two of the present invention;
Fig. 3 is a flowchart of the method for extracting page theme words provided by Embodiment Three of the present invention;
Fig. 4 is a structural diagram of the apparatus for extracting a page theme provided by Embodiment Four of the present invention.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Embodiment One
Fig. 1 is a flowchart of the method for extracting a page theme provided by Embodiment One of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: acquire the candidate paragraphs in the page that express the page theme.
In this step, the candidate paragraphs that express the page theme are those paragraphs that may reflect the page theme, and may specifically include, but are not limited to, at least one of the following:
the page title paragraph tagged title, the page title line tagged realtitle, the navigation paragraph tagged mypos, and the front link tagged preanchor.
For example, for the page http://www.22zw.cn/XH/91H53969KX/, the four paragraphs obtained are as follows:
The page title paragraph tagged title, whose content consists of four parts joined by separator symbols: "latest chapters of the broken firmament of bucket", "the fast book soon in the broken firmament of bucket", "giant silkworm potato", and "22 Chinese network".
The page title line tagged realtitle, whose content is: "the broken firmament of bucket".
The navigation paragraph tagged mypos, which has no corresponding content in this page.
The front link tagged preanchor, whose content is: "latest chapters of the broken firmament of bucket".
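As a minimal sketch of step 101, the candidate paragraphs can be held as labelled records, with labels that have no content (such as mypos in the example page) simply skipped. The `Candidate` class, the `collect_candidates` helper, and the input dictionary are illustrative assumptions; how the labelled text is obtained from the page is not specified in the text above.

```python
from dataclasses import dataclass

# Labels named in the text; everything else here is a hypothetical helper.
LABELS = ("title", "realtitle", "mypos", "preanchor")


@dataclass(frozen=True)
class Candidate:
    label: str  # which labelled region the text came from
    text: str   # the paragraph content


def collect_candidates(labeled_text: dict[str, str]) -> list[Candidate]:
    """Return one Candidate per known label that has non-empty content."""
    candidates = []
    for label in LABELS:
        text = (labeled_text.get(label) or "").strip()
        if text:
            candidates.append(Candidate(label, text))
    return candidates


# Hypothetical page: mypos is present but empty, so it yields no candidate.
page = {
    "title": "NovelX_LatestChapters_AuthorY_SiteZ",
    "realtitle": "NovelX",
    "mypos": "",
    "preanchor": "NovelX_LatestChapters",
}
candidates = collect_candidates(page)
```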
Step 102: segment those of the acquired candidate paragraphs that can be re-segmented.
This step is optional: if none of the candidate paragraphs can be re-segmented, this step is not performed.
To determine whether a candidate paragraph can be re-segmented, check whether it contains a symbol of a preset type. If it does, the candidate paragraph is considered re-segmentable; otherwise it is not. Correspondingly, the segmentation strategy may be to split the candidate paragraph using the symbols of the preset type as separators.
The symbols of the preset type may include, but are not limited to: punctuation marks, spaces, underscores, slashes, and brackets.
For example, segmenting the page title paragraph tagged title with the symbols of the preset type as separators yields the following four paragraphs:
Title paragraph 1: latest chapters of the broken firmament of bucket
Title paragraph 2: the fast book soon in the broken firmament of bucket
Title paragraph 3: giant silkworm potato
Title paragraph 4: 22 Chinese network
None of the other candidate paragraphs can be re-segmented.
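Step 102 can be sketched with a regular expression that treats the preset symbol types as separators. The exact character set below is an assumption (the text names only the symbol types), and the example title is hypothetical.

```python
import re

# Assumed separator set: whitespace, underscore, slashes, brackets, and some
# ASCII and full-width punctuation. The patent lists only the symbol types.
SEPARATORS = re.compile(r"[\s_/\\|,;:!?\-()\[\]{}<>，。；：！？、（）【】《》]+")


def resegment(paragraph: str) -> list[str]:
    """Split a candidate paragraph on the preset separators, dropping empties."""
    return [part for part in SEPARATORS.split(paragraph) if part]


def can_resegment(paragraph: str) -> bool:
    """A paragraph is re-segmentable if splitting yields more than one part."""
    return len(resegment(paragraph)) > 1


title = "NovelX_LatestChapters_AuthorY_SiteZ"  # hypothetical title content
parts = resegment(title)
```

A paragraph without any separator comes back as a single-element list, which matches the case where a candidate paragraph is passed on unchanged.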
Step 103: calculate the confidence of each paragraph obtained after step 102.
If a candidate paragraph has been segmented, the confidence of each paragraph obtained from that segmentation is calculated; if a candidate paragraph has not been segmented, the confidence of the candidate paragraph itself is calculated.
The method for calculating the confidence of each paragraph is described in detail in Embodiment Two.
Before step 103 or step 104 is performed, a filtering step may also be included to filter out paragraphs related to the website. This can be implemented with a preset website dictionary that contains various site names: if the content of a paragraph that appears in the website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold, the paragraph is filtered out. For example, the content of title paragraph 4, "22 Chinese network", is a site name; this site name can be placed in the website dictionary in advance, so that title paragraph 4 is filtered out before step 104 is performed.
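The website-dictionary filter can be sketched as follows. Measuring the ratio in characters and the default threshold of 0.8 are assumptions; the text only requires that the proportion of dictionary content in the paragraph reach a preset threshold.

```python
def site_ratio(paragraph: str, site_dict: set[str]) -> float:
    """Fraction of the paragraph's characters covered by dictionary entries."""
    if not paragraph:
        return 0.0
    covered = [False] * len(paragraph)
    for name in site_dict:
        if not name:
            continue
        start = paragraph.find(name)
        while start != -1:
            for k in range(start, start + len(name)):
                covered[k] = True
            start = paragraph.find(name, start + 1)
    return sum(covered) / len(paragraph)


def filter_site_paragraphs(paragraphs, site_dict, ratio_threshold=0.8):
    """Drop paragraphs whose site-name ratio reaches the (assumed) threshold."""
    return [p for p in paragraphs if site_ratio(p, site_dict) < ratio_threshold]


site_dict = {"22 Chinese network"}
kept = filter_site_paragraphs(
    ["giant silkworm potato", "22 Chinese network"], site_dict
)
```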
Step 104: take the paragraphs whose confidence meets the preset confidence requirement as page theme paragraphs (maintitle).
The preset confidence requirement may be: the confidence of the paragraph reaches a preset confidence threshold; or the confidence of the paragraph ranks in the top N among the paragraphs; or the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs. N is a preset positive integer.
Suppose that, after filtering, the confidences of the paragraphs are as follows:
Title paragraph 1: "latest chapters of the broken firmament of bucket", confidence 0.9
Title paragraph 2: "the fast book soon in the broken firmament of bucket", confidence 0.7
Title paragraph 3: "giant silkworm potato", confidence 0.3
Page title line tagged realtitle: "the broken firmament of bucket", confidence 1.0
Front link tagged preanchor: "latest chapters of the broken firmament of bucket", confidence 0.9
The paragraph with the highest confidence can be selected as the maintitle, that is, "the broken firmament of bucket". Alternatively, different descriptions can be treated as supplements to the page theme, and several paragraphs can be selected as parallel maintitles. For example, selecting the paragraphs whose confidence is at least 0.9 and ranks in the top 2 gives "the broken firmament of bucket" and "latest chapters of the broken firmament of bucket" as maintitles.
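The three forms of the confidence requirement can be covered by one small selection function in which the threshold and N are both optional. The function name and signature are illustrative; the scores are the ones supposed in the example above.

```python
def select_theme_paragraphs(scored, threshold=None, top_n=None):
    """Select paragraphs from (text, confidence) pairs.

    threshold only  -> confidence reaches the threshold
    top_n only      -> confidence ranks in the top N
    both            -> reaches the threshold and ranks in the top N
    """
    kept = [(p, c) for p, c in scored if threshold is None or c >= threshold]
    kept.sort(key=lambda pc: pc[1], reverse=True)  # stable: ties keep input order
    if top_n is not None:
        kept = kept[:top_n]
    return [p for p, _ in kept]


scored = [
    ("latest chapters of the broken firmament of bucket", 0.9),    # title paragraph 1
    ("the fast book soon in the broken firmament of bucket", 0.7),  # title paragraph 2
    ("giant silkworm potato", 0.3),                                 # title paragraph 3
    ("the broken firmament of bucket", 1.0),                        # realtitle
    ("latest chapters of the broken firmament of bucket", 0.9),    # preanchor
]
best = select_theme_paragraphs(scored, top_n=1)
parallel = select_theme_paragraphs(scored, threshold=0.9, top_n=2)
```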
Embodiment Two
Fig. 2 is a flowchart of the method for calculating the confidence of each paragraph provided by Embodiment Two of the present invention. As shown in Fig. 2, the method may comprise the following steps:
Step 201: perform word segmentation on each paragraph.
Preferably, the words obtained by word segmentation may also be filtered against a preset stop-word list. The stop-word list contains words that occur very frequently in ordinary web pages, which may include, but are not limited to, adverbs, function words, modal particles, auxiliary words, and pronouns; such words usually have very little expressive power.
Step 202: calculate the confidence of each word obtained by word segmentation according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$.
Here, $D_{ij}$ is the confidence of the j-th word obtained by segmenting the i-th paragraph, $S_{ij}$ is the frequency with which that word occurs across all paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients.
If $\alpha$ is nonzero, the paragraphs obtained after step 102 of Embodiment One are cross-checked against each other to obtain $S_{ij}$: the number of times each word occurs across all paragraphs obtained after step 102 is counted, and the higher the count, the higher the confidence of the word.
If $\beta$ is nonzero, the expressive power of the word within the page is used to obtain $P_{ij}$: the number of times the word occurs in the page is counted, and the higher the count, the higher the confidence of the word.
Step 203: obtain the confidence of each paragraph from the confidences of the words it contains.
The confidence of a paragraph may be obtained by averaging the confidences of the words it contains, that is, the confidence $D_i$ of the i-th paragraph may be $D_i = \frac{1}{n}\sum_{j=1}^{n} D_{ij}$, where $n$ is the number of words obtained by segmenting the i-th paragraph.
Continuing the example from Embodiment One, word segmentation of title paragraph 1 yields the words "the broken firmament of bucket", "latest", and "chapters". None of these words is in the preset stop-word list, so the same three words remain after stop-word filtering. The confidence of each word is then calculated with the formula in step 202. Because "the broken firmament of bucket" occurs frequently both across the paragraphs and in the web page, it has a relatively high confidence. Finally, the word confidences are summed and averaged to obtain the confidence of title paragraph 1.
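Steps 202 and 203 can be sketched as below. The sketch assumes that "frequency" means a raw occurrence count and that the paragraphs and the page are already tokenized; the default weights of 0.5 are placeholders, and no normalization to the 0 to 1 range used in the example of Embodiment One is attempted.

```python
from collections import Counter


def word_confidences(paragraph_words, page_words, alpha=0.5, beta=0.5):
    """Compute D_ij = alpha * S_ij + beta * P_ij for every word of every paragraph.

    paragraph_words: list of token lists, one per paragraph
    page_words: tokens of the whole page
    """
    s = Counter(w for words in paragraph_words for w in words)  # S: count across paragraphs
    p = Counter(page_words)                                     # P: count in the page
    return [[alpha * s[w] + beta * p[w] for w in words] for words in paragraph_words]


def paragraph_confidences(paragraph_words, page_words, alpha=0.5, beta=0.5):
    """Compute D_i as the mean of the word confidences of paragraph i."""
    per_word = word_confidences(paragraph_words, page_words, alpha, beta)
    return [sum(d) / len(d) if d else 0.0 for d in per_word]


# Hypothetical tokens: "A" stands for a work title that recurs in the page.
paragraphs = [["A", "latest", "chapters"], ["A"], ["author"]]
page_tokens = ["A"] * 4 + ["latest", "chapters", "author"]
confidences = paragraph_confidences(paragraphs, page_tokens)
```

Here the word "A" scores 0.5 * 2 + 0.5 * 4 = 3.0 and every other word scores 1.0, so the paragraph containing only "A" gets the highest confidence.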
This concludes the flow of Embodiment Two.
After the maintitle has been determined by the flow of Embodiment One, it can be used for ranking in page search. When the index for the page is built, the words belonging to the maintitle are marked in the index. At search time, each word of the query is matched against the index, and a page whose index marks a matched word as belonging to the maintitle receives a higher ranking weight in the search results.
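The index marking described here can be illustrated with a toy inverted index in which each posting records whether the word belongs to the page's maintitle. The index layout, the `base` and `boost` weights, and the additive scoring are all assumptions made for illustration, not details given in the text.

```python
def build_index(pages):
    """pages: {page_id: (all_words, maintitle_words)}

    Returns {word: {page_id: in_maintitle}}, marking maintitle words.
    """
    index = {}
    for page_id, (words, maintitle_words) in pages.items():
        marked = set(maintitle_words)
        for w in set(words) | marked:
            index.setdefault(w, {})[page_id] = w in marked
    return index


def rank(query_words, index, base=1.0, boost=2.0):
    """Score pages per matched query word; maintitle hits get an extra boost."""
    scores = {}
    for w in query_words:
        for page_id, in_maintitle in index.get(w, {}).items():
            gain = base + boost if in_maintitle else base
            scores[page_id] = scores.get(page_id, 0.0) + gain
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical pages: "novel" is in the maintitle of p1 only.
pages = {
    "p1": (["novel", "chapter"], ["novel"]),
    "p2": (["novel", "forum"], []),
}
index = build_index(pages)
ranking = rank(["novel"], index)
```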
In addition, the maintitle determined by the flow of Embodiment One can also be used to extract page theme words (keywords). This process is described below in Embodiment Three.
Embodiment Three
Fig. 3 is a flowchart of the method for extracting page theme words provided by Embodiment Three of the present invention. As shown in Fig. 3, the method may comprise the following steps:
Step 301: perform word segmentation on the maintitle determined in Embodiment One.
If only one maintitle has been determined for the page, the flow of Embodiment Three is performed for that maintitle only; if several maintitles have been determined, the flow is performed for each maintitle separately.
Step 302: perform part-of-speech tagging on each word obtained by word segmentation.
Step 303: filter the words obtained by word segmentation against a preset stop-word list.
This step filters out the words that are contained in the stop-word list. The stop-word list contains words that occur very frequently in web pages, which may include, but are not limited to: adverbs, function words, modal particles, auxiliary words, and pronouns.
Step 304: filter out the words that cannot express meaning independently.
A word that cannot express meaning independently can be identified from the probability that it forms a single word when combined with a neighbouring word in its context. If the probability that a word and an adjacent word combine into one word exceeds a preset threshold, the word is determined to be non-independent, and it should be combined with that adjacent word to form one word.
If full-granularity word segmentation was used in step 301, step 304 can be performed. If the word segmentation in step 301 already took non-independent words into account and directly combined them into single words, so that every word obtained expresses meaning independently, this step is not performed.
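A minimal sketch of step 304, assuming a lookup table `pair_prob` that gives the probability that two adjacent words form a single word. The table, its values, and the threshold of 0.5 are hypothetical; the text does not say how the probability is estimated.

```python
def filter_dependent_words(words, pair_prob, threshold=0.5):
    """Drop words that, with an adjacent word, likely form one combined word.

    pair_prob: {(left_word, right_word): probability of forming one word}
    """
    dependent = set()
    for left, right in zip(words, words[1:]):
        if pair_prob.get((left, right), 0.0) > threshold:
            dependent.update((left, right))
    return [w for w in words if w not in dependent]


# With full-granularity segmentation the combined word is already present,
# so only its fragments need to be removed.
words = ["real estate", "three", "swordsmen", "video", "Three Musketeers"]
pair_prob = {("three", "swordsmen"): 0.9}  # assumed probability
independent = filter_dependent_words(words, pair_prob)
```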
Step 305: analyse the words obtained by word segmentation for hypernym-hyponym relationships, and if two words stand in such a relationship, filter out the hypernym.
The analysis is based on a preset hypernym-hyponym vocabulary, which records the hypernym-hyponym relationships between words.
If words with a hypernym-hyponym relationship exist among the words obtained by word segmentation, the hypernym can be filtered out, because the hypernym has less expressive power than the hyponym and the hyponym usually already covers the meaning of the hypernym.
For example, if word segmentation of a query yields both "Guangdong" and "Guangzhou", where "Guangdong" is the hypernym of "Guangzhou", the hypernym "Guangdong" can be filtered out and the word "Guangzhou" retained.
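The hypernym filter of step 305 can be sketched with a dictionary mapping each hyponym to its hypernym. Representing the preset hypernym-hyponym vocabulary as a flat mapping with one hypernym per word is a simplifying assumption.

```python
def filter_hypernyms(words, hypernym_of):
    """Remove a hypernym when one of its hyponyms is also present.

    hypernym_of: {hyponym: hypernym}, a stand-in for the preset vocabulary
    """
    present = set(words)
    to_drop = {hypernym_of[w] for w in present if hypernym_of.get(w) in present}
    return [w for w in words if w not in to_drop]


vocabulary = {"Guangzhou": "Guangdong"}
filtered = filter_hypernyms(["Guangdong", "Guangzhou", "hotel"], vocabulary)
```

A hypernym that appears without any of its hyponyms is left in place, since there is then no more specific word to cover its meaning.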
Step 306: filter out page type attribute words from the words obtained by word segmentation.
If the page is of a preset page type, the type attribute words of the page are filtered out; if the page is not of a preset page type, the filtering of this step is not performed. The preset page types may include, but are not limited to: video, novel, audio, game, and forum.
For example, suppose the page is of the video type, that is, the content the page provides is video, and the words obtained by segmenting the maintitle include "video". The word "video" contributes nothing to the theme of such a page, so it is filtered out. If the page is a blog page, however, the word "video" does contribute to the theme of the page, so it is not filtered out.
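Step 306 can be sketched as a lookup of attribute words per preset page type. Only the pairing of the video type with the word "video" comes from the text; the other entries in the table are placeholders.

```python
# Attribute words per preset page type. Only "video" -> {"video"} is taken
# from the text; the remaining entries are assumed for illustration.
TYPE_ATTRIBUTE_WORDS = {
    "video": {"video"},
    "novel": {"novel"},
    "audio": {"audio"},
    "game": {"game"},
    "forum": {"forum"},
}


def filter_type_words(words, page_type):
    """Drop the page's type attribute words; other page types pass through."""
    drop = TYPE_ATTRIBUTE_WORDS.get(page_type, set())
    return [w for w in words if w not in drop]


on_video_page = filter_type_words(["real estate", "video"], "video")
on_blog_page = filter_type_words(["real estate", "video"], "blog")
```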
It should be noted that any one of steps 303, 304, 305, and 306 may be performed alone, or they may be performed in any combination. If they are performed in combination, they may be performed in any order.
Step 307: determine the words that remain after the above filtering as the keywords of the page.
An example of the flow of Embodiment Three follows. Suppose the maintitle is: "Watched the video of the real estate Three Musketeers today".
Full-granularity word segmentation of this maintitle yields the following words: "today", "watched", an auxiliary particle, "real estate", "three", "swordsmen", a second auxiliary particle, "video", and "Three Musketeers". Part-of-speech tagging then gives: "today" is a noun, "watched" is a verb, the first particle is an auxiliary word, "real estate" is a noun, "three" is a numeral, the second particle is an auxiliary word, "video" is a noun, and "Three Musketeers" is a noun.
Filtering against the stop-word list removes the two auxiliary particles, "watched", and "today".
The non-independent words "three" and "swordsmen" are filtered out.
If the page containing this maintitle is a content page that does not belong to any preset page type, no page type attribute word filtering is performed on this maintitle.
The keywords finally obtained for this page are: "real estate", "video", and "Three Musketeers".
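The worked example can be replayed with a small filter pipeline. The sketch treats each filtering step as a function from a word list to a word list, which also illustrates that the steps can be combined in any order. The stop-word set, the non-independent word set, and the `PARTICLE_1` and `PARTICLE_2` stand-ins for the two auxiliary particles are assumptions made for illustration.

```python
def extract_keywords(words, filters):
    """Apply each filter in turn; whatever remains is the keyword list."""
    for apply_filter in filters:
        words = apply_filter(words)
    return words


# Stand-ins for the resources the steps rely on (all assumed).
STOP_WORDS = {"today", "watched", "PARTICLE_1", "PARTICLE_2"}
DEPENDENT_WORDS = {"three", "swordsmen"}

tokens = ["today", "watched", "PARTICLE_1", "real estate", "three",
          "swordsmen", "PARTICLE_2", "video", "Three Musketeers"]

filters = [
    lambda ws: [w for w in ws if w not in STOP_WORDS],       # step 303
    lambda ws: [w for w in ws if w not in DEPENDENT_WORDS],  # step 304
    # Step 306 is skipped: the page is not of a preset page type.
]
keywords = extract_keywords(tokens, filters)
```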
After keywords have been extracted in the manner described in Embodiment Three, they can be marked in the page. When the pages in the search results are ranked, if the query hits a keyword of a page, the ranking weight of that page can be raised, so that the ranking of the search results better meets the user's needs and the search quality improves.
Be more than the detailed description that method provided by the present invention is carried out, below by embodiment four, device provided by the present invention be described in detail.
Embodiment four,
Fig. 4 is a structural diagram of the apparatus for extracting a page theme provided by Embodiment four of the present invention. As shown in Fig. 4, the apparatus may comprise: a paragraph acquiring unit 400, a segmentation processing unit 410, a confidence computation unit 420, and a theme paragraph determining unit 430.
The paragraph acquiring unit 400 is configured to acquire the candidate paragraphs in a page that express the page theme and to provide them to the segmentation processing unit 410.
The segmentation processing unit 410 is configured to send candidate paragraphs that cannot be further segmented to the confidence computation unit 420, and to perform segmentation processing on candidate paragraphs that can be further segmented before sending the resulting paragraphs to the confidence computation unit 420.
The confidence computation unit 420 is configured to calculate the confidence of each paragraph sent by the segmentation processing unit 410.
The theme paragraph determining unit 430 is configured to take, according to the calculation result of the confidence computation unit 420, a paragraph whose confidence meets a preset confidence requirement as a page theme paragraph.
The preset confidence requirement comprises: the confidence of the paragraph reaches a preset confidence threshold; or the confidence of the paragraph ranks in the top N among the paragraphs; or the confidence of the paragraph both reaches the preset confidence threshold and ranks in the top N among the paragraphs; where N is a preset positive integer.
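The three forms of the confidence requirement can be expressed as one selection function. This is a sketch under assumptions: the function name, the dictionary input, and the reading of "reaches" as "greater than or equal to" are illustrative choices, not details given in the text.

```python
def select_theme_paragraphs(confidences, threshold=None, top_n=None):
    """Pick theme paragraphs from a dict mapping paragraph -> confidence.

    threshold only : keep paragraphs whose confidence reaches the threshold
    top_n only     : keep the N highest-confidence paragraphs
    both           : keep paragraphs that satisfy both conditions
    """
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if threshold is not None:
        ranked = [(p, c) for p, c in ranked if c >= threshold]
    return [p for p, _c in ranked]

scores = {"a": 0.9, "b": 0.4, "c": 0.7}
print(select_theme_paragraphs(scores, threshold=0.5))           # ['a', 'c']
print(select_theme_paragraphs(scores, top_n=1))                 # ['a']
print(select_theme_paragraphs(scores, threshold=0.8, top_n=2))  # ['a']
```

Allowing more than one paragraph to pass is what lets several parallel theme paragraphs be extracted for the same page.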
The candidate paragraphs acquired by the paragraph acquiring unit 400 may comprise at least one of the following:
the page title line labeled title, the page title paragraph labeled realtitle, the navigation paragraph labeled mypos, and the front link labeled preanchor.
Specifically, if the segmentation processing unit 410 determines that a candidate paragraph contains a symbol of a preset type, it determines that this candidate paragraph can be further segmented and segments it using the symbols of the preset type as separators. A candidate paragraph that contains no symbol of a preset type is determined to be a candidate paragraph that cannot be further segmented.
The symbols of the preset type may include, but are not limited to: punctuation marks, spaces, underscores, slashes, or brackets.
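A minimal sketch of separator-based segmentation follows. The exact separator set is an assumption (the text only names the symbol categories), as are the function names `can_resegment` and `resegment`.

```python
import re

# Assumed separator set: whitespace, underscore, slashes, brackets and a few
# common ASCII and full-width punctuation marks.
SEPARATORS = re.compile(r"[\s_/\\|,;:!?\-()\[\]{}<>，。；：！？、（）【】《》]+")

def can_resegment(paragraph):
    """A candidate paragraph can be segmented again if it contains a separator."""
    return SEPARATORS.search(paragraph) is not None

def resegment(paragraph):
    """Split on the preset separators; otherwise return the paragraph unchanged."""
    if not can_resegment(paragraph):
        return [paragraph]
    return [part for part in SEPARATORS.split(paragraph) if part]

print(resegment("RealEstateThreeMusketeers_video-ExampleSite"))
# ['RealEstateThreeMusketeers', 'video', 'ExampleSite']
print(resegment("PlainTitle"))
# ['PlainTitle']
```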
The confidence computation unit 420 may specifically comprise: a first word segmentation subunit 421, a first computation subunit 422, and a second computation subunit 423.
The first word segmentation subunit 421 is configured to perform word segmentation on each paragraph sent by the segmentation processing unit 410.
The first computation subunit 422 is configured to calculate the confidence of each word obtained after word segmentation by the first word segmentation subunit 421 according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients.
The second computation subunit 423 is configured to obtain the confidence of each paragraph from the confidences of the words contained in that paragraph.
The second computation subunit 423 may calculate the confidence $D_i$ of the i-th paragraph according to a formula (not reproduced in this text), where N is the number of words obtained after word segmentation of the i-th paragraph.
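The word-level formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$ can be sketched as follows. Three things here are assumptions rather than statements of the patent: whitespace splitting stands in for a real word segmenter, frequencies are taken as raw counts, and the paragraph confidence $D_i$ is taken as the mean of $D_{ij}$ over the N words of the paragraph, since the paragraph-level formula is missing from this text.

```python
from collections import Counter

def tokenize(text):
    # Stand-in for a real word segmenter (assumption): split on whitespace.
    return text.split()

def word_confidences(paragraphs, page_text, alpha=0.5, beta=0.5):
    """Per paragraph, list (word, D_ij) with D_ij = alpha * S_ij + beta * P_ij."""
    para_tokens = [tokenize(p) for p in paragraphs]
    s_counts = Counter(w for tokens in para_tokens for w in tokens)  # S: all paragraphs
    p_counts = Counter(tokenize(page_text))                          # P: whole page
    return [[(w, alpha * s_counts[w] + beta * p_counts[w]) for w in tokens]
            for tokens in para_tokens]

def paragraph_confidences(paragraphs, page_text, alpha=0.5, beta=0.5):
    # Assumption: D_i is the average of D_ij over the N words of paragraph i.
    result = []
    for scored in word_confidences(paragraphs, page_text, alpha, beta):
        n = len(scored)
        result.append(sum(d for _w, d in scored) / n if n else 0.0)
    return result

paragraphs = ["estate video", "site home"]
page_text = "estate video estate news site"
print(paragraph_confidences(paragraphs, page_text))  # [1.25, 0.75]
```

In this toy input, "estate" occurs twice in the page, which lifts the first paragraph above the second; the weights 0.5 and 0.5 are placeholders for the preset coefficients.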
Further, the apparatus may also comprise a first filter unit 440, configured to filter out, according to a preset website dictionary, any paragraph sent by the segmentation processing unit 410 to the confidence computation unit 420 in which content appearing in the website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
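One way to read the website dictionary filter is as a character coverage ratio. The sketch below assumes that reading; the dictionary entries, the 0.5 threshold, and the function names are hypothetical.

```python
def dictionary_ratio(paragraph, site_dictionary):
    """Fraction of the paragraph's characters covered by dictionary entries."""
    if not paragraph:
        return 0.0
    covered = [False] * len(paragraph)
    for entry in site_dictionary:
        if not entry:
            continue
        start = paragraph.find(entry)
        while start != -1:
            for k in range(start, start + len(entry)):
                covered[k] = True        # overlapping hits are counted once
            start = paragraph.find(entry, start + 1)
    return sum(covered) / len(paragraph)

def filter_site_paragraphs(paragraphs, site_dictionary, ratio_threshold=0.5):
    """Drop paragraphs whose dictionary coverage reaches the threshold."""
    return [p for p in paragraphs
            if dictionary_ratio(p, site_dictionary) < ratio_threshold]

site_dictionary = {"ExampleSite", "Home"}
paragraphs = ["ExampleSite Home", "Real estate video"]
print(filter_site_paragraphs(paragraphs, site_dictionary))
# ['Real estate video']
```

The first paragraph is almost entirely site name and navigation text (15 of 16 characters), so it is removed before confidence is computed.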
After the page theme paragraph is determined by the above apparatus, it can be used for ranking in page search. That is, when the index of the page is built, the words belonging to the page theme paragraph are marked in the index. At search time, each word in the query is matched against the index, and the ranking weight in the search results is raised for the page corresponding to an index entry in which the matched word is marked as belonging to the page theme paragraph.
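The index marking and weight boost can be illustrated with a toy inverted index. The data layout, the additive scoring, and the weights `base_weight` and `theme_boost` are assumptions made for this sketch; the text does not say how much the ranking weight is raised.

```python
from collections import defaultdict

def build_index(pages):
    """pages: page_id -> {"words": [...], "theme_words": [...]}.
    Returns word -> {page_id: True if the word belongs to the theme paragraph}."""
    index = defaultdict(dict)
    for page_id, page in pages.items():
        theme = set(page["theme_words"])
        for word in set(page["words"]) | theme:
            index[word][page_id] = word in theme
    return index

def rank(query_words, index, base_weight=1.0, theme_boost=2.0):
    """Score pages per matched query word; theme-marked matches score higher."""
    scores = defaultdict(float)
    for word in query_words:
        for page_id, is_theme in index.get(word, {}).items():
            scores[page_id] += base_weight + (theme_boost if is_theme else 0.0)
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))

pages = {
    "p1": {"words": ["estate", "video", "news"], "theme_words": ["estate"]},
    "p2": {"words": ["estate", "loan"], "theme_words": ["loan"]},
}
index = build_index(pages)
print(rank(["estate"], index))  # ['p1', 'p2']
```

Both pages contain "estate", but only p1 has it marked as a theme word, so p1 ranks first.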
In addition, the page theme paragraph determined by the above apparatus can also be used for extracting keywords. In this case, the apparatus may further comprise a keyword extraction unit 450.
The keyword extraction unit 450 may specifically comprise: a second word segmentation subunit 451, a part-of-speech tagging subunit 452, a filtering subunit 453, and a theme word determination subunit 454.
The second word segmentation subunit 451 is configured to perform word segmentation on the page theme paragraph determined by the theme paragraph determining unit 430.
The part-of-speech tagging subunit 452 is configured to perform part-of-speech tagging on each word obtained after word segmentation and to send the result to the filtering subunit 453.
The filtering subunit 453 is configured to perform at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relation with each other exist among the words obtained after word segmentation, filtering out the hypernym from those words; and,
filtering out, from the words obtained after word segmentation, the page-type attribute words.
The theme word determination subunit 454 is configured to determine the words remaining after filtering by the filtering subunit 453 as the keywords of the page.
If the filtering subunit 453 determines that the page is of a preset page type, it filters out the type attribute word of the page from the words obtained after word segmentation. The preset page types include: video type, novel type, audio type, game type, or forum type.
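Of the four filter operations, the hypernym rule is the least self-explanatory: when a broader term and a narrower term both survive segmentation, only the narrower one is kept. A minimal sketch, assuming a hypothetical hyponym-to-hypernym lookup table (the text does not say where such relations come from):

```python
# Hypothetical lookup: each word maps to the set of its broader terms.
HYPERNYMS = {
    "sedan": {"car", "vehicle"},
    "car": {"vehicle"},
}

def drop_hypernyms(words):
    """Remove any word that is a hypernym of another word in the same list."""
    present = set(words)
    to_drop = set()
    for word in present:
        to_drop |= HYPERNYMS.get(word, set()) & present
    return [w for w in words if w not in to_drop]

print(drop_hypernyms(["car", "sedan", "price"]))  # ['sedan', 'price']
print(drop_hypernyms(["car", "price"]))           # ['car', 'price']
```

A word is dropped only when one of its narrower terms is actually present, so "car" is kept when "sedan" does not appear.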
After keywords are extracted by the apparatus shown in Fig. 4, they can be marked in the page. When pages are ranked in search results, if the query hits a keyword of a page, the ranking weight of that page can be raised, so that the ranking of the search results better meets the user's needs and the search quality improves.
As can be seen from the above description, the method and apparatus provided by the present invention have the following advantages:
1) The present invention further segments the candidate paragraphs and selects the page theme paragraph according to confidence, so the page theme paragraph can be determined more accurately; that is, the deviation between the extracted page theme and the actual page theme is reduced.
2) When page theme paragraphs are extracted, the confidence requirement can be set flexibly, so that parallel page theme paragraphs can be extracted and different descriptions can supplement one another as the page theme.
3) When the extracted page theme paragraph is applied to ranking in page search, the user's needs can be met more accurately and the user experience improves.
4) When the extracted page theme paragraph is further applied to extracting page theme words, the page theme words reflect the page theme more accurately.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A method for extracting a page theme, characterized in that the method comprises:
A. acquiring candidate paragraphs in a page that express the page theme;
B. if a candidate paragraph that can be further segmented exists, performing segmentation processing on that candidate paragraph; otherwise performing step C;
C. calculating the confidence of each paragraph obtained after step B;
D. taking a paragraph whose confidence meets a preset confidence requirement as a page theme paragraph; wherein,
said step C specifically comprises:
C1. performing word segmentation on each paragraph obtained after said step B;
C2. calculating the confidence of each word obtained after word segmentation according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across said paragraphs, $P_{ij}$ is the frequency with which that word occurs in said page, and $\alpha$ and $\beta$ are preset weighting coefficients;
C3. obtaining the confidence of each of said paragraphs from the confidences of the words contained in that paragraph; wherein the confidence $D_i$ of the i-th paragraph may be given by a formula (not reproduced in this text), where N is the number of words obtained after word segmentation of the i-th paragraph.
2. The method according to claim 1, characterized in that the candidate paragraphs acquired in said step A comprise at least one of the following:
the page title line labeled title, the page title paragraph labeled realtitle, the navigation paragraph labeled mypos, and the front link labeled preanchor.
3. The method according to claim 1, characterized in that, in said step B, if a candidate paragraph containing a symbol of a preset type exists, it is determined that this candidate paragraph can be further segmented, and segmentation processing is performed on it using the symbols of said preset type as separators.
4. The method according to claim 3, characterized in that the symbols of said preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
5. The method according to claim 1, characterized in that, before said step C or said step D, the method further comprises:
filtering out, according to a preset website dictionary, any of said paragraphs in which content appearing in said website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
6. The method according to claim 1, characterized in that the confidence requirement in said step D comprises: the confidence of the paragraph reaches a preset confidence threshold; or,
the confidence of the paragraph ranks in the top N among said paragraphs; or,
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among said paragraphs; where N is a preset positive integer.
7. The method according to any one of claims 1 to 6, characterized in that the method further comprises performing the following steps on each said page theme paragraph:
E. performing word segmentation on said page theme paragraph;
F. performing part-of-speech tagging on each word obtained after word segmentation;
G. performing at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relation with each other exist among the words obtained after word segmentation, filtering out the hypernym from those words; and,
filtering out, from the words obtained after word segmentation, the page-type attribute words;
H. determining the words that remain, after step G is performed on the words obtained after word segmentation, as the theme words of said page.
8. The method according to claim 7, characterized in that filtering out the page-type attribute words from the words obtained after word segmentation comprises:
if said page is of a preset page type, filtering out the type attribute word of said page from the words obtained after word segmentation; wherein said preset page type comprises: video type, novel type, audio type, game type, or forum type.
9. An apparatus for extracting a page theme, characterized in that the apparatus comprises: a paragraph acquiring unit, a segmentation processing unit, a confidence computation unit, and a theme paragraph determining unit;
said paragraph acquiring unit is configured to acquire the candidate paragraphs in a page that express the page theme and to provide them to said segmentation processing unit;
said segmentation processing unit is configured to send candidate paragraphs that cannot be further segmented to said confidence computation unit, and to perform segmentation processing on candidate paragraphs that can be further segmented before sending the resulting paragraphs to said confidence computation unit;
said confidence computation unit is configured to calculate the confidence of each paragraph sent by said segmentation processing unit;
said theme paragraph determining unit is configured to take, according to the calculation result of said confidence computation unit, a paragraph whose confidence meets a preset confidence requirement as a page theme paragraph; wherein,
said confidence computation unit specifically comprises: a first word segmentation subunit, a first computation subunit, and a second computation subunit;
said first word segmentation subunit is configured to perform word segmentation on each paragraph sent by said segmentation processing unit;
said first computation subunit is configured to calculate the confidence of each word obtained after word segmentation by said first word segmentation subunit according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across said paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
said second computation subunit is configured to obtain the confidence of each of said paragraphs from the confidences of the words contained in that paragraph; wherein said second computation subunit calculates the confidence $D_i$ of the i-th paragraph according to a formula (not reproduced in this text), where N is the number of words obtained after word segmentation of the i-th paragraph.
10. The apparatus according to claim 9, characterized in that the candidate paragraphs acquired by said paragraph acquiring unit comprise at least one of the following:
the page title line labeled title, the page title paragraph labeled realtitle, the navigation paragraph labeled mypos, and the front link labeled preanchor.
11. The apparatus according to claim 9, characterized in that, if said segmentation processing unit determines that a candidate paragraph containing a symbol of a preset type exists, it determines that this candidate paragraph can be further segmented and performs segmentation processing on it using the symbols of the preset type as separators.
12. The apparatus according to claim 11, characterized in that the symbols of said preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
13. The apparatus according to claim 9, characterized in that the apparatus further comprises a first filter unit, configured to filter out, according to a preset website dictionary, any paragraph sent by said segmentation processing unit in which content appearing in said website dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
14. The apparatus according to claim 9, characterized in that said confidence requirement comprises: the confidence of the paragraph reaches a preset confidence threshold; or,
the confidence of the paragraph ranks in the top N among said paragraphs; or,
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among said paragraphs; where N is a preset positive integer.
15. The apparatus according to any one of claims 9 to 14, characterized in that the apparatus further comprises a keyword extraction unit;
said keyword extraction unit specifically comprises: a second word segmentation subunit, a part-of-speech tagging subunit, a filtering subunit, and a theme word determination subunit;
said second word segmentation subunit is configured to perform word segmentation on said page theme paragraph;
said part-of-speech tagging subunit is configured to perform part-of-speech tagging on each word obtained after word segmentation and to send the result to said filtering subunit;
said filtering subunit is configured to perform at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relation with each other exist among the words obtained after word segmentation, filtering out the hypernym from those words; and,
filtering out, from the words obtained after word segmentation, the page-type attribute words;
said theme word determination subunit is configured to determine the words remaining after filtering by said filtering subunit as the theme words of said page.
16. The apparatus according to claim 15, characterized in that, if said filtering subunit determines that said page is of a preset page type, it filters out the type attribute word of said page from the words obtained after word segmentation; wherein said preset page type comprises: video type, novel type, audio type, game type, or forum type.
CN201110080852.2A 2011-03-31 2011-03-31 Method and apparatus for extracting page theme Active CN102737017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110080852.2A CN102737017B (en) 2011-03-31 2011-03-31 Method and apparatus for extracting page theme

Publications (2)

Publication Number Publication Date
CN102737017A CN102737017A (en) 2012-10-17
CN102737017B true CN102737017B (en) 2015-03-11

Family

ID=46992542

Country Status (1)

Country Link
CN (1) CN102737017B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant