CN102737017A - Method and apparatus for extracting page theme - Google Patents

Publication number
CN102737017A
Authority
CN
China
Prior art keywords
paragraph
word
confidence
page
segmentation processing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100808522A
Other languages
Chinese (zh)
Other versions
CN102737017B (en)
Inventor
刘海浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110080852.2A
Publication of CN102737017A
Application granted
Publication of CN102737017B
Legal status: Active

Abstract

The invention provides a method and an apparatus for extracting a page theme. The method comprises: A. acquiring candidate paragraphs in a page that express the page theme; B. if any candidate paragraph can be further segmented, segmenting that candidate paragraph, and otherwise proceeding to step C; C. calculating the confidence of each paragraph obtained after step B; and D. taking each paragraph whose confidence meets a preset confidence requirement as a page theme paragraph. With the method and the apparatus, the page theme can be determined more accurately, and the deviation between the extracted page theme and the actual page theme can be reduced.

Description

Method and apparatus for extracting a page theme
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method and apparatus for extracting a page theme.
[Background Art]
Ranking in page search, determining page topic words, and other tasks all involve obtaining the page theme. For example, in page search ranking, the higher the relevance between the page theme and the query, the higher the page is ranked; page topic words are usually extracted from the page theme; and so on.
At present, the entire title paragraph (title) of the page is usually taken as the page theme. However, the title of a page may contain several paragraphs, some of which are unrelated to the page theme, and this causes the page theme to drift. When such a theme is used in page search ranking, the results may not meet user needs accurately; when it is used to determine page topic words, the topic words determined may not reflect the page theme accurately.
[Summary of the Invention]
The invention provides a method and apparatus for extracting a page theme, so as to reduce the deviation between the extracted page theme and the actual page theme.
The specific technical solution is as follows:
A method for extracting a page theme, the method comprising:
A. obtaining candidate paragraphs in a page that express the page theme;
B. if there is a candidate paragraph that can be further segmented, segmenting that candidate paragraph; otherwise, executing step C;
C. calculating the confidence of each paragraph obtained after step B;
D. taking each paragraph whose confidence satisfies a preset confidence requirement as a page theme paragraph.
The candidate paragraphs obtained in step A comprise at least one of the following:
the page title paragraph labeled title, the page heading line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
Specifically, in step B, if a candidate paragraph contains a symbol of a preset type, that candidate paragraph is determined to be further segmentable and is segmented using the symbols of the preset type as separators.
The symbols of the preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
In addition, step C specifically comprises:
C1. performing word segmentation on each paragraph obtained after step B;
C2. calculating the confidence of each word obtained from word segmentation according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, where $D_{ij}$ is the confidence of the j-th word obtained from word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs in all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
C3. obtaining the confidence of each paragraph from the confidences of the words it contains. In step C3, the confidence $D_i$ of the i-th paragraph can be:
$$D_i = \frac{1}{N} \sum_{j=1}^{N} D_{ij}$$
where N is the number of words obtained from word segmentation of the i-th paragraph.
Preferably, before step C or step D, the method further comprises:
according to a preset site dictionary, filtering out any paragraph in which the content that appears in the site dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
Specifically, the confidence requirement in step D comprises: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks in the top N among the paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs; where N is a preset positive integer.
Further, the method also comprises performing the following steps on each page theme paragraph:
E. performing word segmentation on the page theme paragraph;
F. performing part-of-speech tagging on each word obtained from word segmentation;
G. performing at least one of the following filtering operations on the words obtained from word segmentation:
filtering out the words contained in a preset stop-word list;
filtering out the words that cannot express meaning independently;
if the words include words that are in a hypernym-hyponym relation with each other, filtering out the hypernym; and
filtering out page-type attribute words;
H. determining the words that remain after step G as the topic words of the page.
Filtering out page-type attribute words comprises:
if the page is of a preset page type, filtering out the type attribute words of the page from the words obtained from word segmentation; where the preset page types comprise: video, novel, audio, game, or forum.
An apparatus for extracting a page theme, the apparatus comprising: a paragraph acquiring unit, a segmentation unit, a confidence calculation unit, and a theme paragraph determining unit;
the paragraph acquiring unit is configured to obtain candidate paragraphs in a page that express the page theme and provide them to the segmentation unit;
the segmentation unit is configured to send candidate paragraphs that cannot be further segmented to the confidence calculation unit, and to segment candidate paragraphs that can be further segmented and send the resulting paragraphs to the confidence calculation unit;
the confidence calculation unit is configured to calculate the confidence of each paragraph sent by the segmentation unit;
the theme paragraph determining unit is configured to take, according to the calculation result of the confidence calculation unit, each paragraph whose confidence satisfies a preset confidence requirement as a page theme paragraph.
The candidate paragraphs obtained by the paragraph acquiring unit comprise at least one of the following:
the page title paragraph labeled title, the page heading line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
Specifically, if the segmentation unit determines that a candidate paragraph contains a symbol of a preset type, it determines that the candidate paragraph can be further segmented and segments it using the symbols of the preset type as separators.
The symbols of the preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
Specifically, the confidence calculation unit may comprise: a first word segmentation subunit, a first calculation subunit, and a second calculation subunit;
the first word segmentation subunit is configured to perform word segmentation on each paragraph sent by the segmentation unit;
the first calculation subunit is configured to calculate, according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$, the confidence of each word obtained by the first word segmentation subunit, where $D_{ij}$ is the confidence of the j-th word obtained from word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs in all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
the second calculation subunit is configured to obtain the confidence of each paragraph from the confidences of the words it contains.
The second calculation subunit calculates the confidence $D_i$ of the i-th paragraph according to
$$D_i = \frac{1}{N} \sum_{j=1}^{N} D_{ij}$$
where N is the number of words obtained from word segmentation of the i-th paragraph.
Preferably, the apparatus further comprises a first filtering unit, configured to filter out, according to a preset site dictionary, any paragraph sent by the segmentation unit in which the content that appears in the site dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold.
The confidence requirement comprises: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks in the top N among the paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs; where N is a preset positive integer.
Further, the apparatus also comprises a topic word extraction unit;
the topic word extraction unit specifically comprises: a second word segmentation subunit, a part-of-speech tagging subunit, a filtering subunit, and a topic word determining subunit;
the second word segmentation subunit is configured to perform word segmentation on the page theme paragraph;
the part-of-speech tagging subunit is configured to perform part-of-speech tagging on each word obtained from word segmentation and send the tagged words to the filtering subunit;
the filtering subunit is configured to perform at least one of the following filtering operations on the words obtained from word segmentation:
filtering out the words contained in a preset stop-word list;
filtering out the words that cannot express meaning independently;
if the words include words that are in a hypernym-hyponym relation with each other, filtering out the hypernym; and
filtering out page-type attribute words;
the topic word determining subunit is configured to determine the words that remain after the filtering by the filtering subunit as the topic words of the page.
If the filtering subunit determines that the page is of a preset page type, it filters out the type attribute words of the page from the words obtained from word segmentation; where the preset page types comprise: video, novel, audio, game, or forum.
As can be seen from the above technical solution, after obtaining the candidate paragraphs, the present invention segments any candidate paragraph that can be further segmented, then calculates the confidence of each resulting paragraph and selects the paragraphs that satisfy the confidence requirement as page theme paragraphs. Further splitting the candidate paragraphs and selecting page theme paragraphs by confidence determines the page theme paragraph more accurately, that is, it reduces the deviation between the extracted page theme and the actual page theme. When the extracted page theme paragraph is used in page search ranking, user needs can be met more accurately; when it is used to determine page topic words, the topic words reflect the page theme more accurately.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the method for extracting a page theme provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for calculating the confidence of each paragraph provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the method for extracting page topic words provided by Embodiment 3 of the present invention;
Fig. 4 is a structural diagram of the apparatus for extracting a page theme provided by Embodiment 4 of the present invention.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flowchart of the method for extracting a page theme provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: obtain the candidate paragraphs in the page that express the page theme.
In this step, the candidate paragraphs that express the page theme are those paragraphs that may reflect the page theme, and may specifically include, but are not limited to, at least one of the following:
the page title paragraph labeled title, the page heading line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
For example, for the page http://www.22zw.cn/XH/91H53969KX/, the four paragraphs above are obtained as follows:
The page title paragraph labeled title, with content: Doupo Cangqiong latest chapters, Doupo Cangqiong quick reading, Tiancan Tudou, 22 Chinese Net.
The page heading line labeled realtitle, with content: Doupo Cangqiong.
The navigation paragraph labeled mypos, which has no corresponding content in this page.
The preceding link labeled preanchor, with content: Doupo Cangqiong latest chapters.
Step 102: segment those of the obtained candidate paragraphs that can be further segmented.
This step is optional: if none of the candidate paragraphs can be further segmented, this step is not performed.
To determine whether a candidate paragraph can be further segmented, it can be checked whether the candidate paragraph contains a symbol of a preset type. If it does, the candidate paragraph is considered further segmentable; otherwise, it is considered not further segmentable. Correspondingly, when a candidate paragraph is segmented, the segmentation strategy can be to split it using the symbols of the preset type as separators.
The symbols of the preset type may include, but are not limited to: punctuation marks, spaces, underscores, slashes, and brackets.
For example, for the page title paragraph labeled title, segmenting it with the symbols of the preset type as separators yields the following four paragraphs:
Title paragraph 1: Doupo Cangqiong latest chapters
Title paragraph 2: Doupo Cangqiong quick reading
Title paragraph 3: Tiancan Tudou
Title paragraph 4: 22 Chinese Net
None of the other candidate paragraphs can be further segmented.
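The check and split described in step 102 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function names, the concrete separator set, and the sample title string (including its underscore separators) are hypothetical. Spaces are left out of the separator set only because the romanized sample text contains spaces inside each segment.

```python
import re
from typing import List

# Hypothetical separator set: underscore, slashes, brackets and common punctuation.
SEPARATORS = re.compile(r"[_/\\|,;:!?()\[\]{}<>]+")


def is_resegmentable(paragraph: str) -> bool:
    """A candidate paragraph can be further segmented if it contains a separator."""
    return SEPARATORS.search(paragraph) is not None


def resegment(paragraph: str) -> List[str]:
    """Split on the separators; paragraphs without separators are returned unchanged."""
    if not is_resegmentable(paragraph):
        return [paragraph]
    parts = [part.strip() for part in SEPARATORS.split(paragraph)]
    return [part for part in parts if part]  # drop empty pieces


# Illustrative title paragraph (assumed separators).
title = ("Doupo Cangqiong latest chapters_Doupo Cangqiong quick reading"
         "_Tiancan Tudou_22 Chinese Net")
segments = resegment(title)
```

A paragraph such as the realtitle line, which contains no separator, passes through unchanged and is scored as a single paragraph in the next step.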
Step 103: calculate the confidence of each paragraph obtained after step 102.
If a candidate paragraph was segmented, the confidence of each paragraph obtained from segmenting it is calculated; if a candidate paragraph was not segmented, the confidence of the candidate paragraph itself is calculated.
The method for calculating the confidence of each paragraph is described in detail in Embodiment 2.
Before step 103 or step 104 is executed, a filtering step may also be included, in which paragraphs related to the website are filtered out. This can be implemented with a preset site dictionary that contains various site names: if the content of a paragraph that appears in the site dictionary accounts for a proportion of the paragraph length that reaches a preset proportion threshold, the paragraph is filtered out. For example, the content of title paragraph 4, "22 Chinese Net", is a site name; this site name can be placed in the site dictionary in advance, so that title paragraph 4 is filtered out before step 104 is executed.
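The site-dictionary filter can be illustrated with a short Python sketch. The function names, the character-level way of measuring "proportion of the paragraph length", and the threshold value 0.8 are assumptions made for illustration; the text above only says that a preset proportion threshold is used.

```python
from typing import Iterable, List, Set


def site_ratio(paragraph: str, site_dictionary: Set[str]) -> float:
    """Fraction of the paragraph's characters covered by site-dictionary entries."""
    if not paragraph:
        return 0.0
    covered = [False] * len(paragraph)
    for name in site_dictionary:
        if not name:
            continue
        start = paragraph.find(name)
        while start != -1:
            for k in range(start, start + len(name)):
                covered[k] = True
            start = paragraph.find(name, start + 1)
    return sum(covered) / len(paragraph)


def filter_site_paragraphs(paragraphs: Iterable[str],
                           site_dictionary: Set[str],
                           ratio_threshold: float = 0.8) -> List[str]:
    """Keep only paragraphs whose site-name ratio stays below the threshold."""
    return [p for p in paragraphs
            if site_ratio(p, site_dictionary) < ratio_threshold]


site_dictionary = {"22 Chinese Net"}  # hypothetical dictionary content
paragraphs = [
    "Doupo Cangqiong latest chapters",
    "Doupo Cangqiong quick reading",
    "Tiancan Tudou",
    "22 Chinese Net",
]
kept = filter_site_paragraphs(paragraphs, site_dictionary)
```

Measuring coverage per character keeps overlapping dictionary hits from being counted twice; a token-based ratio would be an equally plausible reading.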
Step 104: take each paragraph whose confidence satisfies the preset confidence requirement as a page theme paragraph (maintitle).
The preset confidence requirement can be: the confidence of the paragraph reaches a preset confidence threshold; or the confidence of the paragraph ranks in the top N among the paragraphs; or the confidence of the paragraph reaches the preset confidence threshold and ranks in the top N among the paragraphs. N is a preset positive integer.
Suppose that, after the filtering, the confidences of the paragraphs are as follows:
Title paragraph 1: "Doupo Cangqiong latest chapters", confidence 0.9
Title paragraph 2: "Doupo Cangqiong quick reading", confidence 0.7
Title paragraph 3: "Tiancan Tudou", confidence 0.3
Page heading line labeled realtitle: "Doupo Cangqiong", confidence 1.0
Preceding link labeled preanchor: "Doupo Cangqiong latest chapters", confidence 0.9
The paragraph with the highest confidence can be selected as the maintitle, that is, "Doupo Cangqiong". Alternatively, to let different descriptions complement one another as the page theme, several paragraphs can be selected as parallel maintitles; for example, selecting the paragraphs whose confidence is 0.9 or above and ranks in the top 2 gives "Doupo Cangqiong" and "Doupo Cangqiong latest chapters" as the maintitles.
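The three forms of the confidence requirement (threshold, top N, or both) can be sketched in Python as below. The function name and parameters are hypothetical, and the sketch assumes that paragraphs with identical text are collapsed into one entry, since they would receive the same confidence.

```python
from typing import Dict, List, Optional


def select_maintitles(confidences: Dict[str, float],
                      threshold: Optional[float] = None,
                      top_n: Optional[int] = None) -> List[str]:
    """Return the paragraphs that satisfy the configured confidence requirement."""
    # Sort by confidence, highest first (stable for ties).
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]                     # "ranks in the top N"
    if threshold is not None:
        ranked = [(text, conf) for text, conf in ranked
                  if conf >= threshold]             # "reaches the threshold"
    return [text for text, _ in ranked]


# Confidence values taken from the example above.
confidences = {
    "Doupo Cangqiong latest chapters": 0.9,
    "Doupo Cangqiong quick reading": 0.7,
    "Tiancan Tudou": 0.3,
    "Doupo Cangqiong": 1.0,
}
best = select_maintitles(confidences, top_n=1)
parallel = select_maintitles(confidences, threshold=0.9, top_n=2)
```

With these values, `top_n=1` yields the single highest-confidence paragraph, and combining `threshold=0.9` with `top_n=2` yields the two parallel maintitles from the example.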
Embodiment 2
Fig. 2 is a flowchart of the method for calculating the confidence of each paragraph provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method may comprise the following steps:
Step 201: perform word segmentation on each paragraph.
Preferably, the words obtained from word segmentation can also be filtered based on a preset stop-word list. The stop-word list contains words that occur very frequently in ordinary web pages, and may include, but is not limited to: adverbs, function words, modal particles, auxiliary words, and pronouns; such words usually have very little ability to convey meaning.
Step 202: calculate the confidence of each word obtained from word segmentation according to the formula $D_{ij} = \alpha S_{ij} + \beta P_{ij}$.
Here, $D_{ij}$ is the confidence of the j-th word obtained from word segmentation of the i-th paragraph, $S_{ij}$ is the frequency with which that word occurs in all of the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients.
If $\alpha$ is nonzero, the paragraphs obtained after step 102 of Embodiment 1 are used to cross-validate one another to obtain $S_{ij}$; that is, the frequency with which each word occurs in all the paragraphs obtained after step 102 is counted. The higher this frequency, the higher the confidence of the word.
If $\beta$ is nonzero, the ability of the word to convey meaning within the page is used to obtain $P_{ij}$; that is, the frequency with which the word occurs in the page is counted. The higher this frequency, the higher the confidence of the word.
Step 203: obtain the confidence of each paragraph from the confidences of the words it contains.
The confidence of a paragraph can be obtained by averaging the confidences of the words it contains; that is, the confidence $D_i$ of the i-th paragraph can be:
$$D_i = \frac{1}{N} \sum_{j=1}^{N} D_{ij}$$
where N is the number of words obtained from word segmentation of the i-th paragraph.
Taking the example of Embodiment 1 again, word segmentation of title paragraph 1 yields the words "Doupo Cangqiong", "latest", and "chapters". When filtering is performed based on the preset stop-word list, none of these words is in the list, so the words after filtering are still "Doupo Cangqiong", "latest", and "chapters". The confidence of each word is calculated according to the formula in step 202; because "Doupo Cangqiong" occurs very frequently across the paragraphs and also very frequently in the web page, it has a high confidence. The confidences of the words are then summed and averaged to obtain the confidence of title paragraph 1.
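Steps 202 and 203 can be sketched in Python as follows. The sketch assumes that "frequency" means a raw occurrence count, that the paragraphs and the page are already segmented into words, and that both weights are 0.5; the function name, the weights, and the sample data are hypothetical. The text does not say how the result is scaled, so the values here are not normalized to the 0 to 1 range used in the example of Embodiment 1.

```python
from collections import Counter
from typing import List


def paragraph_confidences(paragraph_words: List[List[str]],
                          page_words: List[str],
                          alpha: float = 0.5,
                          beta: float = 0.5) -> List[float]:
    """D_ij = alpha * S_ij + beta * P_ij per word; D_i is the mean of D_ij."""
    # S: occurrences of each word across all candidate paragraphs.
    s_counts = Counter(w for words in paragraph_words for w in words)
    # P: occurrences of each word in the whole page.
    p_counts = Counter(page_words)
    result = []
    for words in paragraph_words:
        if not words:
            result.append(0.0)  # assumption: an empty paragraph scores zero
            continue
        word_conf = [alpha * s_counts[w] + beta * p_counts[w] for w in words]
        result.append(sum(word_conf) / len(word_conf))
    return result


# Hypothetical segmented paragraphs and page content.
paragraph_words = [
    ["Doupo Cangqiong", "latest", "chapters"],
    ["Doupo Cangqiong"],
    ["Tiancan Tudou"],
]
page_words = ["Doupo Cangqiong"] * 5 + ["latest", "chapters", "Tiancan Tudou"]
scores = paragraph_confidences(paragraph_words, page_words)
```

Because "Doupo Cangqiong" recurs both across paragraphs and in the page, the paragraph consisting only of that word scores highest, which mirrors the ordering in the example of Embodiment 1.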
This concludes the flow of Embodiment 2.
After the maintitle is determined using the flow of Embodiment 1, it can be used for page search ranking. That is, when the index of the page is built, the words that belong to the maintitle are marked in the index. When a search is performed and the words of the query are matched against the index, a page whose matched index words are marked as belonging to the maintitle is given a higher ranking weight in the search results.
In addition, the maintitle determined by the flow of Embodiment 1 can also be used to extract page topic words (keywords); this process is described below through Embodiment 3.
Embodiment 3
Fig. 3 is a flowchart of the method for extracting page topic words provided by Embodiment 3 of the present invention. As shown in Fig. 3, the method may comprise the following steps:
Step 301: perform word segmentation on the maintitle determined in Embodiment 1.
If only one maintitle was determined for the page, the flow of Embodiment 3 is executed for that maintitle only; if several maintitles were determined, the flow of Embodiment 3 is executed for each maintitle separately.
Step 302: perform part-of-speech tagging on each word obtained from word segmentation.
Step 303: filter the words obtained from word segmentation based on a preset stop-word list.
This step filters out, from the words obtained from word segmentation, the words contained in the stop-word list. The stop-word list contains words that occur very frequently in web pages, and may include, but is not limited to: adverbs, function words, modal particles, auxiliary words, and pronouns.
Step 304: filter out, from the words obtained from word segmentation, the words that cannot express meaning independently.
Here, a word that cannot express meaning independently can be identified from the probability that the word and its context form a single word. If the probability that a word and its adjacent word form a single word exceeds a preset threshold, the word is determined to be a word that cannot express meaning independently, and it should form a single word together with its adjacent word.
If full-granularity word segmentation was used in step 301, step 304 is executed. If the word segmentation in step 301 already took such words into account and directly combined them into single words, so that every word obtained can express meaning independently, this step is not executed.
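The test in step 304 can be sketched as a lookup of adjacent word pairs in a probability table. How the probability is estimated is not specified in the text, so the table, its values, the threshold of 0.5, and the function name below are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def non_independent_words(words: List[str],
                          join_probability: Dict[Tuple[str, str], float],
                          threshold: float = 0.5) -> Set[str]:
    """Flag adjacent words whose probability of forming one word exceeds the threshold."""
    flagged: Set[str] = set()
    for left, right in zip(words, words[1:]):
        if join_probability.get((left, right), 0.0) > threshold:
            flagged.update((left, right))
    return flagged


# Hypothetical full-granularity segmentation and probability table.
words = ["real estate", "three", "swordsmen", "video"]
join_probability = {("three", "swordsmen"): 0.9}
flagged = non_independent_words(words, join_probability)
```

Words flagged this way would be dropped in step 304, leaving the combined form produced by the full-granularity segmentation to carry the meaning.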
Step 305: analyze the words obtained from word segmentation for hypernym-hyponym relations; if there are words in a hypernym-hyponym relation with each other, filter out the hypernym.
The analysis is based on a predefined hypernym-hyponym vocabulary, which contains the hypernym-hyponym relations between various words.
If the words obtained from word segmentation include words in a hypernym-hyponym relation with each other, the hypernym can be filtered out, because a hypernym conveys meaning less strongly than its hyponym and the hyponym usually covers the meaning of the hypernym.
For example, if the words obtained from word segmentation of a query include both "Guangdong" and "Guangzhou", where "Guangdong" is a hypernym of "Guangzhou", the hypernym "Guangdong" can be filtered out and the word "Guangzhou" retained.
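The hypernym filter of step 305 can be sketched in Python with a small lookup table. The table layout (hyponym mapped to its set of hypernyms), the function name, and the extra sample word "weather" are assumptions for illustration.

```python
from typing import Dict, List, Set


def filter_hypernyms(words: List[str],
                     hypernyms_of: Dict[str, Set[str]]) -> List[str]:
    """Drop a word when one of its hyponyms is also present in the word list."""
    present = set(words)
    to_drop: Set[str] = set()
    for word in words:
        # Any hypernym of this word that also occurs in the list is removed.
        to_drop |= hypernyms_of.get(word, set()) & present
    return [word for word in words if word not in to_drop]


# Hypothetical hypernym-hyponym vocabulary.
hypernyms_of = {"Guangzhou": {"Guangdong"}}
filtered = filter_hypernyms(["Guangdong", "Guangzhou", "weather"], hypernyms_of)
```

A hypernym that appears without any of its hyponyms is kept, since nothing more specific covers its meaning.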
Step 306: filter out page-type attribute words from the words obtained from word segmentation.
If the page is of a preset page type, the type attribute words of the page are filtered out; if the page is not of a preset page type, the filtering of this step is not performed. The preset page types may include, but are not limited to: video, novel, audio, game, and forum.
For example, if the page is of the video type, that is, the content the page provides is a video, and the words obtained from word segmentation of the maintitle include "video", the word "video" contributes nothing to the theme of the page and is therefore filtered out. If the page is a blog page, the word "video" is meaningful for the theme of the page and is not filtered out.
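The page-type filter of step 306 can be sketched as below. The mapping from page type to attribute words, the type names used as keys, and the function name are hypothetical; the text only lists the preset page types and gives "video" as an attribute word of video pages.

```python
from typing import Dict, List, Optional, Set

# Hypothetical attribute words for each preset page type.
TYPE_ATTRIBUTE_WORDS: Dict[str, Set[str]] = {
    "video": {"video"},
    "novel": {"novel"},
    "audio": {"audio"},
    "game": {"game"},
    "forum": {"forum"},
}


def filter_type_attribute_words(words: List[str],
                                page_type: Optional[str]) -> List[str]:
    """Remove attribute words only when the page has one of the preset types."""
    attribute_words = TYPE_ATTRIBUTE_WORDS.get(page_type) if page_type else None
    if attribute_words is None:
        return list(words)  # not a preset type: nothing is filtered
    return [word for word in words if word not in attribute_words]


words = ["real estate", "video", "Three Musketeers"]
on_video_page = filter_type_attribute_words(words, "video")
on_blog_page = filter_type_attribute_words(words, "blog")
```

On a blog page the word "video" survives, matching the example in which it remains meaningful for the theme of the page.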
It should be noted that any one of steps 303, 304, 305, and 306 can be executed alone, or they can be executed in any combination. If they are executed in combination, they can be executed in any order.
Step 307: determine the words that remain after the above filtering of the words obtained from word segmentation as the keywords of the page.
An example of the flow of Embodiment 3 is given below. Suppose the maintitle is: "Today watched the video of the real estate Three Musketeers."
If full-granularity word segmentation is performed on this maintitle, the following words are obtained: "today", "watched", "了", "real estate", "three", "swordsmen", "的", "video", "Three Musketeers". After part-of-speech tagging: "today" is a noun, "watched" is a verb, "了" is an auxiliary word, "real estate" is a noun, "three" is a numeral, "的" is an auxiliary word, "video" is a noun, and "Three Musketeers" is a noun.
Filtering based on the stop-word list removes "了", "的", "watched", and "today".
The words that cannot express meaning independently, "three" and "swordsmen", are filtered out.
If the page to which this maintitle belongs is a content page that is not of a preset page type, page-type attribute words are not filtered from this maintitle.
The keywords finally obtained for the page are: "real estate", "video", "Three Musketeers".
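The worked example can be replayed with a small Python sketch that treats each filtering step as a set of words to drop. That representation, the function name, and the English renderings of the segmented words are assumptions; the sketch is meant to show that filters of this kind can be combined in any order with the same result.

```python
from typing import Iterable, List, Set


def apply_filters(words: List[str], drop_sets: Iterable[Set[str]]) -> List[str]:
    """Apply each filter in turn; every filter is a set of words to remove."""
    kept = list(words)
    for drop in drop_sets:
        kept = [word for word in kept if word not in drop]
    return kept


# Full-granularity segmentation of the example maintitle (illustrative rendering).
words = ["today", "watched", "了", "real estate", "three",
         "swordsmen", "的", "video", "Three Musketeers"]
stop_words = {"today", "watched", "了", "的"}   # step 303
non_independent = {"three", "swordsmen"}        # step 304

keywords = apply_filters(words, [stop_words, non_independent])
same_in_other_order = apply_filters(words, [non_independent, stop_words])
```

If the page were of the video type, adding the set `{"video"}` as a third filter would leave "real estate" and "Three Musketeers".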
After the keywords are extracted in the manner of Embodiment 3, the keywords of the page can be marked. When the pages in the search results are ranked, if the query hits a keyword of a page, the ranking weight of that page can be raised, so that the ranking of the search results better satisfies user needs and the search effect improves.
The above is a detailed description of the method provided by the present invention; the apparatus provided by the present invention is described in detail below through Embodiment 4.
Embodiment four,
Fig. 4 is a structural diagram of the device for extracting a page theme provided by Embodiment 4 of the present invention. As shown in Fig. 4, the device may comprise: a paragraph acquiring unit 400, a segmentation processing unit 410, a confidence computation unit 420, and a theme paragraph determining unit 430.
The paragraph acquiring unit 400 is used to acquire the candidate paragraphs in the page that express the page theme and to provide them to the segmentation processing unit 410.
The segmentation processing unit 410 is used to send candidate paragraphs that cannot be further segmented to the confidence computation unit 420, and to segment candidate paragraphs that can be further segmented and then send the resulting paragraphs to the confidence computation unit 420.
The confidence computation unit 420 is used to calculate the confidence of each paragraph sent by the segmentation processing unit 410.
The theme paragraph determining unit 430 is used to take, according to the calculation result of the confidence computation unit 420, the paragraph whose confidence satisfies the preset confidence requirement as the main title.
The preset confidence requirement may comprise: the confidence of the paragraph reaches a preset confidence threshold; or the confidence of the paragraph ranks among the top N of all paragraphs; or the confidence of the paragraph reaches the preset confidence threshold and ranks among the top N of all paragraphs; where N is a preset positive integer.
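The three forms of the confidence requirement can be sketched as follows. This is a minimal illustration; the function name `select_paragraphs` and the sample scores are assumptions introduced here.

```python
def select_paragraphs(confidences, threshold=None, top_n=None):
    """Select paragraphs by threshold, by top-N rank, or by both.

    confidences: dict mapping paragraph text to its confidence score.
    Passing only `threshold` gives the first requirement, only `top_n`
    the second, and both together the third (threshold AND top N).
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]                                   # keep the N best
    if threshold is not None:
        ranked = [(p, c) for p, c in ranked if c >= threshold]    # keep those reaching the threshold
    return [p for p, _ in ranked]


scores = {"a": 0.9, "b": 0.5, "c": 0.7}          # hypothetical confidences
print(select_paragraphs(scores, threshold=0.6))           # ['a', 'c']
print(select_paragraphs(scores, top_n=1))                 # ['a']
print(select_paragraphs(scores, threshold=0.8, top_n=2))  # ['a']
```

Because more than one paragraph can satisfy the requirement, the result is a list, which matches the idea of extracting several parallel main titles.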
The candidate paragraphs acquired by the paragraph acquiring unit 400 may comprise at least one of the following:
the page title paragraph labeled title, the page title line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
Specifically, if the segmentation processing unit 410 determines that there is a candidate paragraph containing a symbol of a preset type, it determines that this candidate paragraph can be further segmented and segments it using the symbols of the preset type as separators. A candidate paragraph that contains no symbol of a preset type is determined to be a candidate paragraph that cannot be further segmented.
The symbols of the above preset types may include, but are not limited to: punctuation marks, spaces, underscores, slashes, or brackets.
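A minimal sketch of this separator-based segmentation is given below. The exact separator set, the names `SEPARATORS`, `can_resegment`, and `resegment`, and the sample title are assumptions; the patent only names the symbol categories.

```python
import re

# Assumed separator set: whitespace, underscore, slashes, brackets, and some
# common ASCII punctuation. A real system would likely add full-width symbols.
SEPARATORS = re.compile(r"[\s_/\\|,;:!?\-()\[\]{}<>]+")


def can_resegment(paragraph):
    """A candidate paragraph can be segmented again if it contains a separator."""
    return SEPARATORS.search(paragraph) is not None


def resegment(paragraph):
    """Split on the preset symbols; return the paragraph unchanged if none occur."""
    if not can_resegment(paragraph):
        return [paragraph]
    return [piece for piece in SEPARATORS.split(paragraph) if piece]


# Hypothetical page title with a site name appended after a hyphen.
print(resegment("RealEstateMusketeers_Video-ExampleSite"))
# ['RealEstateMusketeers', 'Video', 'ExampleSite']
```

Empty pieces are dropped so that a leading or trailing separator does not produce an empty paragraph.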
The above confidence computation unit 420 may specifically comprise: a first word segmentation subunit 421, a first computation subunit 422, and a second computation subunit 423.
The first word segmentation subunit 421 is used to perform word segmentation on each paragraph sent by the segmentation processing unit 410.
The first computation subunit 422 is used to calculate, according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$, the confidence of each word obtained after word segmentation by the first word segmentation subunit 421, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across all the paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients.
The second computation subunit 423 is used to obtain the confidence of each paragraph from the confidences of the words contained in that paragraph.
The second computation subunit 423 may calculate the confidence $D_i$ of the i-th paragraph according to
Figure BDA0000053339760000131
where N is the number of words obtained after word segmentation of the i-th paragraph.
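The two-level confidence calculation can be sketched as below. The word-level formula follows the text; treating "frequency" as a raw occurrence count and aggregating word confidences into a paragraph confidence by taking their mean over N are assumptions, because the paragraph-level formula survives only as an image reference. The function names and sample data are also illustrative.

```python
from collections import Counter


def word_confidences(paragraph_words, page_words, alpha=0.5, beta=0.5):
    """Compute D_ij = alpha * S_ij + beta * P_ij for every word of every paragraph.

    paragraph_words: list of word lists, one per paragraph (after segmentation).
    page_words: list of all words of the page.
    ASSUMPTION: frequencies are raw occurrence counts.
    """
    s_counts = Counter(w for words in paragraph_words for w in words)  # S: across all paragraphs
    p_counts = Counter(page_words)                                     # P: in the whole page
    return [[alpha * s_counts[w] + beta * p_counts[w] for w in words]
            for words in paragraph_words]


def paragraph_confidences(word_conf):
    """ASSUMPTION: D_i is the mean of the D_ij over the N words of paragraph i."""
    return [sum(ds) / len(ds) if ds else 0.0 for ds in word_conf]


paragraphs = [["real estate", "video"], ["video", "site"]]   # hypothetical segmented paragraphs
page = ["real estate", "video", "video", "news"]             # hypothetical page words
d_words = word_confidences(paragraphs, page)
d_paragraphs = paragraph_confidences(d_words)
print(d_words)       # [[1.0, 2.0], [2.0, 0.5]]
print(d_paragraphs)  # [1.5, 1.25]
```

A word that recurs across candidate paragraphs and in the page body pulls up the confidence of every paragraph that contains it, which is why the first paragraph scores higher here.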
Further, the device may also comprise a first filtering unit 440, which is used to filter out, according to a preset website dictionary, any paragraph among the paragraphs sent by the segmentation processing unit 410 to the confidence computation unit 420 in which the proportion of the paragraph length occupied by content appearing in the website dictionary reaches a preset proportion threshold.
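The website-dictionary filter can be sketched as follows. Measuring the proportion in characters, the default threshold of 0.5, and the names `site_ratio` and `filter_site_paragraphs` are assumptions made for illustration.

```python
def site_ratio(paragraph, site_terms):
    """Fraction of the paragraph's characters covered by website-dictionary terms."""
    if not paragraph:
        return 0.0
    covered = [False] * len(paragraph)
    for term in site_terms:
        if not term:
            continue
        start = paragraph.find(term)
        while start != -1:
            for k in range(start, start + len(term)):
                covered[k] = True          # mark every character of this occurrence
            start = paragraph.find(term, start + 1)
    return sum(covered) / len(paragraph)


def filter_site_paragraphs(paragraphs, site_terms, ratio_threshold=0.5):
    """Drop paragraphs whose dictionary coverage reaches the threshold."""
    return [p for p in paragraphs if site_ratio(p, site_terms) < ratio_threshold]


site_terms = {"ExampleSite"}                                   # hypothetical website dictionary
candidates = ["ExampleSite", "Real estate video", "ExampleSite home"]
print(filter_site_paragraphs(candidates, site_terms))          # ['Real estate video']
```

Marking covered characters, rather than summing term lengths, keeps overlapping dictionary terms from being counted twice.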
After the main title is determined with the above device, it can be used for ranking in page search. That is, when the index of a page is built, the words belonging to the main title are marked in the index. At search time, each word in the query is matched against the index, and the ranking weight in the search results is raised for the page corresponding to an index entry in which the matched word is marked as belonging to the main title.
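A minimal sketch of this index marking and weight boosting is shown below. The index layout (a per-page map from word to a main-title flag), the multiplicative `boost` factor, and the name `rank_pages` are assumptions; the text only says that the ranking weight is raised.

```python
def rank_pages(query_words, index, base_scores, boost=1.5):
    """Rank pages, boosting those where a query word is marked as a main-title word.

    index: dict page_id -> dict word -> bool (True if the word belongs to the main title).
    base_scores: dict page_id -> relevance score before boosting.
    ASSUMPTION: the boost is a single multiplicative factor.
    """
    scored = {}
    for page_id, postings in index.items():
        score = base_scores[page_id]
        if any(postings.get(w, False) for w in query_words):
            score *= boost
        scored[page_id] = score
    return sorted(scored, key=scored.get, reverse=True)


index = {
    "p1": {"video": True, "news": False},   # "video" is marked as a main-title word of p1
    "p2": {"video": False},
}
base_scores = {"p1": 1.0, "p2": 1.2}        # hypothetical base relevance
print(rank_pages(["video"], index, base_scores))  # ['p1', 'p2']
print(rank_pages(["news"], index, base_scores))   # ['p2', 'p1']
```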
In addition, the main title determined by the above device can also be used to extract keywords. In this case, the device may further comprise a descriptor extraction unit 450.
The descriptor extraction unit 450 may specifically comprise: a second word segmentation subunit 451, a part-of-speech tagging subunit 452, a filtering subunit 453, and a descriptor determining subunit 454.
The second word segmentation subunit 451 is used to perform word segmentation on the main title determined by the theme paragraph determining unit 430.
The part-of-speech tagging subunit 452 is used to perform part-of-speech tagging on each word obtained after word segmentation and then send the words to the filtering subunit 453.
The filtering subunit 453 is used to perform at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relationship exist among the words obtained after word segmentation, filtering out the hypernym from those words; and
filtering out page-type attribute words from the words obtained after word segmentation.
The descriptor determining subunit 454 is used to determine the words remaining after the filtering by the filtering subunit 453 as the keywords of the page.
If the filtering subunit 453 determines that the page is of a preset page type, it may filter out the type attribute words of the page from the words obtained after word segmentation. The preset page types comprise: video type, novel type, audio type, game type, or forum type.
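The hypernym filter and the page-type attribute filter can be sketched as below. The hypernym table, the attribute-word lists per page type, and the function names are hypothetical; the patent names the page types but not their attribute words.

```python
def filter_hypernyms(words, hypernym_of):
    """Remove any word that is the hypernym of another word in the same list.

    hypernym_of: dict mapping a hyponym to its hypernym (assumed lookup table).
    """
    present = set(words)
    hypernyms = {hypernym_of[w] for w in present
                 if w in hypernym_of and hypernym_of[w] in present}
    return [w for w in words if w not in hypernyms]


# Hypothetical attribute words for some of the preset page types.
PAGE_TYPE_WORDS = {
    "video": {"video", "watch online"},
    "novel": {"novel", "chapter"},
    "forum": {"forum", "post"},
}


def filter_page_type_words(words, page_type, page_type_words=PAGE_TYPE_WORDS):
    """Remove attribute words only when the page is of a preset page type."""
    attribute_words = page_type_words.get(page_type, set())
    return [w for w in words if w not in attribute_words]


words = ["fruit", "apple", "video"]
no_hypernyms = filter_hypernyms(words, {"apple": "fruit"})
print(no_hypernyms)                                     # ['apple', 'video']
print(filter_page_type_words(no_hypernyms, "video"))    # ['apple']
print(filter_page_type_words(no_hypernyms, "content"))  # ['apple', 'video']
```

On a page of the video type, the word "video" carries little information about the theme, so it is dropped; on a content page, which is not a preset type, it is kept.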
After keywords are extracted with the device shown in Fig. 4, they can be used to mark the keywords of the page. When the pages in the search results are ranked, if the query hits a keyword of a page, the ranking weight of that page can be raised, so that the ranking of the search results better meets the user's needs and the search effect is improved.
As can be seen from the above description, the method and device provided by the present invention have the following advantages:
1) The present invention further splits the candidate paragraphs and selects the page theme paragraph according to confidence, so that the page theme paragraph can be determined more accurately; that is, the deviation between the extracted page theme and the actual page theme is reduced.
2) When page theme paragraphs are extracted, the confidence requirement can be set flexibly, so that parallel main titles can be extracted, with the different descriptions supplementing one another as the page theme.
3) When the extracted page theme paragraph is applied to ranking in page search, the user's needs can be met more accurately and the user experience is improved.
4) When the extracted page theme paragraph is further applied to the extraction of page descriptors, the page descriptors can reflect the page theme more accurately.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (20)

1. A method for extracting a page theme, characterized in that the method comprises:
A. acquiring the candidate paragraphs in a page that express the page theme;
B. if there is a candidate paragraph that can be further segmented, segmenting that candidate paragraph; otherwise, performing step C;
C. calculating the confidence of each paragraph obtained after step B;
D. taking a paragraph whose confidence satisfies a preset confidence requirement as the page theme paragraph.
2. The method according to claim 1, characterized in that said candidate paragraphs acquired in said step A comprise at least one of the following:
the page title paragraph labeled title, the page title line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
3. The method according to claim 1, characterized in that, in said step B, if there is a candidate paragraph containing a symbol of a preset type, it is determined that this candidate paragraph can be further segmented, and this candidate paragraph is segmented using the symbols of said preset type as separators.
4. The method according to claim 3, characterized in that the symbols of said preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
5. The method according to claim 1, characterized in that said step C specifically comprises:
C1. performing word segmentation on each paragraph obtained after said step B;
C2. calculating, according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$, the confidence of each word obtained after word segmentation, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across said paragraphs, $P_{ij}$ is the frequency with which that word occurs in said page, and $\alpha$ and $\beta$ are preset weighting coefficients;
C3. obtaining the confidence of each of said paragraphs from the confidences of the words contained in that paragraph.
6. The method according to claim 5, characterized in that, in said step C3, the confidence $D_i$ of the i-th paragraph may be:
Figure FDA0000053339750000021
where N is the number of words obtained after word segmentation of the i-th paragraph.
7. The method according to claim 1, characterized in that, before said step C or said step D, the method further comprises:
filtering out, according to a preset website dictionary, any paragraph among said paragraphs in which the proportion of the paragraph length occupied by content appearing in said website dictionary reaches a preset proportion threshold.
8. The method according to claim 1, characterized in that the confidence requirement in step D comprises: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks among the top N of said paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks among the top N of said paragraphs; where N is a preset positive integer.
9. The method according to any one of claims 1 to 8, characterized in that the method further comprises performing the following steps on each said page theme paragraph:
E. performing word segmentation on said page theme paragraph;
F. performing part-of-speech tagging on each word obtained after word segmentation;
G. performing at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relationship exist among the words obtained after word segmentation, filtering out the hypernym from those words; and
filtering out page-type attribute words from the words obtained after word segmentation;
H. determining the words remaining after step G is performed on the words obtained after word segmentation as the descriptors of said page.
10. The method according to claim 9, characterized in that said filtering out of page-type attribute words from the words obtained after word segmentation comprises:
if said page is of a preset page type, filtering out the type attribute words of said page from the words obtained after word segmentation; wherein said preset page types comprise: video type, novel type, audio type, game type, or forum type.
11. A device for extracting a page theme, characterized in that the device comprises: a paragraph acquiring unit, a segmentation processing unit, a confidence computation unit, and a theme paragraph determining unit;
said paragraph acquiring unit is used to acquire the candidate paragraphs in a page that express the page theme and to provide them to said segmentation processing unit;
said segmentation processing unit is used to send candidate paragraphs that cannot be further segmented to said confidence computation unit, and to segment candidate paragraphs that can be further segmented and then send the resulting paragraphs to said confidence computation unit;
said confidence computation unit is used to calculate the confidence of each paragraph sent by said segmentation processing unit;
said theme paragraph determining unit is used to take, according to the calculation result of said confidence computation unit, a paragraph whose confidence satisfies a preset confidence requirement as the page theme paragraph.
12. The device according to claim 11, characterized in that said candidate paragraphs acquired by said paragraph acquiring unit comprise at least one of the following:
the page title paragraph labeled title, the page title line labeled realtitle, the navigation paragraph labeled mypos, and the preceding link labeled preanchor.
13. The device according to claim 11, characterized in that, if said segmentation processing unit determines that there is a candidate paragraph containing a symbol of a preset type, it determines that this candidate paragraph can be further segmented and segments it using the symbols of the preset type as separators.
14. The device according to claim 13, characterized in that the symbols of said preset type comprise: punctuation marks, spaces, underscores, slashes, or brackets.
15. The device according to claim 11, characterized in that said confidence computation unit specifically comprises: a first word segmentation subunit, a first computation subunit, and a second computation subunit;
said first word segmentation subunit is used to perform word segmentation on each paragraph sent by said segmentation processing unit;
said first computation subunit is used to calculate, according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$, the confidence of each word obtained after word segmentation by said first word segmentation subunit, where $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, $S_{ij}$ is the total frequency with which that word occurs across said paragraphs, $P_{ij}$ is the frequency with which that word occurs in the page, and $\alpha$ and $\beta$ are preset weighting coefficients;
said second computation subunit is used to obtain the confidence of each of said paragraphs from the confidences of the words contained in that paragraph.
16. The device according to claim 15, characterized in that said second computation subunit calculates the confidence $D_i$ of the i-th paragraph according to
Figure FDA0000053339750000041
where N is the number of words obtained after word segmentation of the i-th paragraph.
17. The device according to claim 11, characterized in that the device further comprises a first filtering unit, which is used to filter out, according to a preset website dictionary, any paragraph among the paragraphs sent by said segmentation processing unit in which the proportion of the paragraph length occupied by content appearing in said website dictionary reaches a preset proportion threshold.
18. The device according to claim 11, characterized in that said confidence requirement comprises: the confidence of the paragraph reaches a preset confidence threshold; or
the confidence of the paragraph ranks among the top N of said paragraphs; or
the confidence of the paragraph reaches the preset confidence threshold and ranks among the top N of said paragraphs; where N is a preset positive integer.
19. The device according to any one of claims 11 to 18, characterized in that the device further comprises a descriptor extraction unit;
said descriptor extraction unit specifically comprises: a second word segmentation subunit, a part-of-speech tagging subunit, a filtering subunit, and a descriptor determining subunit;
said second word segmentation subunit is used to perform word segmentation on said page theme paragraph;
said part-of-speech tagging subunit is used to perform part-of-speech tagging on each word obtained after word segmentation and then send the words to said filtering subunit;
said filtering subunit is used to perform at least one of the following filter operations on the words obtained after word segmentation:
filtering out, from the words obtained after word segmentation, the words contained in a preset stop-word list;
filtering out, from the words obtained after word segmentation, the words that cannot express meaning independently;
if words in a hypernym-hyponym relationship exist among the words obtained after word segmentation, filtering out the hypernym from those words; and
filtering out page-type attribute words from the words obtained after word segmentation;
said descriptor determining subunit is used to determine the words remaining after the filtering by said filtering subunit as the descriptors of said page.
20. The device according to claim 19, characterized in that, if said filtering subunit determines that said page is of a preset page type, it filters out the type attribute words of said page from the words obtained after word segmentation; wherein said preset page types comprise: video type, novel type, audio type, game type, or forum type.
CN201110080852.2A 2011-03-31 2011-03-31 Method and apparatus for extracting page theme Active CN102737017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110080852.2A CN102737017B (en) 2011-03-31 2011-03-31 Method and apparatus for extracting page theme

Publications (2)

Publication Number Publication Date
CN102737017A true CN102737017A (en) 2012-10-17
CN102737017B CN102737017B (en) 2015-03-11

Family

ID=46992542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110080852.2A Active CN102737017B (en) 2011-03-31 2011-03-31 Method and apparatus for extracting page theme

Country Status (1)

Country Link
CN (1) CN102737017B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000062194A2 (en) * 1999-04-12 2000-10-19 The Trustees Of Columbia University In The City Of New York Method and system for topical segmentation, segment significance and segment function
CN1758245A (en) * 2004-04-30 2006-04-12 微软公司 Method and system for classifying display pages using summaries
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN101315623A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN101539923A (en) * 2008-03-18 2009-09-23 北京搜狗科技发展有限公司 Method and device for extracting text segment from file
CN101667194A (en) * 2009-09-29 2010-03-10 北京大学 Automatic abstracting method and system based on user comment text feature

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103383697A (en) * 2013-06-26 2013-11-06 百度在线网络技术(北京)有限公司 Method and equipment for determining object representation information of object header
CN103383697B (en) * 2013-06-26 2017-02-15 百度在线网络技术(北京)有限公司 Method and equipment for determining object representation information of object header
CN104572927A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 Method and device extracting novel name from single page
CN104572927B (en) * 2014-12-29 2016-06-29 北京奇虎科技有限公司 A kind of method and apparatus extracting novel title from single-page
WO2017008448A1 (en) * 2015-07-14 2017-01-19 中国互联网络信息中心 Method for extracting core content of web page
CN107273391A (en) * 2016-04-08 2017-10-20 北京国双科技有限公司 Document recommends method and apparatus

Also Published As

Publication number Publication date
CN102737017B (en) 2015-03-11

Similar Documents

Publication Publication Date Title
CN101944109B (en) System and method for extracting picture abstract based on page partitioning
CN108829658B (en) Method and device for discovering new words
CN106528532B (en) Text error correction method, device and terminal
CN104598577B (en) A kind of extracting method of Web page text
CN108052500B (en) Text key information extraction method and device based on semantic analysis
US20140258283A1 (en) Computing device and file searching method using the computing device
US20090276378A1 (en) System and Method for Identifying Document Structure and Associated Metainformation and Facilitating Appropriate Processing
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN103198057A (en) Method and device for adding label onto document automatically
CN106156372B (en) A kind of classification method and device of internet site
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN101694658A (en) Method for constructing webpage crawler based on repeated removal of news
CN106294314A (en) Topics Crawling method and device
CN103226576A (en) Comment spam filtering method based on semantic similarity
CN102207961B (en) Automatic web page classification method and device
CN102262625A (en) Method and device for extracting keywords of page
CN105786793A (en) Method and device for analyzing semanteme of spoken language text information
CN104102658B (en) Content of text method for digging and device
CN101673266A (en) Method for searching audio and video contents
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN103123624A (en) Method of confirming head word, device of confirming head word, searching method and device
CN106202294A (en) The related news computational methods merged based on key word and topic model and device
CN108021667A (en) A kind of file classification method and device
CN102737017B (en) Method and apparatus for extracting page theme
CN103631963A (en) Keyword optimization processing method and device based on big data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant