CN105808561A - Method and device for extracting abstract from webpage - Google Patents
- Publication number
- CN105808561A (application CN201410843345.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- webpage
- text
- matching
- sliding window
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for extracting an abstract from a webpage. The method comprises the following steps: acquiring one or more matching words needed to extract the abstract; finding each matching word in the webpage text of the webpage; based on the positions of the matching words, extracting a plurality of text segments from the webpage text by employing a sliding window mechanism; and selecting one of the text segments as the abstract of the webpage. In the technical scheme provided by the invention, the abstract is extracted based on the matching words, so the abstract is associated with the query, and the sliding window mechanism is used to find the most suitable abstract. This overcomes the problem in the prior art that the abstract is independent of the query. When a user searches, a more intuitive, more accurate and strongly relevant search abstract is provided, so that the user can find the needed webpage quickly and effectively and the search need is met.
Description
Technical field
The present invention relates to the field of data mining technology, and in particular to a method and a device for extracting an abstract from a webpage.
Background technology
With the rapid development of Internet technology, the network has become an important channel and means for people to obtain information. The massive amount of information on the network brings convenience but also many problems: to find useful information, people often spend a great deal of time searching and browsing. Technical schemes for extracting abstracts from webpages have therefore attracted increasing attention in recent years. Such a scheme extracts an abstract of each webpage and displays it in the search results window, so that the user can see at a glance, without opening the webpage, whether it meets the search need.
In the prior art, most schemes for extracting an abstract from a webpage generate the search abstract statically: the abstract is independent of the query, and some text is extracted from the webpage content in advance, at the preprocessing stage, according to certain rules. For example, the first 160 bytes of the page text (corresponding to 80 Chinese characters) are taken, or the first sentence of each paragraph is concatenated. The abstract formed in this way is stored in the query subsystem; once the document of the relevant webpage is selected as matching the query terms, the pre-stored abstract is shown to the user. This approach is clearly the easiest for the search engine, since no further processing is needed. Its biggest drawback is that the abstract provided is unrelated to the search word entered by the user and so does not meet the user's search need.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and a device for extracting an abstract from a webpage that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for extracting an abstract from a webpage is provided. The method includes:
Obtaining one or more matching words needed to extract the abstract;
Finding each matching word in the webpage text of the webpage;
Based on the position of each matching word, extracting multiple text segments from the webpage text of the webpage using a sliding window mechanism;
Selecting one segment from the multiple text segments as the abstract of the webpage.
Optionally, extracting multiple text segments from the webpage text of the webpage using a sliding window mechanism, based on the position of each matching word, includes:
The sliding window slides backward (toward the end of the text) character by character from the start of the webpage text, and each time it reaches the start position of a matching word, a piece of text of the sliding-window size is intercepted;
Wherein the size of the sliding window is adjustable within a preset range, and each intercepted text starts with a matching word and ends with a matching word.
Optionally, the method further includes:
Before each interception, correcting the position of the sliding window so that the intercepted text reads smoothly and coherently.
Optionally, correcting the position of the sliding window before each interception includes:
Starting from the matching word at the window start position, moving forward (toward the beginning of the text) and stopping on encountering a matching word; determining whether a sentence beginning exists between the matching word at the window start position and that previous matching word; if so, correcting the window start position so that the window starts at the sentence beginning, and if not, leaving the window start position unchanged; if the movement can reach the beginning of the paragraph, correcting the window start position so that the window starts at the beginning of the paragraph;
And/or, starting from the matching word at the window end position, moving backward (toward the end of the text) and stopping on encountering the next matching word or on exceeding the maximum length of the window.
Optionally, selecting one segment from the multiple text segments as the abstract of the webpage includes:
Calculating a comprehensive weight for each of the multiple text segments, and choosing the text segment with the highest comprehensive weight as the abstract of the webpage.
Optionally, the levels of matching words include at least two of the following levels, ordered from high to low:
The original search word;
A PHRASE produced by word segmentation;
Two adjacent segmented TERMs joined together;
Two non-adjacent segmented TERMs joined together;
Three consecutive Chinese characters joined together;
A TERM produced by word segmentation;
Two Chinese characters, with no stop word;
Two Chinese characters, with a stop word;
A single Chinese character;
A single character.
Optionally, obtaining one or more matching words needed to extract the abstract includes:
Deriving the one or more matching words needed to extract the abstract from the original search word, the matching words including one or more of the following:
The original search word itself;
Each word obtained by segmenting the original search word;
Synonyms and/or error-correction terms of the original search word and of each word obtained by segmenting the original search word;
Words obtained by merging two consecutive words among the words obtained by segmenting the original search word;
Words obtained by merging three consecutive words among the words obtained by segmenting the original search word;
Words obtained by merging two words separated by one word among the words obtained by segmenting the original search word;
Each character in the original search word, and each run of two or three consecutive characters in it;
The word obtained by normalizing the original search word.
Optionally, finding each matching word in the webpage text of the webpage, extracting multiple text segments from the webpage text using a sliding window mechanism based on the position of each matching word, and selecting one segment from the multiple text segments as the abstract of the webpage include:
Finding each matching word in the text of the simplified webpage corresponding to the webpage, extracting multiple text segments from the simplified webpage text using a sliding window mechanism based on the position of each matching word, and selecting one segment from the multiple text segments as the abstract of the webpage.
Optionally, the method further includes a step of obtaining the simplified webpage of the webpage, specifically:
Removing the JS and CSS code in the webpage to obtain the simplified webpage;
Or, for a webpage, classifying the HTML tags in the webpage; removing the HTML tags that belong to specified categories and retaining those that do not; for the HTML tags that do not belong to the specified categories, analyzing their attributes and retaining one or more specified attributes; and converting the content of the retained HTML tags into text content and placing it in the corresponding simplified webpage.
Optionally, the HTML tags of the specified categories include one or more of the following tags:
script tags, noscript tags, iframe tags, single (unpaired) tags, comment tags, and tags containing display:none.
According to another aspect of the present invention, a device for extracting an abstract from a webpage is provided. The device includes:
An acquiring unit, adapted to obtain one or more matching words needed to extract the abstract;
An abstract extraction unit, adapted to find each matching word in the webpage text of the webpage; adapted to extract multiple text segments from the webpage text of the webpage using a sliding window mechanism, based on the position of each matching word; and adapted to select one segment from the multiple text segments as the abstract of the webpage.
Optionally, the abstract extraction unit is adapted to slide a sliding window backward (toward the end of the text) character by character from the start of the webpage text, and to intercept a piece of text of the sliding-window size each time the window reaches the start position of a matching word;
Wherein the size of the sliding window is adjustable within a preset range, and each intercepted text starts with a matching word and ends with a matching word.
Optionally, the abstract extraction unit is adapted to correct the position of the sliding window before each interception, so that the intercepted text reads smoothly and coherently.
Optionally, the abstract extraction unit is adapted to: starting from the matching word at the window start position, move forward (toward the beginning of the text) and stop on encountering a matching word; determine whether a sentence beginning exists between the matching word at the window start position and that previous matching word; if so, correct the window start position so that the window starts at the sentence beginning, and if not, leave the window start position unchanged; if the movement can reach the beginning of the paragraph, correct the window start position so that the window starts at the beginning of the paragraph; and/or, starting from the matching word at the window end position, move backward (toward the end of the text) and stop on encountering the next matching word or on exceeding the maximum length of the window.
Optionally, the abstract extraction unit is adapted to calculate a comprehensive weight for each of the multiple text segments and to choose the text segment with the highest comprehensive weight as the abstract of the webpage.
Optionally, the levels of matching words include at least two of the following levels, ordered from high to low:
The original search word;
A PHRASE produced by word segmentation;
Two adjacent segmented TERMs joined together;
Two non-adjacent segmented TERMs joined together;
Three consecutive Chinese characters joined together;
A TERM produced by word segmentation;
Two Chinese characters, with no stop word;
Two Chinese characters, with a stop word;
A single Chinese character;
A single character.
Optionally, the acquiring unit is adapted to derive the one or more matching words needed to extract the abstract from the original search word, the matching words including one or more of the following:
The original search word itself;
Each word obtained by segmenting the original search word;
Synonyms and/or error-correction terms of the original search word and of each word obtained by segmenting the original search word;
Words obtained by merging two consecutive words among the words obtained by segmenting the original search word;
Words obtained by merging three consecutive words among the words obtained by segmenting the original search word;
Words obtained by merging two words separated by one word among the words obtained by segmenting the original search word;
Each character in the original search word, and each run of two or three consecutive characters in it;
The word obtained by normalizing the original search word.
Optionally, the abstract extraction unit is adapted to find each matching word in the text of the simplified webpage corresponding to the webpage, to extract multiple text segments from the simplified webpage text using a sliding window mechanism based on the position of each matching word, and to select one segment from the multiple text segments as the abstract of the webpage.
Optionally, the device further includes a crawling unit and a simplification unit;
The crawling unit is adapted to obtain each original webpage;
The simplification unit is adapted to simplify each original webpage to obtain each simplified webpage; specifically: removing the JS and CSS code in the webpage to obtain the simplified webpage; or, for a webpage, classifying the HTML tags in the webpage; removing the HTML tags that belong to specified categories and retaining those that do not; for the HTML tags that do not belong to the specified categories, analyzing their attributes and retaining one or more specified attributes; and converting the content of the retained HTML tags into text content and placing it in the corresponding simplified webpage.
Optionally, the HTML tags of the specified categories include one or more of the following tags:
script tags, noscript tags, iframe tags, single (unpaired) tags, comment tags, and tags containing display:none.
As can be seen from the above, the technical scheme provided by the invention describes a process of extracting an abstract from webpage text according to matching words and a sliding window mechanism, and a device that implements this process; based on this process and device, a search abstract service is provided to the user. The scheme extracts the abstract based on matching words, so that the abstract is associated with the query, and uses the sliding window mechanism to find the most suitable abstract. This overcomes the problem in the prior art that the abstract is independent of the query. When a user searches, a more intuitive, more accurate and strongly relevant search abstract is provided, so that the user can find the needed webpage quickly and effectively and the search need is met.
The above is only an overview of the technical scheme of the invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the description, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for extracting an abstract from a webpage according to an embodiment of the invention;
Fig. 2 shows a flowchart of a method for collecting simplified webpages offline according to an embodiment of the invention;
Fig. 3 shows a schematic diagram of a device for extracting an abstract from a webpage according to an embodiment of the invention;
Fig. 4 shows a schematic diagram of a device for collecting simplified webpages offline according to an embodiment of the invention;
Fig. 5A shows a schematic diagram of search results according to an embodiment of the invention;
Fig. 5B shows a schematic diagram of search results according to another embodiment of the invention.
Detailed description of the invention
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Fig. 1 shows a flowchart of a method for extracting an abstract from a webpage according to an embodiment of the invention. As shown in Fig. 1, the method includes:
Step S110: obtain one or more matching words needed to extract the abstract.
Step S120: find each matching word in the webpage text of the webpage.
Step S130: based on the position of each matching word, extract multiple text segments from the webpage text of the webpage using a sliding window mechanism.
Step S140: select one segment from the multiple text segments as the abstract of the webpage.
As can be seen, the method shown in Fig. 1 describes a process of extracting an abstract from webpage text according to matching words and a sliding window mechanism; based on this process, a search abstract service is provided to the user. The scheme extracts the abstract based on matching words, so that the abstract is associated with the query, and uses the sliding window mechanism to find the most suitable abstract. This overcomes the problem in the prior art that the abstract is independent of the query. When a user searches, a more intuitive, more accurate and strongly relevant search abstract is provided, so that the user can find the needed webpage quickly and effectively and the search need is met.
When providing a web search service, the method shown in Fig. 1 can be used to extract the abstracts of the webpages hit by a search, and the webpage titles and abstracts are presented as content on the search results page.
In one embodiment of the invention, the one or more matching words needed to extract the abstract, as obtained in step S110 of the method shown in Fig. 1, include: the search word; each word obtained by segmenting the search word; synonyms and/or error-correction terms of the search word and of each word obtained by segmenting the search word; words obtained by merging two consecutive words among the words obtained by segmenting the search word; words obtained by merging three consecutive words among the words obtained by segmenting the search word; words obtained by merging two words separated by one word among the words obtained by segmenting the search word; each character in the search word, and each run of two or three consecutive characters in it; and the word obtained by normalizing the search word.
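As an illustration only, the derivation of matching words listed above might be sketched as follows. The function name `derive_matching_words` is hypothetical, and the sketch assumes the segmented terms are supplied by some external word segmenter; synonyms, error-correction terms and normalization are left out because the source does not say how they are obtained.

```python
def derive_matching_words(search_word, terms):
    """Derive candidate matching words from a search word.

    `terms` is assumed to be the output of an external word segmenter.
    Synonyms, error-correction terms and normalization are omitted.
    """
    candidates = [search_word]                       # the search word itself
    candidates += terms                              # each segmented word
    # two consecutive segmented words merged
    candidates += [a + b for a, b in zip(terms, terms[1:])]
    # three consecutive segmented words merged
    candidates += [a + b + c for a, b, c in zip(terms, terms[1:], terms[2:])]
    # two segmented words separated by one word, merged
    candidates += [a + b for a, b in zip(terms, terms[2:])]
    # each character, and each run of two or three consecutive characters
    chars = [ch for ch in search_word if not ch.isspace()]
    candidates += chars
    candidates += ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    candidates += ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]
    # drop duplicates while keeping first-seen order
    return list(dict.fromkeys(candidates))
```

For a three-term segmentation such as `["a", "b", "c"]` of the search word `"abc"`, the result contains the skip-one merge `"ac"` alongside the terms and their consecutive merges.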
In one embodiment of the invention, extracting multiple text segments from the webpage text of the webpage using a sliding window mechanism, based on the position of each matching word, in step S130 of the method shown in Fig. 1 includes: the sliding window slides backward (toward the end of the text) character by character from the start of the webpage text, and each time it reaches the start position of a matching word, a piece of text of the sliding-window size is intercepted. The size of the sliding window is adjustable within a preset range, and each intercepted text starts with a matching word and ends with a matching word.
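A minimal sketch of this interception step is given below, under stated assumptions: the names `find_matches` and `extract_windows` are hypothetical, matches are located with plain substring search rather than whatever multi-pattern matcher an implementation would use, and when no matching word ends inside the preset size range the sketch falls back to the longest shorter segment, which the source does not specify.

```python
import re

def find_matches(text, words):
    """Return sorted (start, end) spans of every occurrence of every matching word."""
    spans = set()
    for w in words:
        if not w:
            continue
        for m in re.finditer(re.escape(w), text):
            spans.add((m.start(), m.end()))
    return sorted(spans)

def extract_windows(text, words, min_size=70, max_size=90):
    """For each match start, cut a segment that starts and ends with a matching word.

    The segment length is adjusted within [min_size, max_size]; the longest
    fitting segment is taken. Falling back to a shorter segment when nothing
    fits the range is an assumption of this sketch.
    """
    spans = find_matches(text, words)
    windows = []
    for start in sorted({s for s, _ in spans}):
        # candidate end positions: ends of matches inside the maximum window
        ends = [e for s, e in spans if s >= start and e - start <= max_size]
        if not ends:
            continue
        in_range = [e for e in ends if e - start >= min_size]
        end = max(in_range) if in_range else max(ends)
        windows.append((start, end))
    return windows
```

Each returned pair is a slice of the text, so `text[start:end]` gives the candidate segment for that window.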
So that the intercepted text reads smoothly and coherently, the position of the sliding window is corrected before each interception. This includes the following process: starting from the matching word at the window start position, move forward (toward the beginning of the text) and stop on encountering a matching word; determine whether a sentence beginning exists between the matching word at the window start position and that previous matching word; if so, correct the window start position so that the window starts at the sentence beginning, and if not, leave the window start position unchanged; if the movement can reach the beginning of the paragraph, correct the window start position so that the window starts at the beginning of the paragraph; and/or, starting from the matching word at the window end position, move backward (toward the end of the text) and stop on encountering the next matching word or on exceeding the maximum length of the window.
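The start-position correction can be sketched roughly as below. This is an assumption-laden illustration: the function name `adjust_start`, the set of sentence-ending characters, and the choice of the sentence beginning nearest to the window start are all hypothetical, and a line break is treated as a paragraph boundary.

```python
SENTENCE_END = "。！？.!?"  # assumed sentence-ending punctuation

def adjust_start(text, start, prev_match_end):
    """Move a window start back to a sentence or paragraph beginning.

    Scans from `start` toward the beginning of the text, no further than
    `prev_match_end` (the end of the previous matching word, or 0 if none).
    Returns the corrected start, or `start` unchanged if nothing is found.
    """
    i = start
    while i > prev_match_end:
        ch = text[i - 1]
        if ch == "\n":
            return i                      # paragraph beginning
        if ch in SENTENCE_END:
            j = i
            while j < start and text[j].isspace():
                j += 1                    # skip spaces after the punctuation
            return j                      # sentence beginning
        i -= 1
    if i == 0:
        return 0                          # reached the head of the text
    return start                          # stopped at the previous matching word
```

The end-position correction described in the same paragraph would follow the same pattern in the other direction, stopping at the next matching word or at the maximum window length.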
In one embodiment of the invention, selecting one segment from the multiple text segments as the abstract of the webpage in step S140 of the method shown in Fig. 1 includes: calculating a comprehensive weight for each of the text segments, and choosing the text segment with the highest comprehensive weight as the abstract of the webpage. The comprehensive weight comprises one or more of the following weights:
First weight: the longer the corresponding window, the higher the first weight.
Second weight: the nearer the corresponding window is to the front of the whole webpage text, the higher the second weight.
Third weight: the closer the first matching word in the window is to the window start position, the higher the third weight.
Fourth weight: the fewer special symbols the window contains, the higher the fourth weight.
Fifth weight: the more matching words the window contains, the higher the fifth weight; a matching word that occurs several times is counted only once here.
Sixth weight: the fewer times the matching words in the window are referenced by other windows, the higher the sixth weight.
Seventh weight: the higher the matching level and the longer the matching length of the first matching word in the window, the higher the seventh weight.
Eighth weight: the higher the matching level of the highest-level matching word in the window, the higher the eighth weight.
Ninth weight: the higher the semantic coverage of the matching words in the window, the higher the ninth weight.
Tenth weight: the fewer links the window contains, the higher the tenth weight.
Eleventh weight: the fewer paragraphs the window contains, the higher the eleventh weight.
Twelfth weight: if the window contains the overall semantics of the original search word, the twelfth weight is higher.
Thirteenth weight: the more junk information and garbled characters the window contains, the lower the thirteenth weight.
Fourteenth weight: if the window start position is at the beginning of a paragraph, the fourteenth weight is higher.
Fifteenth weight: the more similar the window content is to the title, the lower the fifteenth weight.
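The source does not state how the individual weights are computed or combined. Purely as a sketch, the selection step could look like the following, which implements only four of the fifteen weights (the first, second, fourth and fifth) and combines them with made-up coefficients; `Segment`, `comprehensive_weight` and `pick_abstract` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int   # offset of the segment in the webpage text
    text: str    # the intercepted text

def comprehensive_weight(seg, words, doc_len, max_size=90):
    """Combine a few of the listed weights; the coefficients are illustrative only."""
    # first weight: longer windows score higher
    length_w = min(len(seg.text), max_size) / max_size
    # second weight: windows nearer the front of the text score higher
    position_w = 1.0 - seg.start / max(doc_len, 1)
    # fifth weight: more distinct matching words score higher (repeats count once)
    distinct = set(words)
    distinct_w = sum(1 for w in distinct if w in seg.text) / max(len(distinct), 1)
    # fourth weight: fewer special symbols score higher; as a crude proxy this
    # counts every non-alphanumeric, non-space character, punctuation included
    specials = sum(1 for ch in seg.text if not (ch.isalnum() or ch.isspace()))
    symbol_w = 1.0 - specials / max(len(seg.text), 1)
    return 1.0 * length_w + 0.5 * position_w + 2.0 * distinct_w + 0.5 * symbol_w

def pick_abstract(segments, words, doc_len):
    """Return the segment with the highest comprehensive weight."""
    return max(segments, key=lambda s: comprehensive_weight(s, words, doc_len))
```

A weighted sum is only one possible way to form a comprehensive weight; the remaining weights would be added as further terms in the same manner.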
The matching levels of matching words include at least two of the following levels, ordered from high to low:
The original search word; a PHRASE produced by word segmentation; two adjacent segmented TERMs joined together; two non-adjacent segmented TERMs joined together; three consecutive Chinese characters joined together; a TERM produced by word segmentation; two Chinese characters, with no stop word; two Chinese characters, with a stop word; a single Chinese character; a single character.
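One way to make this ordering usable in code is an ordered enumeration, sketched below. The class name `MatchLevel`, the member names and the numeric values are assumptions; only the relative order comes from the text. The helper `best_match` picks the highest-level matching word and breaks ties by length, in the spirit of the seventh weight.

```python
from enum import IntEnum

class MatchLevel(IntEnum):
    """Matching levels; a larger value means a higher level (values are arbitrary)."""
    SINGLE_CHARACTER = 1
    SINGLE_CHINESE_CHARACTER = 2
    TWO_CHARS_WITH_STOP_WORD = 3
    TWO_CHARS_NO_STOP_WORD = 4
    TERM = 5
    THREE_CONSECUTIVE_CHARS = 6
    TWO_NONADJACENT_TERMS = 7
    TWO_ADJACENT_TERMS = 8
    PHRASE = 9
    ORIGINAL_SEARCH_WORD = 10

def best_match(matches):
    """Pick the (word, level) pair with the highest level, then the longest word."""
    return max(matches, key=lambda m: (m[1], len(m[0])))
```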
In one embodiment of the invention, after one segment is selected from the multiple text segments as the abstract of the webpage, and so that the user can see intuitively how the abstract relates to the search word, the method shown in Fig. 1 further includes: highlighting (marking in red) the matching words in the selected abstract according to a certain strategy. In one embodiment of the invention, the highlighting includes one or more of the following:
Highlighting multiple consecutive matching words as a single string; finding the matching word with the highest matching level and reducing the highlight density before and after it, so that this highest-level matching word stands out; when the overall highlight density exceeds a preset threshold, removing the highlighting of matching words starting from those with the lowest weight, until the highlight density falls below the preset threshold. The weight of a matching word is determined by one or more of four factors: the length of the matching word, its matching level, its distance from the preceding and following matching words, and whether it is a substring of another matching string.
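Two of these highlighting steps can be sketched as follows. The sketch assumes that highlight density means the fraction of highlighted characters in the abstract, which the source does not define; the names `merge_adjacent` and `thin_highlights` and the default threshold are hypothetical, and the per-word weights are taken as given.

```python
def merge_adjacent(spans):
    """Merge (start, end) spans that touch or overlap, so that consecutive
    matching words are highlighted as a single string."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def thin_highlights(abstract_len, highlights, max_density=0.3):
    """Drop the lowest-weight highlights until the density is within the threshold.

    `highlights` is a list of non-overlapping (start, end, weight) spans.
    """
    kept = sorted(highlights, key=lambda h: h[2])   # lowest weight first

    def density(spans):
        return sum(e - s for s, e, _ in spans) / max(abstract_len, 1)

    while kept and density(kept) > max_density:
        kept.pop(0)                                 # remove the lowest weight
    return sorted(kept)                             # back to text order
```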
In one embodiment of the invention, in the method shown in Fig. 1, finding each matching word in the webpage text of the webpage, extracting multiple text segments from the webpage text using a sliding window mechanism based on the position of each matching word, and selecting one segment from the multiple text segments as the abstract of the webpage include: finding each matching word in the text of the simplified webpage corresponding to the webpage, extracting multiple text segments from the simplified webpage text using a sliding window mechanism based on the position of each matching word, and selecting one segment from the multiple text segments as the abstract of the webpage.
Fig. 2 shows a flowchart of a method for collecting simplified webpages offline according to an embodiment of the invention. As shown in Fig. 2, the method describes the process of preparing simplified webpages in a specified database for later use, before the abstract of each webpage is extracted. The method includes:
Step S210: obtain each original webpage.
In this step, a web crawler is used to crawl each original webpage.
Step S220: simplify each original webpage to obtain each simplified webpage.
In one embodiment, this step includes removing the JS and CSS code from each original webpage.
Step S230: save each simplified webpage in the specified database in correspondence with its URL.
In one embodiment of the invention, the process in step S220 of the method shown in Fig. 2 of simplifying each original webpage to obtain each simplified webpage includes the following steps:
Step S221: for an original webpage, classify the HTML tags in the webpage.
Step S222: remove the HTML tags that belong to specified categories and retain the HTML tags that do not.
In this step, the HTML tags of the specified categories are tags unrelated to providing the search abstract service, and include one or more of the following: script tags, noscript tags, iframe tags, single (unpaired) tags, comment tags, and tags containing display:none. The script tag defines a client-side script, such as JavaScript, used for image manipulation, form validation and dynamic content updates; the noscript tag defines alternative content for when scripts are not executed, for browsers that recognize the script tag but cannot support the script in it; the iframe element creates an inline frame containing another document; a single tag such as the br tag inserts a simple line break; the comment tag inserts a comment in the source code; a tag containing display:none hides an element on the webpage.
Step S223: for the HTML tags that do not belong to the specified categories, analyze their attributes and retain one or more specified attributes.
In this step, the attributes of an HTML tag are specified in its start tag and express the nature and characteristics of the tag. In some embodiments, to simplify the webpage content as far as possible, most attributes are skipped and only the id attribute is retained, which makes problems easier to trace.
Step S224: convert the content of the retained HTML tags into text content and place it in the corresponding simplified webpage.
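Steps S221 to S224 might be approximated with Python's standard `html.parser` module, as in the sketch below. This is not the patented implementation: the class name `Simplifier`, the function `simplify`, the exact tag sets, the inclusion of `style` (to cover CSS code), and the decision to return retained `id` values as a separate list are all assumptions of the sketch.

```python
from html.parser import HTMLParser

# tags whose whole content is dropped (assumed set; `style` covers CSS code)
SKIPPED = {"script", "noscript", "iframe", "style"}
# single (void) tags, which have no closing tag and are dropped outright
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # retained text pieces
        self.ids = []        # retained id attribute values
        self.open_tags = []  # stack of (tag, hidden) for open elements

    def _hidden(self):
        return any(hidden for _, hidden in self.open_tags)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden = tag in SKIPPED or "display:none" in style
        self.open_tags.append((tag, hidden))
        if not self._hidden() and attrs.get("id"):
            self.ids.append(attrs["id"])   # keep only the id attribute

    def handle_endtag(self, tag):
        # close the nearest open element with this name
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        if not self._hidden() and data.strip():
            self.out.append(data.strip())
    # comments are dropped: the default handle_comment does nothing

def simplify(html):
    """Return (text, ids) for a page, dropping the specified tag categories."""
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return " ".join(parser.out), parser.ids
```

Using a library parser trades the byte-level control of a hand-written analyzer for robustness against malformed markup.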
In a specific embodiment, the process of steps S221 to S224 above, in which each original webpage is simplified to obtain each simplified webpage, can be implemented as a state machine that analyzes the HTML tags in an original webpage one by one. The state machine includes the following states:
Initial state: analysis proceeds byte by byte from the current position; initially, the current position is the start of the webpage content.
Tag-start state: when the start of an HTML tag is found, determine whether the tag belongs to the specified categories; if so, skip this HTML tag and return to the initial state; if not, go to the attribute state.
Attribute state: analyze the attributes of the HTML tag, retain one or more specified attributes, and enter the text state.
Text state: convert the retained content of the HTML tag into text content, place it in the corresponding simplified webpage, and enter the tag-end state.
Tag-end state: when the end of the HTML tag is found, enter the tag-end state, then return to the initial state.
And so on, until all HTML tags in the original webpage have been analyzed and the final simplified webpage is obtained, which is saved in the specified database in correspondence with its URL.
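The transitions among these five states can be written down compactly, as in the sketch below. It models only the transitions, not the byte-level parsing; `State` and `next_state` are hypothetical names, and the `skip_tag` flag stands in for the check of whether a tag belongs to the specified categories.

```python
from enum import Enum, auto

class State(Enum):
    INITIAL = auto()
    TAG_START = auto()
    ATTRIBUTE = auto()
    TEXT = auto()
    TAG_END = auto()

def next_state(state, skip_tag=False):
    """Return the state that follows `state`.

    `skip_tag` only matters in TAG_START: a tag of a specified category is
    skipped and the machine returns to INITIAL.
    """
    if state is State.INITIAL:
        return State.TAG_START          # a tag start was found
    if state is State.TAG_START:
        return State.INITIAL if skip_tag else State.ATTRIBUTE
    if state is State.ATTRIBUTE:
        return State.TEXT               # specified attributes retained
    if state is State.TEXT:
        return State.TAG_END            # text placed in the simplified page
    return State.INITIAL                # TAG_END: go on to the next tag
```

A retained tag therefore passes through all five states in order, while a skipped tag goes from the tag-start state straight back to the initial state.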
The process of extracting a summary from a webpage is illustrated below with a specific embodiment. The initial search word is "Chinese people" (中国人民), and the matching words obtained from this initial search word are: "Chinese people" (中国人民), "China" (中国), "the people" (人民), and the single characters "中" (middle), "国" (state), "人" (person) and "民" (populace). Fig. 5A shows a schematic diagram of search results according to an embodiment of the invention. The condensed webpage of the webpage shown in Fig. 5A is obtained, and its web page text is: In the pre-Qin period there was not yet the term "Chinese people"; at that time "China" and "the people" were each used on their own, and their meanings also differed from today's. "China" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this China, to bring peace to the four quarters." The Mao Commentary glosses this as: "China means the capital." Later, "China" was also extended to refer to the Central Plains region, the center of the world, and so on. And "person" (人) and "populace" (民) were likewise two entirely distinct concepts in the pre-Qin period; the Shuowen Jiezi says: "A person is the most noble of the natures of heaven and earth." "The populace is the ignorant multitude."
Multi-pattern matching is performed between the above web page text and the matching words to find every occurrence of each matching word. Based on the positions of the matching words, a sliding window mechanism is used to extract multiple text segments from the above text, with the sliding window size set to 70-90 Chinese characters. The results are as follows, with each matching word underlined (shown as _word_):
1. _Chinese people_"; at that time "_China_" and "_the people_" were each used on their own, and their meanings also differed from today's. "_China_" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_
2. _China_" and "_the people_" were each used on their own, and their meanings also differed from today's. "_China_" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_ means the capital." Later, "_China_
3. _The people_" were each used on their own, and their meanings also differed from today's. "_China_" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_ means the capital." Later, "_China_" was also extended to refer to the _Central_
4. _China_" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_ means the capital." Later, "_China_" was also extended to refer to the _Central_ Plains region, the _center_ of the world, and so on. And "_person_" and "_populace_
5. _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_ means the capital." Later, "_China_" was also extended to refer to the _Central_ Plains region, the _center_ of the world, and so on. And "_person_" and "_populace_" were likewise two entirely distinct concepts in the pre-Qin period; the Shuowen Jiezi says: "A _person_
6. _China_ means the capital." Later, "_China_" was also extended to refer to the _Central_ Plains region, the _center_ of the world, and so on. And "_person_" and "_populace_" were likewise two entirely distinct concepts in the pre-Qin period; the Shuowen Jiezi says: "A _person_ is the most noble of the natures of heaven and earth." "The _populace_
The comprehensive weights of the above text segments are calculated, and the 1st segment is found to have the highest comprehensive weight. To ensure that the text reads smoothly, the 1st segment is corrected: the window start position is moved to the paragraph start of the original web page text. The result is:
In the pre-Qin period there was not yet the term "_Chinese people_"; at that time "_China_" and "_the people_" were each used on their own, and their meanings also differed from today's. "_China_" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this _China_, to bring peace to the four quarters." The Mao Commentary glosses this as: "_China_
The matching words in the above segment are highlighted in red. To reduce the highlight density and improve the visual experience, the highlighting is removed from three occurrences of the low-weight matching word "China" toward the end of the segment, giving the final search summary:
In the pre-Qin period there was not yet the term "Chinese people"; at that time "China" and "the people" were each used on their own, and their meanings also differed from today's. "China" originally referred to the capital, as in the Book of Songs, "Major Odes, Min Lao": "Show favor to this China, to bring peace to the four quarters." The Mao Commentary glosses this as: "China
Fig. 5B shows a schematic diagram of search results according to another embodiment of the invention; the summary extracted in this embodiment is displayed in Fig. 5B.
Fig. 3 shows a schematic diagram of a device for extracting a summary from a webpage according to an embodiment of the invention. As shown in Fig. 3, the device 300 for extracting a summary from a webpage includes:
An acquiring unit 310, adapted to obtain one or more matching words required for extracting the summary;
An abstract extraction unit 320, adapted to find each matching word in the web page text of the webpage; adapted to extract multiple text segments from the web page text of the webpage using a sliding window mechanism, based on the position of each matching word; and adapted to select one of the multiple text segments as the summary of the webpage.
As can be seen, the units of the device shown in Fig. 3 cooperate to complete the process of extracting a summary from web page text according to matching words and a sliding window mechanism, and a search summary service is provided to users based on this process. This scheme extracts the summary based on matching words, so that the summary is associated with the query, and uses the sliding window mechanism to find the summary that best meets the requirements. It overcomes the problem in the prior art that the summary is independent of the query: when a user searches, the user is given a search summary that is more intuitive, more accurate, and strongly relevant, so that the user can quickly and effectively find the needed webpage and the search need is met.
In one embodiment of the invention, the acquiring unit 310 is adapted to derive, from the initial search word, the one or more matching words required for extracting the summary, specifically including one or more of the following: the initial search word itself; each word obtained by performing word segmentation on the initial search word; synonyms and/or error-corrected forms of the initial search word and of each word obtained by word segmentation; words obtained by merging two consecutive words among the words obtained by word segmentation; words obtained by merging three consecutive words among the words obtained by word segmentation; words obtained by merging two words separated by one word among the words obtained by word segmentation; each character of the initial search word, together with each run of two consecutive and three consecutive characters in it; and the word obtained by normalizing the initial search word.
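The derivation rules above might be sketched as follows. This is a minimal illustration, assuming the word segmentation result is already available as a list of tokens; the function name `derive_matching_words` is hypothetical, and the synonym, error-correction and normalization rules are left out because they depend on external dictionaries that the text does not specify.

```python
def derive_matching_words(query, tokens):
    """Derive candidate matching words from a query and its segmented tokens.

    `tokens` is assumed to come from a word segmenter that is not shown here.
    """
    candidates = [query]
    candidates += tokens
    # Two consecutive tokens merged, then three consecutive tokens merged.
    candidates += [a + b for a, b in zip(tokens, tokens[1:])]
    candidates += [a + b + c for a, b, c in zip(tokens, tokens[1:], tokens[2:])]
    # Two tokens separated by exactly one token, merged.
    candidates += [a + c for a, c in zip(tokens, tokens[2:])]
    # Single characters, then runs of two and three consecutive characters.
    chars = [ch for ch in query if not ch.isspace()]
    candidates += chars
    candidates += ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    candidates += ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]
    # Remove duplicates while keeping first-seen order.
    seen = set()
    result = []
    for word in candidates:
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

For a four-character query split into two tokens, such as `derive_matching_words("中国人民", ["中国", "人民"])`, the list begins with the query, the two tokens and the four single characters, followed by the remaining two- and three-character runs.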
In one embodiment of the invention, the abstract extraction unit 320 is adapted to slide a sliding window character by character from the start position of the web page text toward its end, and each time the window slides to the start position of a matching word, to intercept a text segment of the sliding window's size. The size of the sliding window is adjustable within a preset range, and each intercepted text segment starts with a matching word and ends with a matching word.
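A rough sketch of this windowing step is given below. It makes several assumptions that are not in the text: matches are located with repeated `str.find` calls instead of a dedicated multi-pattern matcher, only the upper bound of the window size is enforced, and the names `find_matches` and `extract_segments` are hypothetical.

```python
def find_matches(text, words):
    """Return sorted (start, end) spans for every occurrence of every word."""
    spans = set()
    for word in words:
        start = text.find(word)
        while start != -1:
            spans.add((start, start + len(word)))
            start = text.find(word, start + 1)
    return sorted(spans)


def extract_segments(text, words, max_size=90):
    """Cut one segment per match position; each starts and ends on a match.

    For every position where a matching word begins, the segment is extended
    to the furthest match end that still fits within max_size characters.
    """
    spans = find_matches(text, words)
    segments = []
    seen_starts = set()
    for start, _ in spans:
        if start in seen_starts:
            continue
        seen_starts.add(start)
        ends = [e for s, e in spans if s >= start and e - start <= max_size]
        if not ends:  # the word itself is longer than the window
            continue
        segments.append((start, max(ends)))
    return segments
```

Each returned pair can be turned into text with `text[start:end]`. Enforcing a lower bound on the window size as well would be a straightforward extension.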
To ensure that the intercepted text reads smoothly, the abstract extraction unit 320 is adapted to further correct the position of the sliding window before each interception, specifically including the following process: move toward the start of the text from the matching word at the window start position and stop on encountering the previous matching word; determine whether a sentence start exists between the matching word at the window start position and the previous matching word; if so, correct the window start position to begin at that sentence start, and if not, leave the window start position unchanged; if it is possible to move all the way to the paragraph start, correct the window start position to begin at that paragraph start; and/or, move toward the end of the text from the matching word at the window end position and stop on encountering the next matching word or on exceeding the maximum length of the window.
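The correction of the window start position could look roughly like the sketch below. It assumes that a sentence start is the first non-blank character after one of a small set of sentence-ending punctuation marks, that a newline marks a paragraph start, and that the beginning of the text counts as a paragraph start; the names `SENTENCE_ENDS` and `correct_start` are hypothetical. Correction of the window end position is not shown.

```python
# Assumed sentence-ending punctuation (Chinese and ASCII).
SENTENCE_ENDS = "。！？.!?"


def correct_start(text, start, prev_match_end=0):
    """Move a window start back to the nearest sentence or paragraph start.

    The scan goes from `start` toward the beginning of the text and stops at
    `prev_match_end`, the end of the previous matching word (0 if there is
    none). If no boundary is found before that point, `start` is unchanged.
    """
    i = start
    while i > prev_match_end:
        if text[i - 1] == "\n" or text[i - 1] in SENTENCE_ENDS:
            # A boundary precedes position i; skip blanks that follow it.
            j = i
            while j < start and text[j].isspace():
                j += 1
            return j
        i -= 1
    if i == 0:
        return 0  # reached the very beginning of the text
    return start
```

In the worked example of Fig. 5A, no matching word and no sentence boundary precede the first match, so a rule of this kind moves the window start to the beginning of the paragraph.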
In one embodiment of the invention, the abstract extraction unit 320 is adapted to calculate the comprehensive weight of each of the multiple text segments and to choose the text segment with the highest comprehensive weight as the summary of the webpage. The weights that make up the comprehensive weight and the matching levels of the matching words have been described in detail above and are not repeated here.
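The selection step can be illustrated with a deliberately simplified scoring rule. The actual composition of the comprehensive weight is defined earlier in the document and is not reproduced here; the sketch below merely assumes that each matching word carries a numeric weight and that a segment's score is the weighted count of the matching words it contains. The name `pick_summary` and the weight values are hypothetical.

```python
def pick_summary(segments, word_weights):
    """Return the segment with the highest score.

    The score is a stand-in for the comprehensive weight: the sum over all
    matching words of (weight * number of occurrences in the segment).
    """
    def score(segment):
        return sum(weight * segment.count(word)
                   for word, weight in word_weights.items())

    return max(segments, key=score)
```

With weights `{"cat": 2.0, "dog": 3.0}`, the segment "cat and dog" scores 5.0 and is preferred over "cat cat", which scores 4.0. Higher weights would normally go to higher-ranked matching words, such as the full initial search word.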
In one embodiment of the invention, after one of the multiple text segments has been selected as the summary of the webpage, and so that the user can see intuitively how the summary relates to the search word, the abstract extraction unit 320 is further adapted to highlight in red the matching words in the selected summary according to a certain strategy. The specific highlighting process has been described in detail above and is not repeated here.
In a specific embodiment, the process performed by the device 300 for extracting a summary from a webpage based on a search word has been described in detail above in the embodiment of Fig. 5A and Fig. 5B, and is not repeated here.
Fig. 4 shows a schematic diagram of a device for collecting condensed webpages offline according to an embodiment of the invention. As shown in Fig. 4, the device 400 for collecting condensed webpages offline completes, before the summary of each webpage is extracted, the process of preparing condensed webpages in a specified database for later use, and includes a crawling unit 410 and a simplifying unit 420.
The crawling unit 410 is adapted to obtain each original webpage.
This unit is adapted to crawl each original webpage using a web crawler.
The simplifying unit 420 is adapted to simplify each original webpage to obtain each condensed webpage, and is adapted to save each condensed webpage to the specified database in correspondence with its URL.
This unit is adapted to remove the JS and CSS code from each original webpage.
In one embodiment of the invention, the simplifying unit 420 is adapted, for an original webpage, to classify the HTML tags in the webpage; to remove the HTML tags that belong to the specified category and retain the HTML tags that do not; for the HTML tags that do not belong to the specified category, to analyze their attributes and retain the one or more specified attributes; and to convert the content of the retained HTML tags into text content and place it in the corresponding condensed webpage. The HTML tags of the specified category are tags unrelated to providing the search summary service, including one or more of the following: script tags, noscript tags, iframe tags, single tags, comment tags, and tags containing display:none.
In one embodiment of the invention, the simplifying unit 420 is adapted to analyze the HTML tags in an original webpage one by one using a state machine. The states of the state machine have been described in detail above and are not repeated here.
Correspondingly, the abstract extraction unit 320 shown in Fig. 3 is adapted to find each matching word in the condensed webpage text corresponding to the webpage, to extract multiple text segments from the condensed webpage text using a sliding window mechanism based on the position of each matching word, and to select one of the multiple text segments as the summary of the webpage.
In summary, the technical solution provided by the invention describes the process of extracting a summary from web page text according to matching words and a sliding window mechanism, as well as the device that implements this process; a search summary service is provided to users based on this process and device. This scheme extracts the summary based on matching words derived from the search word, so that the summary is associated with the search word, and uses the sliding window mechanism to find the summary that best meets the requirements. It overcomes the problem in the prior art that the summary is independent of the search word: when a user searches, the user is given a search summary that is more intuitive, more accurate, and strongly relevant, so that the user can quickly and effectively find the needed webpage and the search need is met.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other apparatus. Various general-purpose devices may also be used with the teachings herein. From the description above, the structure required to construct such devices is apparent. Furthermore, the invention is not directed at any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that the description above of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment can be combined into one module or unit or component, and can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a device for extracting a summary from a webpage according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Claims (10)
1. A method for extracting a summary from a webpage, wherein the method comprises:
obtaining one or more matching words required for extracting the summary;
finding each matching word in the web page text of the webpage;
extracting multiple text segments from the web page text of the webpage using a sliding window mechanism, based on the position of each matching word;
selecting one of the multiple text segments as the summary of the webpage.
2. The method of claim 1, wherein said extracting multiple text segments from the web page text of the webpage using a sliding window mechanism, based on the position of each matching word, comprises:
sliding the sliding window character by character from the start position of the web page text toward its end, and intercepting a text segment of the sliding window's size each time the window slides to the start position of a matching word;
wherein the size of the sliding window is adjustable within a preset range, and each intercepted text segment starts with a matching word and ends with a matching word.
3. The method of any one of claims 1-2, wherein the method further comprises:
before each interception, further correcting the position of the sliding window to ensure that the intercepted text reads smoothly.
4. The method of any one of claims 1-3, wherein said further correcting the position of the sliding window before each interception comprises:
moving toward the start of the text from the matching word at the window start position and stopping on encountering the previous matching word; determining whether a sentence start exists between the matching word at the window start position and the previous matching word; if so, correcting the window start position to begin at that sentence start, and if not, leaving the window start position unchanged; if it is possible to move all the way to the paragraph start, correcting the window start position to begin at that paragraph start;
and/or, moving toward the end of the text from the matching word at the window end position and stopping on encountering the next matching word or on exceeding the maximum length of the window.
5. The method of any one of claims 1-4, wherein said selecting one of the multiple text segments as the summary of the webpage comprises:
calculating the comprehensive weight of each of the multiple text segments, and choosing the text segment with the highest comprehensive weight as the summary of this condensed webpage.
6. The method of any one of claims 1-5, wherein the levels of the matching words include at least two of the following levels, listed from highest to lowest:
the initial search word;
a PHRASE in the word segmentation;
two consecutive segmented TERMs joined together;
two non-consecutive segmented TERMs joined together;
three consecutive Chinese characters joined together;
a TERM in the word segmentation;
two Chinese characters, with no stop word;
two Chinese characters, with a stop word;
a single Chinese character;
a single character.
7. The method of any one of claims 1-6, wherein said obtaining one or more matching words required for extracting the summary comprises:
deriving, from the initial search word, the one or more matching words required for extracting the summary, specifically including one or more of the following:
the initial search word itself;
each word obtained by performing word segmentation on the initial search word;
synonyms and/or error-corrected forms of the initial search word and of each word obtained by performing word segmentation on the initial search word;
words obtained by merging two consecutive words among the words obtained by performing word segmentation on the initial search word;
words obtained by merging three consecutive words among the words obtained by performing word segmentation on the initial search word;
words obtained by merging two words separated by one word among the words obtained by performing word segmentation on the initial search word;
each character of the initial search word, together with each run of two consecutive and three consecutive characters in it;
the word obtained by normalizing the initial search word.
8. The method of any one of claims 1-7, wherein said finding each matching word in the web page text of the webpage, said extracting multiple text segments from the web page text of the webpage using a sliding window mechanism based on the position of each matching word, and said selecting one of the multiple text segments as the summary of the webpage comprise:
finding each matching word in the condensed webpage text corresponding to the webpage, extracting multiple text segments from the condensed webpage text using a sliding window mechanism based on the position of each matching word, and selecting one of the multiple text segments as the summary of the webpage.
9. The method of any one of claims 1-8, wherein the method further comprises a step of obtaining the condensed webpage of the webpage, specifically:
removing the JS and CSS code from the webpage to obtain the condensed webpage;
or, for a webpage, classifying the HTML tags in the webpage; removing the HTML tags that belong to the specified category and retaining the HTML tags that do not; for the HTML tags that do not belong to the specified category, analyzing their attributes and retaining the one or more specified attributes; and converting the content of the retained HTML tags into text content and placing it in the corresponding condensed webpage.
10. A device for extracting a summary from a webpage, wherein the device comprises:
an acquiring unit, adapted to obtain one or more matching words required for extracting the summary;
an abstract extraction unit, adapted to find each matching word in the web page text of the webpage; adapted to extract multiple text segments from the web page text of the webpage using a sliding window mechanism, based on the position of each matching word; and adapted to select one of the multiple text segments as the summary of the webpage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410843345.3A CN105808561A (en) | 2014-12-30 | 2014-12-30 | Method and device for extracting abstract from webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410843345.3A CN105808561A (en) | 2014-12-30 | 2014-12-30 | Method and device for extracting abstract from webpage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105808561A true CN105808561A (en) | 2016-07-27 |
Family
ID=56421009
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410843345.3A Pending CN105808561A (en) | 2014-12-30 | 2014-12-30 | Method and device for extracting abstract from webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105808561A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160727 |