CN104699847A - Method and device for extracting summaries from web pages - Google Patents

Info

Publication number
CN104699847A
Authority
CN
China
Prior art keywords
word
sentence
subquery
mark
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510153451.3A
Other languages
Chinese (zh)
Other versions
CN104699847B (en)
Inventor
刘庆朝
何文娟
周娜
冯艳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liu Xiulei
Original Assignee
Liu Xiulei
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liu Xiulei
Priority to CN201510153451.3A
Publication of CN104699847A
Application granted
Publication of CN104699847B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting summaries from web pages. The method extracts a summary from each web page that is retrieved by entering query terms selected by a user and is relevant to those query terms. The method extracts summaries from web pages accurately, so that users can quickly determine whether a web page is relevant and whether to inspect it further, which relieves users of the difficulty of quickly finding relevant documents. The invention further provides a device for extracting summaries from web pages.

Description

Web page summary extraction method and device
Technical field
The present invention relates to a web page summary extraction method and device, and in particular to a keyword-based web page summary extraction method and device.
Background art
Existing web page summary extraction methods fall into the following categories. 1. Methods based on a text corpus. These methods apply machine learning to learn summary-related rules from a text corpus. They have two stages, a training stage and a test stage. The training stage extracts key features of the text from a training corpus and generates rules with a learning algorithm; the test stage applies the learned rules to a test corpus to generate summaries. The drawback is that a summary can only be generated from the rules obtained in training, so the method depends heavily on the training corpus and the accuracy of the generated summary is hard to guarantee. 2. Methods based on a domain ontology. These methods use the ontology's application-level semantic information and the internal relations of the ontology to analyze the text and filter out irrelevant information, so that the topic stands out and the summary is extracted more accurately. The drawback is that Chinese domain ontologies are scarce, so the domain ontology must be built by hand, which involves a heavy workload.
Therefore, there is a need for a web page summary extraction method that ensures the accuracy of the generated summary while also reducing the workload.
Summary of the invention
In view of the above problems, the present invention provides a method and a device that can accurately extract a summary from a web page.
The present invention provides a web page summary extraction method for extracting a summary from a web page that is retrieved by entering a query term selected by a user and is relevant to that query term, characterized by comprising the following steps: (1) a query term preprocessing step: using a word segmentation tool to segment the query term into several sub-query terms, and counting the frequency with which each sub-query term occurs and the distribution information of each sub-query term within the query term; (2) a text preprocessing step: dividing the preprocessed web page text into several sentences, using the word segmentation tool to segment each sentence and saving the distribution information of each word within the web page text, and counting the number of sentences and the number of words in the text; (3) a word frequency and word distribution statistics step: counting, for each sentence, the number of sub-query terms it contains, the number of consecutive sub-query terms it contains, and the number of kinds of sub-query terms it contains, and marking each sub-query term that occurs in the sentence with an indication of which sub-query term it is; (4) a scoring step: based on the statistics of step (3), using a scoring tool to score each sentence and compute its score; (5) a ranking and summary generation step: sorting the sentences by score from high to low and selecting the top K sentences to generate the summary, where 0≤K≤3.
The present invention also provides a web page summary extraction device, characterized by comprising: a query term preprocessing unit, which uses a word segmentation tool to segment the query term into several sub-query terms and counts the frequency with which each sub-query term occurs and the distribution information of each sub-query term within the query term; a text preprocessing unit, which divides the text into several sentences, uses the word segmentation tool to segment each sentence and saves the distribution information of each word within the web page text, and counts the number of sentences and the number of words in the text; a word frequency and word distribution information unit, which counts the number of sentences among the several sentences in which at least one said sub-query term occurs, counts the frequency with which a given sub-query term occurs in each sentence, counts the number of kinds of sub-query terms occurring in each sentence, and marks each sub-query term that occurs in a sentence with an indication of which sub-query term it is; a scoring unit, which uses a scoring tool to score each sentence and compute its score; a sorting unit, which sorts the sentences by score from high to low; and a summary organizing unit, which selects the top K sentences to generate the summary, where 0≤K≤3.
When responding to a user query, the present invention can extract the text surrounding the positions where the query term occurs in a document and return it to the user. Because one document can be retrieved by different query terms, the present invention can form different summary text for the same document depending on the query term. In addition, keywords in the given text are displayed in bold, in red, or in a similar style to distinguish them from plain text. The user can thus quickly find the relevant information needed.
Brief description of the drawings
Fig. 1 is a schematic diagram of the web page summary extraction method of the present invention.
Fig. 2 is a schematic diagram of the web page summary extraction device of the present invention.
Detailed description of the embodiments
The aim of the present invention is to extract summaries accurately from web pages. It mainly addresses the following problems:
1. Summarizing the approximate content of a retrieved web page, so that the user can quickly determine whether the page is relevant and whether to inspect it further. This relieves the user of the difficulty of finding relevant documents quickly.
2. For a search system, when responding to a user query, extracting the text surrounding the positions where the query term occurs in the document and returning it to the user.
3. The accuracy of summary extraction. The present invention mainly uses factors such as word frequency, title, and word distribution as criteria for judging sentence importance, and extracts the important sentences as the summary. This effectively improves the accuracy of summary extraction.
To solve the above technical problems, the web page summary extraction method of the present invention adopts the following techniques:
(1) Text processing. The present invention uses this technique in two places: the query term (or title) preprocessing unit and the text preprocessing unit.
(2) Result ranking. The present invention first splits the given document into sentences, mainly at sentence-ending punctuation such as "." and "!". Each sentence is then scored, mainly according to three factors: 1) the number of sub-query terms (obtained by segmenting the query term) contained in the sentence, counting repeated sub-query terms; 2) the number of consecutive sub-query terms in the sentence; 3) the number of kinds of sub-query terms contained in the sentence. Finally, the sentences are sorted by score and the K highest-scoring sentences are selected.
To help those skilled in the art further understand the features and technical content of the present invention, the technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
Generally, after choosing a search term according to the information sought, the user enters it into the search box. A search engine can usually find content related to the term among a vast number of web pages within a very short time, but this content is not fed back to the user directly, so it is hard for the user to tell which of the retrieved pages are actually needed; browsing every page would waste a great deal of time. It is therefore essential to present the content of a web page to the user quickly and efficiently in the form of a summary, so that the user can promptly judge whether the page is relevant.
Fig. 1 is a schematic diagram of the web page summary extraction method of the present invention. As shown in Fig. 1, the web page summary extraction method of the present invention comprises:
Step S1: query term preprocessing step
A word segmentation tool is used to segment the query term into several sub-query terms.
Specifically, a word frequency statistics tool such as the IKAnalyzer segmenter, ICTCLAS, Ansj, or the jieba segmenter may be used to segment the query term into several sub-query terms. The frequency with which each sub-query term occurs and the distribution information of each sub-query term within the query term are then counted, including the position of each sub-query term in the query term and its preceding and following neighbors, and sub-query terms that are repeated in the query term are identified.
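The bookkeeping in step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the query is assumed to have been segmented already by some tool (the segmenter itself is not shown), and the function name `preprocess_query` and the dictionary keys are hypothetical.

```python
from collections import Counter

def preprocess_query(sub_terms):
    """Collect frequency and distribution info for already-segmented query terms.

    `sub_terms` stands in for the output of a word segmentation tool (assumed).
    """
    freq = Counter(sub_terms)
    info = []
    for pos, term in enumerate(sub_terms):
        info.append({
            "term": term,
            "position": pos,  # index of the sub-query term within the query
            "prev": sub_terms[pos - 1] if pos > 0 else None,
            "next": sub_terms[pos + 1] if pos + 1 < len(sub_terms) else None,
        })
    # Sub-query terms that occur more than once in the query.
    repeated = sorted(term for term, count in freq.items() if count > 1)
    return freq, info, repeated

freq, info, repeated = preprocess_query(["A", "B", "A", "C"])
```

Here `freq` holds the occurrence counts, `info` holds the position and neighbor relations of each sub-query term, and `repeated` lists the terms that occur more than once.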
Step S2: text preprocessing step
The preprocessed web page text is divided into several sentences; a word segmentation tool is used to segment each sentence, and the distribution information of each word within the web page text is saved.
Specifically, the text is split into sentences at punctuation such as "." and "!", and a word segmentation tool such as the IKAnalyzer segmenter then segments each sentence. The position of each word in the text is saved, including its start position and end position.
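The sentence-splitting part of step S2 might look like the sketch below. The terminator set is an assumption: it uses the full-width marks "。！？" plus ASCII "!" and "?", and deliberately leaves out ASCII "." so that decimals are not split. The function name `split_sentences` is hypothetical, and the sketch records sentence offsets only; word offsets would come from the segmentation tool.

```python
import re

# Sentence terminators: full-width "。！？" plus ASCII "!" and "?".
# ASCII "." is left out so decimals such as 2.03 are not split (an assumption).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_sentences(text):
    """Return (sentence, start, end) tuples; `end` is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in _SENTENCE.finditer(text)]

for sentence, start, end in split_sentences("今天很好。真的吗？是的！"):
    print(start, end, sentence)
```

Keeping the start and end offsets alongside each sentence makes it possible to map a later match back to its place in the original text.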
The distribution information contributes to sentence scoring: sentences are scored according to the distribution of the sub-query terms. For example, if the query term consists of sub-query terms A, B, and C, then a sentence of the form ...AB...C... or ...ABC... certainly scores higher than a sentence of the form ...A...B...C....
Step S3: word frequency and word distribution statistics step
For each sentence, the number of sub-query terms it contains, the number of consecutive sub-query terms it contains, and the number of kinds of sub-query terms it contains are counted, and each sub-query term occurring in the sentence is marked with an indication of which sub-query term it is. Here, the number of kinds of sub-query terms means the number of distinct sub-query terms that occur.
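The three counts of step S3 can be sketched as below for one already-segmented sentence. The function name `sentence_stats` and the returned dictionary layout are hypothetical. One point is an assumption rather than something the text settles: two hits count as consecutive whenever they sit at neighboring token positions, regardless of the order the terms have in the query.

```python
def sentence_stats(tokens, sub_terms):
    """Count sub-query term statistics for one segmented sentence."""
    terms = set(sub_terms)
    # Mark every hit with its token index and the sub-query term it matches.
    marks = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    occur = len(marks)                      # all occurrences, repeats included
    kinds = len({tok for _, tok in marks})  # distinct sub-query terms present
    # One point per pair of hits at neighboring token positions.
    continuation = sum(
        1 for (i, _), (j, _) in zip(marks, marks[1:]) if j == i + 1
    )
    return {"kinds": kinds, "continuation": continuation,
            "occur": occur, "marks": marks}

stats = sentence_stats(["x", "A", "B", "y", "C"], ["A", "B", "C"])
```

For this "...AB...C..." shaped input, the sketch finds three kinds, three occurrences, and one adjacent pair.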
Step S4: scoring step
Based on the statistics of step S3, a scoring tool is used to score each sentence and compute its score.
Specifically, the BM25 algorithm is used to compute a BM25 contribution score for each sentence, a word distribution contribution score is also computed for each sentence, and the two scores are weighted and combined to obtain the score of each sentence.
The BM25 algorithm is commonly used for search relevance scoring. Its main idea is as follows: the Query is parsed into morphemes q_i; then, for each search result D, the relevance score of each morpheme q_i with respect to D is computed; finally, the relevance scores of the q_i with respect to D are summed with weights to obtain the relevance score of the Query with respect to D.
The present invention uses the BM25 formula to compute the BM25 contribution score. The BM25 formula is as follows:
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} w_i \cdot R(q_i, d) \qquad (1)$$
Here Q denotes the query term and d denotes a sentence. Score(Q, d) is the score of sentence d when the query term Q is entered. Segmenting the query term yields the set of sub-query terms {q_i}. The summation on the right adds up, over all sub-query terms, the score of each sub-query term for the sentence; w_i is the weight of sub-query term q_i, and R(q_i, d) is the relevance score of sub-query term q_i with respect to sentence d.
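The weighted sum in formula (1) can be sketched as a function that takes the relevance function R as a parameter. The names `weighted_relevance` and `count_occurrences` are hypothetical, and the raw-count relevance function is only a stand-in chosen for illustration, not the R used by the patent.

```python
def weighted_relevance(sub_terms, sentence_tokens, relevance, weights=None):
    """Formula (1): sum of w_i * R(q_i, d) over the sub-query terms q_i."""
    if weights is None:
        weights = [1.0] * len(sub_terms)  # default: every w_i equal to 1
    return sum(w * relevance(q, sentence_tokens)
               for q, w in zip(sub_terms, weights))

def count_occurrences(term, tokens):
    """Hypothetical stand-in for R(q_i, d): raw count of the term in the sentence."""
    return tokens.count(term)

score = weighted_relevance(["A", "B"], ["A", "x", "A", "B"], count_occurrences)
```

Passing R in as an argument mirrors the remark that the concrete relevance formula can differ between applications.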
R(q_i, d) in the formula is the score of one sub-query term for one sentence; its concrete definition may use different formulas for different applications. In the present invention, the weight w_i of every sub-query term is set to 1, and the formula actually applied is: score = kinds + (continuation + occur) / 100.0 (2)
Here kinds is the number of kinds of sub-query terms contained in the sentence, continuation is the number of consecutive sub-query terms, and occur is the number of sub-query terms contained in the sentence.
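Formula (2) translates directly into a one-line function. The function name `sentence_score` is hypothetical; the arithmetic follows the formula as stated.

```python
def sentence_score(kinds, continuation, occur):
    """Formula (2): score = kinds + (continuation + occur) / 100.0."""
    return kinds + (continuation + occur) / 100.0

# Two kinds, one adjacent pair, two occurrences.
print(round(sentence_score(2, 1, 2), 2))  # 2.03
```

One reading of the division by 100.0 is that kinds dominates the ranking, while continuation and occur mainly break ties between sentences with the same number of kinds, as long as their sum stays below 100.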
When the word distribution contribution score is computed for each sentence, the score is based mainly on the segment information of the sub-query terms: if several sub-query terms occur together in order, a higher score is given. In the present invention, the distribution information is in fact the continuation term in formula (2): whenever two sub-query terms are adjacent in a sentence, the continuation value of that sentence is incremented by 1. Specifically, continuation is initialized to 0 and is incremented by 1 for each pair of adjacent sub-query terms. For example, continuation=1 for ...AB...C..., continuation=2 for ...ABC..., and continuation=0 for ...A...B...C....
Step S5: ranking and summary generation step
The sentences are sorted by score from high to low, and the top K sentences are selected to generate the summary, where 0≤K≤3; typically K is 3.
In step S5, if the summary generated from the K sentences is longer than the user-defined length, the content beyond that length is deleted to obtain a summary of suitable length. Each sub-query term is also marked, for example in bold or in red, to distinguish it from plain text.
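Step S5 can be sketched as below: rank, keep the top K sentences, trim to a length limit, then mark the sub-query terms. The function name `build_summary`, its parameters, and the choice of a `<font color='red'>` wrapper for every match are assumptions for illustration. Measuring the length limit in characters is also an assumption.

```python
import re

def build_summary(scored_sentences, sub_terms, k=3, max_len=None):
    """Pick the top-k sentences, trim to max_len, then highlight sub-query terms."""
    # sorted() is stable, so sentences with equal scores keep document order.
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    summary = "".join(sentence for sentence, _ in ranked[:k])
    if max_len is not None:
        summary = summary[:max_len]  # trim before adding markup so tags stay whole
    # Longest terms first so a longer term wins over one of its substrings.
    terms = sorted({t for t in sub_terms if t}, key=len, reverse=True)
    if terms:
        pattern = re.compile("|".join(re.escape(t) for t in terms))
        summary = pattern.sub(
            lambda m: f"<font color='red'>{m.group()}</font>", summary
        )
    return summary

summary = build_summary(
    [("s1.", 0.0), ("A B.", 2.03), ("s3.", 0.0)], ["A", "B"], k=2
)
```

Highlighting in a single regular-expression pass avoids re-matching text inside markup that an earlier replacement inserted.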
To further illustrate the web page summary extraction method of the present invention, an example is given below.
Taking the query term "the joyous New Year's Day" as an example, a web page summary is extracted according to the above steps.
Step 1: input the query term and the text
Query term: the joyous New Year's Day
Text:
[It is said that Zhu Zhishan of the Ming Dynasty once played just such a trick on someone. At the time of the joyous New Year's Day, Zhu Zhishan passed the gate of a wealthy household and saw a sheet of red paper pasted on the red-painted gate. On a whim, he had someone bring brush and ink and wrote twelve large characters on it: this year very good bad luck completely without wealth enters. The owner of the house came out, took one look, and read: "This year, very good bad luck; completely without wealth entering." His face immediately turned ashen. He thought: is this not meant to bring me bad luck? So he was about to beat Zhu Zhishan. But Zhu Zhishan laughed at him and said unhurriedly: "What I wrote is clearly auspicious; you simply failed to understand it!" With that, he added a few marks with his brush, and it became: "This year is very good; bad luck is completely gone; wealth enters." Seeing this, the owner could not help turning from anger to joy, and said "Good, good" again and again.]
Step 2: query term preprocessing
The query term is segmented into two sub-query terms, "the new year" and "happy festival time".
Step 3: text preprocessing
The text is split into sentences, one per line, with the following result:
[It is said that Zhu Zhishan of the Ming Dynasty once played just such a trick on someone.
At the time of the joyous New Year's Day, Zhu Zhishan passed the gate of a wealthy household and saw a sheet of red paper pasted on the red-painted gate.
On a whim, he had someone bring brush and ink and wrote twelve large characters on it: this year very good bad luck completely without wealth enters.
The owner of the house came out, took one look, and read: "This year, very good bad luck; completely without wealth entering." His face immediately turned ashen.
He thought: is this not meant to bring me bad luck?
So he was about to beat Zhu Zhishan.
But Zhu Zhishan laughed at him and said unhurriedly: "What I wrote is clearly auspicious; you simply failed to understand it!" With that, he added a few marks with his brush, and it became: "This year is very good; bad luck is completely gone; wealth enters." Seeing this, the owner could not help turning from anger to joy, and said "Good, good" again and again.]
Step 4: sentence scoring
The scoring formula is: score = kinds + (continuation + occur) / 100.0.
Here kinds is the number of kinds of sub-query terms contained in the sentence, continuation is the number of consecutive sub-query terms, and occur is the number of sub-query terms contained in the sentence. The three values kinds, continuation, and occur are computed for each sentence. The first two sentences are worked through below:
1. Query term: the joyous New Year's Day
Sentence: It is said that Zhu Zhishan of the Ming Dynasty once played just such a trick on someone.
kinds is initialized to 0. The sentence is checked for "the new year": it is absent, so kinds is unchanged. The sentence is checked for "happy festival time": it is absent, so kinds is unchanged. Hence kinds=0.
continuation is initialized to 0. No consecutive sub-query terms occur in the sentence, so continuation=0.
The number of sub-query terms occurring in the sentence is 0, so occur=0.
Therefore, the score of this sentence is score=0.
2. Query term: the joyous New Year's Day
Sentence: At the time of the joyous New Year's Day, Zhu Zhishan passed the gate of a wealthy household and saw a sheet of red paper pasted on the red-painted gate.
kinds is initialized to 0. The sentence is checked for "the new year": it is present, so kinds is incremented by 1. The sentence is checked for "happy festival time": it is present, so kinds is incremented by 1. Hence kinds=2.
continuation is initialized to 0. Consecutive sub-query terms occur in the sentence, so continuation=1.
The number of sub-query terms occurring in the sentence is 2, so occur=2.
Therefore, the score of this sentence is score=2.03.
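Substituting the three counts for the second sentence into the scoring formula gives the value stated above:

$$\text{score} = \text{kinds} + \frac{\text{continuation} + \text{occur}}{100.0} = 2 + \frac{1 + 2}{100.0} = 2.03$$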
By the same method, every sentence other than the second has a score of 0.
Step 5: generate the summary
The three highest-scoring sentences are selected as the summary, and the query term is marked in red. The result is as follows:
At the time of <font color='red'>the joyous New Year's Day</font>, Zhu Zhishan passed the gate of a wealthy household and saw a sheet of red paper pasted on the red-painted gate. On a whim, he had someone bring brush and ink and wrote twelve large characters on it: this year very good bad luck completely without wealth enters. The owner of the house came out, took one look, and read: "This year, very good bad luck; completely without wealth entering.
The present invention also provides a web page summary extraction device. As shown in Fig. 2, the web page summary extraction device of the present invention comprises: a query term preprocessing unit 1, which uses a word segmentation tool to segment the query term into several sub-query terms and counts the frequency with which each sub-query term occurs and the distribution information of each sub-query term within the query term; a text preprocessing unit 2, which divides the text into several sentences, uses the word segmentation tool to segment each sentence and saves the distribution information of each word within the web page text, and counts the number of sentences and the number of words in the text; a word frequency and word distribution information unit 3, which counts the number of sentences among the several sentences in which at least one sub-query term occurs, counts the frequency with which a given sub-query term occurs in each sentence, counts the number of kinds of sub-query terms occurring in each sentence, and marks each sub-query term occurring in a sentence with an indication of which sub-query term it is; a scoring unit 4, which uses a scoring tool to score each sentence and compute its score; a sorting unit 5, which sorts the sentences by score from high to low; and a summary organizing unit 6, which selects the top K sentences to generate the summary, where 0≤K≤3.
In the present invention, if the summary generated from the K sentences is longer than the set summary length, the sorting unit 5 deletes the sentence content beyond the summary length and marks the sub-query terms occurring in the summary. In addition, the distribution information of each sub-query term within the query term comprises the position of each sub-query term in the query term and its preceding and following neighbors, and the distribution information of each word within the web page text comprises the start position and end position of each word in the text.
The scoring unit 4 computes a BM25 contribution score for each sentence and a word distribution contribution score for each sentence, and weights and combines the two to obtain the score of each sentence.
In the present invention, a word frequency statistics tool such as the IKAnalyzer segmenter may be used for word segmentation.
The web page summary extraction device of the present invention can accurately extract summaries from web pages for the user's reference, considerably reducing the user's search time.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will appreciate that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims (7)

1. A web page summary extraction method for extracting a summary from a web page that is retrieved by entering a query term selected by a user and is relevant to said query term, characterized by comprising the following steps:
(1) Query term preprocessing step
A word segmentation tool is used to segment the query term into several sub-query terms;
(2) Text preprocessing step
The preprocessed web page text is divided into several sentences, and the word segmentation tool is used to segment each sentence and save the distribution information of each sub-query term within the web page text;
(3) Word frequency and word distribution statistics step
For each sentence, the number of sub-query terms it contains, the number of consecutive sub-query terms it contains, and the number of kinds of sub-query terms it contains are counted, and each sub-query term occurring in the sentence is marked with an indication of which sub-query term it is;
(4) Scoring step
Based on the statistics of step (3), a scoring tool is used to score each sentence and compute its score;
(5) Ranking and summary generation step
The sentences are sorted by score from high to low, and the top K sentences are selected to generate the summary, where 0≤K≤3.
2. The web page summary extraction method according to claim 1, characterized in that said step (5) further comprises: if the summary generated from said K sentences is longer than the set summary length, deleting the sentence content beyond said summary length, and marking the sub-query terms occurring in the summary.
3. The web page summary extraction method according to claim 1, characterized in that the distribution information of each sub-query term within the query term comprises the position of each sub-query term in the query term and its preceding and following neighbors, and the distribution information of each sub-query term within the web page text comprises the start position and end position of each sub-query term in the text.
4. The web page summary extraction method according to claim 1, characterized in that said scoring tool in said step (4) is the BM25 algorithm, specifically: a BM25 contribution score is computed for each sentence and a word distribution contribution score is computed for each sentence, and the BM25 contribution score and the word distribution contribution score are weighted and combined to obtain the score of each sentence.
5. The web page summary extraction method according to claim 4, characterized in that the formula score = kinds + (continuation + occur) / 100.0 is used to compute the BM25 contribution score of each sentence, where kinds is the number of kinds of sub-query terms contained in the sentence, continuation is the number of consecutive sub-query terms, and occur is the number of sub-query terms contained in the sentence.
6. The web page summary extraction method according to any one of claims 1 to 4, characterized in that said word segmentation tool is the IKAnalyzer segmenter, ICTCLAS, Ansj, or the jieba segmenter.
7. A web page summary extraction device, characterized by comprising:
a query term preprocessing unit, which uses a word segmentation tool to segment the query term into several sub-query terms and counts the frequency with which each sub-query term occurs and the distribution information of each sub-query term within the query term;
a text preprocessing unit, which divides the text into several sentences, uses the word segmentation tool to segment each sentence and saves the distribution information of each word within the web page text, and counts the number of sentences and the number of words in the text;
a word frequency and word distribution information unit, which counts the number of sentences among the several sentences in which at least one said sub-query term occurs, counts the frequency with which a given sub-query term occurs in each sentence, counts the number of kinds of sub-query terms occurring in each sentence, and marks each sub-query term occurring in a sentence with an indication of which sub-query term it is;
a scoring unit, which uses a scoring tool to score each sentence and compute its score;
a sorting unit, which sorts the sentences by score from high to low; and
a summary organizing unit, which selects the top K sentences to generate the summary, where 0≤K≤3.
CN201510153451.3A 2015-02-13 2015-04-02 Web-page summarization abstracting method and its device Active CN104699847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510153451.3A CN104699847B (en) 2015-02-13 2015-04-02 Web-page summarization abstracting method and its device

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201510077258 2015-02-13
CN2015100772586 2015-02-13
CN201510153451.3A CN104699847B (en) 2015-02-13 2015-04-02 Web-page summarization abstracting method and its device

Publications (2)

Publication Number Publication Date
CN104699847A true CN104699847A (en) 2015-06-10
CN104699847B CN104699847B (en) 2018-02-06

Family

ID=53346967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510153451.3A Active CN104699847B (en) 2015-02-13 2015-04-02 Web-page summarization abstracting method and its device

Country Status (1)

Country Link
CN (1) CN104699847B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512335A (en) * 2015-12-29 2016-04-20 腾讯科技(深圳)有限公司 Abstract searching method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001086490A1 (en) * 2000-05-09 2001-11-15 Justsystem Corporation Method and apparatus for rapidly producing document summaries and document browsing aids
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102375813A (en) * 2010-08-09 2012-03-14 腾讯科技(深圳)有限公司 Duplicate detection system and method for search engines

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵东生 等: "基于查询词出现的相关度改进", 《情报学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512335A (en) * 2015-12-29 2016-04-20 腾讯科技(深圳)有限公司 Abstract searching method and device
CN105512335B (en) * 2015-12-29 2020-01-31 腾讯科技(深圳)有限公司 abstract searching method and device

Also Published As

Publication number Publication date
CN104699847B (en) 2018-02-06

Similar Documents

Publication Publication Date Title
CN104679730A (en) Webpage summarization extraction method and device thereof
Cafarella et al. Webtables: exploring the power of tables on the web
Wang et al. Understanding tables on the web
CN105868313A (en) Mapping knowledge domain questioning and answering system and method based on template matching technique
Piperski et al. Big and diverse is beautiful: A large corpus of Russian to study linguistic variation
Lu et al. A dataset search engine for the research document corpus
Alpizar-Chacon et al. Knowledge models from PDF textbooks
Gregorowicz et al. Mining a large-scale term-concept network from Wikipedia
Kreaa et al. Arabic words stemming approach using Arabic WordNet
Zhao et al. Collocation polarity disambiguation using web-based pseudo contexts
Wongso et al. A literature review of question answering system using named entity recognition
Sunercan et al. Wikipedia missing link discovery: A comparative study
CN104699847A (en) Method and device for extracting summaries from web pages
Roche et al. AcroDef: A quality measure for discriminating expansions of ambiguous acronyms
Eberius et al. Publish-time data integration for open data platforms
Rodrigues et al. Rapport—a portuguese question-answering system
Spirin et al. Unsupervised approach to generate informative structured snippets for job search engines
Pembe et al. Heading-based sectional hierarchy identification for HTML documents
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Manne et al. A Feature Terms based Method for Improving Text Summarization with Supervised POS Tagging
Reddy et al. Cross lingual information retrieval using search engine and data mining
Holzmann et al. Named entity evolution recognition on the Blogosphere
Zhang et al. English-Chinese bi-directional OOV translation based on web mining and supervised learning
Ganesh et al. Exploiting structure and content of wikipedia for query expansion in the context
Yoshida et al. Reformatting Web Documents via Header Trees

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant