CN103853834A - Text structure analysis-based Web document abstract generation method - Google Patents

Info

Publication number
CN103853834A
CN103853834A CN201410090200.0A CN201410090200A CN103853834A CN 103853834 A CN103853834 A CN 103853834A CN 201410090200 A CN201410090200 A CN 201410090200A CN 103853834 A CN103853834 A CN 103853834A
Authority
CN
China
Prior art keywords
text
sentence
semantic
cut
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410090200.0A
Other languages
Chinese (zh)
Other versions
CN103853834B (en)
Inventor
沈怡涛
顾君忠
林晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201410090200.0A
Publication of CN103853834A
Application granted
Publication of CN103853834B
Status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL (uniform resource locator) as input, extracts the main text of the web page by combining visual features with text features, partitions the main text into several semantic sections, and summarizes each semantic section, so that the generated summary covers more of the original content. The method addresses three difficulties: web page structure is complex, the main text is hard to identify, and automatic summarization of Chinese text is still at an exploratory stage. Under these conditions it generates a better-quality text summary from a web page.

Description

Method for generating Web document summaries based on text structure analysis
Technical field
The present invention relates to the technical fields of web page main-text extraction, natural language processing, and Chinese text summarization, and specifically to a method for generating Web document summaries based on text structure analysis.
Background art
At present, the Internet has become the main source from which people obtain information. With the rapid development of user-generated content (UGC) in recent years in particular, the information on the Internet is growing explosively. Search engines can return results that match a user's query, but the user still has to find the web page that best meets his or her needs in the result list. The large amount of search engine optimization and reposted content on the Internet makes it hard for users to find information quickly and accurately.
An automatic summarization system uses a computer to process a Web document quickly and to extract its core content at a given compression ratio. From the summary, the user can learn the subject of the document and judge its value, which improves the efficiency of information retrieval.
Web documents contain a large amount of noise, that is, information unrelated to the topic, such as advertisements, navigation bars, user function bars, related recommendations, and copyright notices. A Web document is semi-structured: it has some structure, but its semantics cannot be determined from that structure. The way content is expressed in the HTML source can differ greatly from the finally rendered page. With the widespread use of JS and AJAX in recent years, web page data is no longer static HTML code but is generated dynamically, and can even change in response to user actions. Extracting content that is relevant to the topic and structurally correct from a Web document is therefore difficult.
Chinese text summarization systems have been studied for more than two decades, but the field is still at an exploratory stage and the results of automatic summarization are far from satisfactory. Automatic summarization methods fall into two broad classes: understanding-based and extraction-based. Because natural language processing technology has not yet seen a major breakthrough, understanding-based methods cannot yet achieve true automatic summarization.
Research on automatic summarization of Web documents has an even shorter history. "Compared with traditional text, the text structure of a web page is loose, titles are not named as rigorously, a sentence may end without an end mark, and there is a large amount of content unrelated to the main text, all of which makes summary generation more difficult."
Summary of the invention
The object of the invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines visual feature analysis, natural language analysis, and text structure analysis to generate a semantics-based, good-quality summary for each web page in a set of search results, for the user's reference.
The object of the invention is achieved as follows:
A method for generating Web document summaries based on text structure analysis, comprising the following steps:
1) input the URL of the web page to be summarized;
2) extract the main text from the web page to be summarized based on visual analysis, specifically:
2.1) parse and render the Web document with a browser core;
2.2) partition the web page into blocks with the visual tree (VIPS) algorithm, obtaining the position and area of each block;
2.3) segment the text of each block into words;
2.4) analyze the text features of each block;
2.5) score each block on whether it contains main text;
2.6) concatenate, in order, the text of the blocks whose score exceeds a threshold;
2.7) output the main text of the Web document;
3) automatically summarize the extracted main text based on text structure analysis, specifically:
3.1) obtain the web page main text from step 2);
3.2) perform word segmentation and part-of-speech tagging on the text;
3.3) preprocess the text: identify its basic structure, including the article title, sentence completion, and paragraph splitting;
3.4) split the text into semantic sections: identify by text structure analysis the positions where the semantics change, and use them as the boundaries of the semantic sections;
3.5) for each semantic section, measure the importance of each sentence within its section using an improved TFIDF method, and then, according to the required summary length, extract several sentences that represent the topic of the section;
3.6) concatenate the extracted sentences in order and output the summary.
The text features in step 2.4) are the character count, font size, number of declarative sentences, number of non-declarative sentences, and number of text paragraphs.
In step 2.5), each block is scored on whether it contains main text using the following formula:
$$V(s) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value computed from the size and position of the block, $(x_1, y_1)$ are the coordinates of the block's upper-left corner, and $(x_2, y_2)$ are the coordinates of its lower-right corner.
In step 3.4), the positions where the semantics change are identified as follows:
1) split document D into sentences; the position between every two adjacent sentences is a candidate cut point;
2) score each candidate cut point with the formula:
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
where $R(s_i, s_j)$ is the semantic relatedness between sentences $s_i$ and $s_j$, and $p_i$ denotes the cut point between sentences $s_{i-1}$ and $s_i$. If $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the cut-point weight, so $p_i$ is a cut point between semantic sections of the text. $a$ is an adjustable empirical parameter that sets the scope of the semantic analysis when identifying a cut point: $a$ sentences before and $a$ sentences after the cut point are considered.
3) if the score of a cut point exceeds a threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, the cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).
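The cut-point procedure above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the 0-based sentence indexing, the clipping of the summation windows at the ends of the text, and the toy `overlap` relatedness (shared-word count) are all assumptions made for the example. The summation bounds follow the formula as printed.

```python
def overlap(s1, s2):
    """Toy sentence relatedness (assumption): number of shared words."""
    return len(set(s1) & set(s2))


def cut_point_scores(sentences, relatedness, a=3):
    """Score every candidate cut point p_i (between s_{i-1} and s_i).

    Follows Q(p_i) = sum_{i+1 < j <= i+a} R(s_i, s_j)
                   - sum_{i-a <= j < i} R(s_i, s_j),
    with both windows clipped to the valid sentence indices.
    """
    n = len(sentences)
    scores = {}
    for i in range(1, n):
        after = sum(relatedness(sentences[i], sentences[j])
                    for j in range(i + 2, min(i + a, n - 1) + 1))
        before = sum(relatedness(sentences[i], sentences[j])
                     for j in range(max(i - a, 0), i))
        scores[i] = after - before
    return scores


def select_cut_points(scores, threshold=0.0):
    """Keep cut points whose score exceeds the threshold and both neighbours."""
    cuts = []
    for i, q in scores.items():
        left = scores.get(i - 1, float("-inf"))
        right = scores.get(i + 1, float("-inf"))
        if q > threshold and q > left and q > right:
            cuts.append(i)
    return cuts


# Six one-word "sentences": three about topic a, then three about topic b.
toy = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]
toy_scores = cut_point_scores(toy, overlap, a=2)
toy_cuts = select_cut_points(toy_scores, threshold=0.0)
```

In the toy input the only selected cut point is the one in front of the first topic-b sentence. In practice `relatedness` would be the sentence relatedness R defined in the text, and both `a` and `threshold` would need tuning.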
The analysis identification step 2 of the position that described semanteme changes) between sentence the calculating of semantic relevancy comprise the following steps:
1) sentence is cut into the set of word;
2) use following formula to calculate semantic relevancy between sentence
$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$
where $R(w_i, w_j)$ is the semantic relatedness between words $w_i$ and $w_j$.
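As a rough illustration of the sentence relatedness formula, the sketch below sums, for each word of the first sentence, its best match in the second sentence. The function names and the toy `exact_match` word relatedness are assumptions for the example; the patent derives word relatedness from a semantic resource instead.

```python
def exact_match(w1, w2):
    """Toy word relatedness (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0


def sentence_relatedness(s1, s2, word_relatedness):
    """R(s1, s2): for each word in s1, take its best match in s2, then sum."""
    if not s2:
        return 0.0
    return sum(max(word_relatedness(wi, wj) for wj in s2) for wi in s1)


forward = sentence_relatedness(["a", "a", "b"], ["a"], exact_match)
backward = sentence_relatedness(["a"], ["a", "a", "b"], exact_match)
```

Because the sum runs over the words of the first sentence only, the measure is not symmetric: here `forward` and `backward` differ.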
In step 3.5), the importance of each sentence within its semantic section is computed with the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
When computing TFIDF(w), each paragraph is treated as an independent document, and the paragraphs of the whole article are treated as the document collection.
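A minimal sketch of this sentence scoring is given below. The text does not spell out which TFIDF variant is used, so the weighting here (relative term frequency times the natural log of paragraph count over paragraph frequency) is an assumption, as are the function names and the toy data.

```python
import math
from collections import Counter


def tfidf_tables(paragraphs):
    """Return one {word: tf-idf} table per paragraph.

    Each paragraph (a list of tokens) is treated as one document and the
    list of paragraphs as the document collection.
    """
    n = len(paragraphs)
    df = Counter()
    for p in paragraphs:
        df.update(set(p))  # paragraph frequency of each word
    tables = []
    for p in paragraphs:
        tf = Counter(p)
        tables.append({w: (c / len(p)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return tables


def sentence_importance(sentence, table):
    """V(S): sum of the tf-idf weights of the words in the sentence."""
    return sum(table.get(w, 0.0) for w in sentence)


# Toy article with two tokenized paragraphs (hypothetical data).
toy_paragraphs = [["web", "summary", "web"], ["news", "summary"]]
toy_tables = tfidf_tables(toy_paragraphs)
toy_value = sentence_importance(["web", "summary"], toy_tables[0])
```

With this weighting, a word that occurs in every paragraph (such as "summary" in the toy data) contributes nothing, so sentences are ranked by the words that are specific to their own paragraph.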
The invention filters out words, links, and other content unrelated to the topic and identifies the article text contained in a web page with high accuracy and robustness. The summarization stage uses automatic summarization based on text structure analysis, so the generated summary has high coverage and reads fairly smoothly.
For a Web document and a compression ratio specified by the user, the invention needs only the URL of the web page to be summarized. Within a few seconds it produces a summary that covers the meaning of the original text and is fairly accurate and fluent, helping users find information on the Internet quickly and accurately.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the flow chart of web page preprocessing in the invention;
Fig. 3 is the flow chart of automatic summarization in the invention.
Embodiment
The invention discloses a search-engine-oriented method for generating Web document summaries that automatically analyzes a web page and generates a text summary reflecting the page's topic.
The invention comprises main-text extraction that combines visual features with text features, and automatic text summarization that divides the text into sub-topics by text structure analysis.
The invention takes a URL as input and passes through two stages, main-text extraction and automatic summarization, to generate the text summary.
The algorithms of the two stages are described below, using the summarization of a news web page as an example:
Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, comprising the web page preprocessing flow and the automatic summarization flow.
Specifically, in this embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the preprocessing flow locates the main-text part of the page more accurately and more robustly than other methods. It also considers other features, such as text features, text relatedness, HTML tag features, and semantic features, to further improve the accuracy of main-text extraction.
The rendering step reads the web page at the input URL; in this embodiment it uses the IE11 browser core to process the HTML tags and render the page. On the rendered page, the visual tree analysis step applies the VIPS algorithm to obtain the position and area of each block. In this embodiment the step divides the news page into 6 blocks: a top block, a bottom block, a navigation block, an advertisement block, and two blocks containing main text. The word segmentation step segments the text of each block into words. The text feature analysis step then analyzes the text features of the segmentation results. Finally, the comprehensive analysis step combines each block's visual tree features with its text features and outputs the main text.
In this embodiment, $P(x_1, y_1, x_2, y_2)$ is computed with the following formula:
$$P(x_1, y_1, x_2, y_2) = (x_2 - x_1)(y_2 - y_1) - x_1 y_1$$
where $(x_1, y_1)$ are the coordinates of the block's upper-left corner and $(x_2, y_2)$ are the coordinates of its lower-right corner. The value V(s) of each block is then computed:
$$V(s) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
The V(s) values of the 6 blocks above, in descending order, are $3.7 \times 10^{6}$, $2.3 \times 10^{6}$, $7.5 \times 10^{5}$, $5.4 \times 10^{6}$, $3.7 \times 10^{5}$, $1.6 \times 10^{5}$, $1.2 \times 10^{4}$.
The threshold used in this embodiment is $10^{6}$, so the blocks with V(s) greater than $10^{6}$ are selected, namely the two blocks with the largest V(s) values. In this embodiment these two blocks are exactly the two blocks containing main text, so the main text is extracted correctly.
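The block scoring and threshold selection described above can be sketched as follows. The `Block` class, its field names, and the sample coordinates and sentence counts are hypothetical; only the two formulas and the threshold of 10^6 come from the text, and S is read as squared in V(s).

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered page block (hypothetical container for this sketch)."""
    x1: float  # upper-left x
    y1: float  # upper-left y
    x2: float  # lower-right x
    y2: float  # lower-right y
    declarative: int      # S, number of declarative sentences
    non_declarative: int  # N, number of non-declarative sentences
    text: str = ""


def position_weight(b):
    """P = (x2 - x1) * (y2 - y1) - x1 * y1, i.e. area minus a position term."""
    return (b.x2 - b.x1) * (b.y2 - b.y1) - b.x1 * b.y1


def block_score(b):
    """V(s) = S^2 * P / (N + 1)."""
    return b.declarative ** 2 * position_weight(b) / (b.non_declarative + 1)


def extract_body(blocks, threshold=1e6):
    """Concatenate, in page order, the text of blocks scoring above threshold."""
    return "\n".join(b.text for b in blocks if block_score(b) > threshold)


# Hypothetical blocks: a large article block and a thin navigation bar.
article = Block(0, 0, 1000, 500, declarative=3, non_declarative=0, text="body")
nav = Block(0, 0, 1000, 50, declarative=0, non_declarative=5, text="nav")
body_text = extract_body([article, nav])
```

The score rewards large blocks with many declarative sentences and penalizes blocks dominated by non-declarative fragments such as menu items, which is why the navigation bar in the toy data scores zero.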
After the main text is extracted, the automatic summarization flow (see Fig. 3) is carried out. It comprises the steps of text preprocessing, word relatedness computation, sentence relatedness computation, semantic section splitting, and summary generation.
The text preprocessing step identifies the basic structure of the text, including the article title, sentence completion, and paragraph splitting. In this embodiment the main text contains 8 paragraphs and 23 sentences.
The word relatedness computation step is based on the computational semantics knowledge provided by HowNet: the relatedness of two words is obtained by computing the similarity of their sememes. The formula is:
$$R(w_1, w_2) = \max_{C_i \in w_1,\ C_j \in w_2} \mathrm{Rele}(C_i, C_j)$$
where $R(w_1, w_2)$ is the semantic relatedness between the two words and $\mathrm{Rele}(C_i, C_j)$ is the relatedness of two sememes; the maximum over all sememe pairs is taken as the semantic relatedness of the two words.
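The maximum over sememe pairs can be illustrated with the short sketch below. It does not use the real HowNet data or any HowNet API: `TOY_SEMEMES`, `same_sememe`, and the function signature are stand-ins invented for the example.

```python
# Hypothetical word-to-sememe dictionary standing in for a real lexicon.
TOY_SEMEMES = {
    "doctor": ["human", "medical"],
    "nurse": ["human", "medical"],
    "bank": ["institution", "money"],
}


def same_sememe(c1, c2):
    """Toy sememe relatedness (assumption): 1.0 if identical, else 0.0."""
    return 1.0 if c1 == c2 else 0.0


def word_relatedness(w1, w2, sememes, sememe_relatedness):
    """R(w1, w2): best relatedness over all sememe pairs of the two words."""
    pairs = [(ci, cj)
             for ci in sememes.get(w1, ())
             for cj in sememes.get(w2, ())]
    # Words missing from the dictionary yield no pairs and score 0.0.
    return max((sememe_relatedness(ci, cj) for ci, cj in pairs), default=0.0)


related = word_relatedness("doctor", "nurse", TOY_SEMEMES, same_sememe)
unrelated = word_relatedness("doctor", "bank", TOY_SEMEMES, same_sememe)
```

A real sememe relatedness function would return graded values rather than 0 or 1; the structure of the maximum stays the same.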
The sentence relatedness step obtains the relatedness of two sentences by analyzing the relatedness between their words:
$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$
where $R(s_1, s_2)$ is the relatedness between the two sentences. For each word in sentence 1, the most related word in sentence 2 is found and the relatedness between the two words is computed. These maxima are then summed to give the relatedness between the two sentences.
The semantic section splitting step performs text structure analysis with reference to the dissertation "Research on text structure analysis methods based on content relatedness computation". A cut point between semantic sections is characterized by the first sentence after it having low relatedness to the several sentences before it and higher relatedness to the several sentences after it. In this embodiment, the following formula is used to score the 22 cut points between the 23 sentences and to find the local maxima of the function $Q(p_i)$:
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
In this embodiment $Q(p_i)$ has 2 local maxima, and the news article is divided at these two points into 3 semantic sections. Each semantic section contains one sub-topic of the news: the first section is an overview of the news event, and the other two are the comments of two parties on the event.
The summary generation step extracts a summary from the plain text at a given ratio, according to the user's requirement.
In this embodiment, the summary generation step uses the sentence relatedness computation step to compute, for each sub-topic, the sum of the relatedness between its sentences and the word sequence of the article title, which determines the value of the sub-topic. The number of sentences extracted from a sub-topic is proportional to the relatedness between that sub-topic and the article title.
In this embodiment the user-specified ratio is 0.2, so 5 of the 23 sentences are extracted to form the summary. From the computed values of the 3 sub-topics, it is determined that 2, 1, and 1 sentences are extracted from the 3 semantic sections respectively. Finally, the summary generation step concatenates the 5 selected summary sentences in order to form and output the summary.
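One way to turn sub-topic values into per-section sentence counts is sketched below. The text only states that the counts are proportional to each sub-topic's relatedness to the title, so the rounding of the summary size and the largest-remainder rule used to distribute whole sentences are assumptions, as are the function names and the sample values.

```python
def summary_size(sentence_count, ratio):
    """Number of sentences to extract (rounding rule is an assumption)."""
    return round(sentence_count * ratio)


def allocate(total, values):
    """Split `total` sentences across sub-topics in proportion to `values`.

    Uses the largest-remainder method so the counts add up to `total`.
    """
    weight = sum(values)
    if weight == 0:
        return [0] * len(values)
    quotas = [total * v / weight for v in values]
    counts = [int(q) for q in quotas]
    # Hand the remaining sentences to the largest fractional parts.
    order = sorted(range(len(values)),
                   key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in order[: total - sum(counts)]:
        counts[k] += 1
    return counts


size = summary_size(23, 0.2)
# Hypothetical sub-topic values (relatedness of each section to the title).
counts = allocate(size, [4.0, 3.0, 3.0])
```

A fuller version would also cap each count at the number of sentences actually available in the corresponding semantic section.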

Claims (6)

1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:
1) input the URL of the web page to be summarized;
2) extract the main text from the web page to be summarized based on visual analysis, specifically:
2.1) parse and render the Web document with a browser core;
2.2) partition the web page into blocks with the visual tree algorithm, obtaining the position and area of each block;
2.3) segment the text of each block into words;
2.4) analyze the text features of each block;
2.5) score each block on whether it contains main text;
2.6) concatenate, in order, the text of the blocks whose score exceeds a threshold;
2.7) output the main text of the Web document;
3) automatically summarize the extracted main text based on text structure analysis, specifically:
3.1) obtain the web page main text from step 2);
3.2) perform word segmentation and part-of-speech tagging on the text;
3.3) preprocess the text: identify its basic structure, including the article title, sentence completion, and paragraph splitting;
3.4) split the text into semantic sections: identify by text structure analysis the positions where the semantics change, and use them as the boundaries of the semantic sections;
3.5) for each semantic section, measure the importance of each sentence within its section using an improved TFIDF method, and then, according to the required summary length, extract several sentences that represent the topic of the section;
3.6) concatenate the extracted sentences in order and output the summary.
2. The method according to claim 1, characterized in that the text features in step 2.4) are the character count, font size, number of declarative sentences, number of non-declarative sentences, and number of text paragraphs.
3. The method according to claim 1, characterized in that in step 2.5) each block is scored on whether it contains main text using the following formula:
$$V(s) = \frac{S^{2} \cdot P(x_1, y_1, x_2, y_2)}{N + 1}$$
where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value computed from the size and position of the block, $(x_1, y_1)$ are the coordinates of the block's upper-left corner, and $(x_2, y_2)$ are the coordinates of its lower-right corner.
4. The method according to claim 1, characterized in that in step 3.4) the positions where the semantics change are identified as follows:
1) split document D into sentences; the position between every two adjacent sentences is a candidate cut point;
2) score each candidate cut point with the formula:
$$Q(p_i) = \sum_{i+1 < j \le i+a} R(s_i, s_j) - \sum_{i-a \le j < i} R(s_i, s_j)$$
where $R(s_i, s_j)$ is the semantic relatedness between sentences $s_i$ and $s_j$, and $p_i$ denotes the cut point between sentences $s_{i-1}$ and $s_i$; if $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the cut-point weight, so $p_i$ is a cut point between semantic sections of the text; $a$ is an adjustable empirical parameter that sets the scope of the semantic analysis when identifying a cut point: $a$ sentences before and $a$ sentences after the cut point are considered;
3) if the score of a cut point exceeds a threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, the cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).
5. The method according to claim 4, characterized in that the semantic relatedness between sentences in step 2) is computed as follows:
1) split each sentence into a set of words;
2) compute the semantic relatedness between the sentences with the formula:
$$R(s_1, s_2) = \sum_{w_i \in s_1} \max_{w_j \in s_2} R(w_i, w_j)$$
where $R(w_i, w_j)$ is the semantic relatedness between words $w_i$ and $w_j$.
6. The method according to claim 1, characterized in that in step 3.5) the importance of each sentence within its semantic section is computed with the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
When computing TFIDF(w), each paragraph is treated as an independent document, and the paragraphs of the whole article are treated as the document collection.
CN201410090200.0A 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method Expired - Fee Related CN103853834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410090200.0A CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Publications (2)

Publication Number Publication Date
CN103853834A true CN103853834A (en) 2014-06-11
CN103853834B CN103853834B (en) 2017-02-08

Family

ID=50861489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410090200.0A Expired - Fee Related CN103853834B (en) 2014-03-12 2014-03-12 Text structure analysis-based Web document abstract generation method

Country Status (1)

Country Link
CN (1) CN103853834B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device
CN106484768A (en) * 2016-09-09 2017-03-08 天津海量信息技术股份有限公司 The local feature abstracting method of content of text salient region and system
CN106844340A (en) * 2017-01-10 2017-06-13 北京百度网讯科技有限公司 News in brief generation and display methods, apparatus and system based on artificial intelligence
CN107346335A (en) * 2017-06-28 2017-11-14 浙江大学 A kind of Web page subject block identifying method based on assemblage characteristic
CN107622046A (en) * 2017-09-01 2018-01-23 广州慧睿思通信息科技有限公司 A kind of algorithm according to keyword abstraction text snippet
CN107766325A (en) * 2017-09-27 2018-03-06 百度在线网络技术(北京)有限公司 Text joining method and its device
CN108427761A (en) * 2018-03-21 2018-08-21 腾讯科技(深圳)有限公司 A kind of method, terminal, server and the storage medium of media event processing
CN110889280A (en) * 2018-09-06 2020-03-17 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN110968752A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Data acquisition method and device, storage medium and electronic equipment
US10929452B2 (en) 2017-05-23 2021-02-23 Huawei Technologies Co., Ltd. Multi-document summary generation method and apparatus, and terminal
CN113515627A (en) * 2021-05-19 2021-10-19 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN114417808A (en) * 2022-02-25 2022-04-29 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
US20090210381A1 (en) * 2008-02-15 2009-08-20 Yahoo! Inc. Search result abstract quality using community metadata
CN102446191A (en) * 2010-10-13 2012-05-09 北京创新方舟科技有限公司 Method for generating webpage content abstracts and equipment and system adopting same

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
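何媛媛: "Research on multi-web-page automatic summarization based on latent semantic analysis", China Master's Theses Full-text Database, Information Science and Technology *
钟茂生: "Research on text structure analysis methods based on content relatedness computation", China Doctoral Dissertations Full-text Database, Information Science and Technology *
黄文蓓 et al.: "Research on a block-based algorithm for extracting web page main-text information", Journal of Computer Applications *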

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device
CN105677764B (en) * 2015-12-30 2020-05-08 百度在线网络技术(北京)有限公司 Information extraction method and device
CN106484768A (en) * 2016-09-09 2017-03-08 天津海量信息技术股份有限公司 The local feature abstracting method of content of text salient region and system
CN106484768B (en) * 2016-09-09 2019-12-31 天津海量信息技术股份有限公司 Local feature extraction method and system for text content saliency region
CN106844340B (en) * 2017-01-10 2020-04-07 北京百度网讯科技有限公司 News abstract generating and displaying method, device and system based on artificial intelligence
CN106844340A (en) * 2017-01-10 2017-06-13 北京百度网讯科技有限公司 News in brief generation and display methods, apparatus and system based on artificial intelligence
US10929452B2 (en) 2017-05-23 2021-02-23 Huawei Technologies Co., Ltd. Multi-document summary generation method and apparatus, and terminal
CN107346335A (en) * 2017-06-28 2017-11-14 浙江大学 A kind of Web page subject block identifying method based on assemblage characteristic
CN107346335B (en) * 2017-06-28 2020-04-14 浙江大学 Webpage theme block identification method based on combination characteristics
CN107622046A (en) * 2017-09-01 2018-01-23 广州慧睿思通信息科技有限公司 A kind of algorithm according to keyword abstraction text snippet
CN107766325A (en) * 2017-09-27 2018-03-06 百度在线网络技术(北京)有限公司 Text joining method and its device
CN108427761A (en) * 2018-03-21 2018-08-21 腾讯科技(深圳)有限公司 A kind of method, terminal, server and the storage medium of media event processing
CN108427761B (en) * 2018-03-21 2022-01-14 腾讯科技(深圳)有限公司 News event processing method, terminal, server and storage medium
CN110889280A (en) * 2018-09-06 2020-03-17 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN110889280B (en) * 2018-09-06 2023-09-26 上海智臻智能网络科技股份有限公司 Knowledge base construction method and device based on document splitting
CN110968752A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Data acquisition method and device, storage medium and electronic equipment
CN113515627A (en) * 2021-05-19 2021-10-19 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN113515627B (en) * 2021-05-19 2023-07-25 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN114417808A (en) * 2022-02-25 2022-04-29 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium
CN114417808B (en) * 2022-02-25 2023-04-07 北京百度网讯科技有限公司 Article generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103853834B (en) 2017-02-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

Termination date: 20200312
