CN103853834A - Text structure analysis-based Web document abstract generation method - Google Patents
- Publication number
- CN103853834A (application CN201410090200.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- sentence
- semantic
- cut
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL (uniform resource locator) as input, extracts the body text of the web page by combining visual features and text features, partitions the body text into several semantic paragraphs, and summarizes each semantic paragraph separately, so that the generated summary has a higher coverage rate. The method produces better-quality text summaries from web pages despite three difficulties: web page structure is complex, the body text is hard to identify, and automatic summarization of Chinese is still at an exploratory stage.
Description
Technical field
The present invention relates to the technical fields of web page body-text extraction, natural language processing, and Chinese text summarization, and specifically to a method for generating Web document summaries based on text structure analysis.
Background technology
At present, the Internet has become the main source from which people obtain information. With the rapid development of user-generated content (UGC) in recent years in particular, the information on the Internet is growing explosively. Search engines can return results that match a user's query, but the user still has to find the pages that best fit their needs in the result list. The large amount of search engine optimization and reposted content on the Internet makes it very difficult for users to find information quickly and accurately.
An automatic summarization system uses a computer to process Web documents quickly and extracts the core content of a Web document at a given compression ratio. From the summary, the user can grasp the topic and judge the value of the Web document, which improves the efficiency of information retrieval.
Web documents contain a large amount of noise, that is, information irrelevant to the topic, such as advertisements, navigation bars, user function bars, related recommendations, and copyright notices. A Web document is semi-structured: it has some structure, but its semantics cannot be determined from that structure. The way content is expressed in the HTML source differs greatly from the finally rendered page. With the widespread use of JS and AJAX in recent years, web page data is no longer static HTML code but is generated dynamically, and may even change in response to user actions. Extracting topic-relevant, correctly structured content from a Web document is therefore difficult.
Research on Chinese text summarization systems has a history of roughly two decades, but it is still at an exploratory stage, and the results of automatic summarization are far from satisfactory. Automatic summarization methods fall into two broad classes: understanding-based and extraction-based. Because natural language processing technology has not yet achieved a major breakthrough, understanding-based methods cannot yet truly realize automatic summarization.
Research on automatic summarization for Web documents has an even shorter history: "Compared with traditional text, the text structure of a web page is loose, titles are named less rigorously, a sentence may end without a terminating punctuation mark, and there is a large amount of content unrelated to the body text, all of which makes summary generation more difficult."
Summary of the invention
The object of the present invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines visual feature analysis, natural language analysis, and text structure analysis to generate a semantics-based, good-quality summary for each web page in a set of search results, for the user's reference.
The object of the present invention is achieved as follows:
A method for generating Web document summaries based on text structure analysis, comprising the following steps:
1) Input the URL of the web page to be summarized;
2) Extract the body text from the web page to be summarized based on visual analysis, specifically:
2.1) Parse and render the Web document with a browser engine;
2.2) Partition the web page into blocks with the visual tree (VIPS) algorithm, obtaining the position and area of each block;
2.3) Perform word segmentation on each block;
2.4) Analyze the text features of each block;
2.5) Score each block on whether it contains body text;
2.6) Concatenate, in order, the text of the blocks whose score is above a certain threshold;
2.7) Output the body text of the Web document;
3) Perform automatic summarization based on text structure analysis on the extracted body text, specifically:
3.1) Obtain the web page body text from step 2);
3.2) Perform word segmentation and part-of-speech tagging on the body text;
3.3) Preprocess the text: identify the basic structure of the text, namely the article title, sentence completion, and paragraph splitting;
3.4) Divide the body text into semantic sections: identify, through text structure analysis, the positions where the semantics change, and use them as the boundaries between semantic sections;
3.5) For each semantic section, use an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extract several sentences that represent the topic of that semantic section;
3.6) Concatenate the extracted sentences in order and output the summary.
The text features in step 2.4) are the word count, the font size, the number of declarative sentences, the number of non-declarative sentences, and the number of text paragraphs.
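As an illustration only, the countable text features of step 2.4) could be gathered roughly as in the Python sketch below. It rests on assumptions that are not stated in the source: a sentence is classified purely by its final punctuation mark, the word count is approximated by the number of non-whitespace characters, and every non-empty line counts as a paragraph. The function names are hypothetical, and font size is omitted because it would come from the rendering step rather than from the text.

```python
import re

# Assumption: sentences ending in a question or exclamation mark are "non-declarative".
NON_DECLARATIVE_END = "？！?!"


def count_sentence_types(text: str) -> tuple:
    """Return (declarative, non_declarative) sentence counts for one block."""
    # A sentence is a run of non-terminator characters followed by one terminator.
    sentences = re.findall(r"[^。.？！?!]+[。.？！?!]", text)
    non_declarative = sum(1 for s in sentences if s[-1] in NON_DECLARATIVE_END)
    return len(sentences) - non_declarative, non_declarative


def block_text_features(text: str) -> dict:
    """Collect the text-derived features of one block (font size not included)."""
    declarative, non_declarative = count_sentence_types(text)
    return {
        # Approximation: count non-whitespace characters.
        "char_count": len("".join(text.split())),
        "declarative_sentences": declarative,
        "non_declarative_sentences": non_declarative,
        # Assumption: each non-empty line is one paragraph.
        "paragraphs": sum(1 for line in text.splitlines() if line.strip()),
    }
```

A navigation bar such as "首页 新闻 体育" contains no terminated sentences, so both sentence counts are zero, which is the kind of signal that separates it from body text. Note that this simple pattern would also split on decimal points in numbers.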
In step 2.5), each block is scored on whether it contains body text; the score is calculated with the following formula:
where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value calculated from the size and position of the block, $(x_1, y_1)$ are the coordinates of the block's upper-left corner, and $(x_2, y_2)$ are the coordinates of its lower-right corner.
The positions where the semantics change in step 3.4) are identified as follows:
1) Split document D into sentences; the boundary between every two adjacent sentences is a candidate cut point;
2) Score each candidate cut point with the following formula:
where $R(s_i, s_j)$ is the semantic relatedness between sentences $s_i$ and $s_j$, and $p_i$ denotes the cut point between sentences $s_{i-1}$ and $s_i$. If $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the cut-point weight, so $p_i$ is a cut point between semantic sections in the text. $a$ is an adjustable empirical parameter that sets the scope of the semantic analysis when identifying cut points: $a$ sentences on each side of the cut point are considered.
3) If the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, then this cut point is a boundary between semantic sections, i.e., a position where the semantics change as described in step 3.4).
The analysis identification step 2 of the position that described semanteme changes) between sentence the calculating of semantic relevancy comprise the following steps:
1) sentence is cut into the set of word;
2) use following formula to calculate semantic relevancy between sentence
Wherein R (w
i, w
j) expression word w
iwith word w
jword between semantic relevancy.
In step 3.5), the importance of each sentence within its semantic section is calculated with the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
where, when calculating $\mathrm{TFIDF}(w)$, each paragraph is treated as an independent document and the paragraphs of the whole article are treated as the document collection.
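The sentence-importance measure can be sketched in Python as follows. This is a minimal sketch under assumptions: the source does not give the exact TF and IDF definitions, so the common variant (term frequency normalized by paragraph length, IDF as the logarithm of the number of paragraphs divided by the number of paragraphs containing the word) is assumed, and the sum runs over the distinct words of a sentence. The function name `sentence_importance` is hypothetical.

```python
import math
from collections import Counter


def sentence_importance(paragraphs):
    """Score every sentence by the summed TF-IDF of its distinct words.

    paragraphs: list of paragraphs; a paragraph is a list of sentences;
    a sentence is a list of already segmented words.
    Returns a nested list of scores with the same shape as the input.
    """
    # Each paragraph is treated as one document of the collection.
    para_counts = [Counter(w for sent in para for w in sent) for para in paragraphs]
    n_paras = len(paragraphs)

    # Document frequency: in how many paragraphs does each word occur?
    df = Counter()
    for counts in para_counts:
        df.update(counts.keys())

    scores = []
    for para, counts in zip(paragraphs, para_counts):
        total = sum(counts.values())

        def tfidf(word):
            tf = counts[word] / total               # assumed TF variant
            idf = math.log(n_paras / df[word])      # assumed IDF variant
            return tf * idf

        scores.append([sum(tfidf(w) for w in set(sent)) for sent in para])
    return scores
```

With this IDF variant, a word that appears in every paragraph contributes nothing, so sentences made up only of article-wide vocabulary score zero; a smoothed IDF would be an alternative if that behaviour is undesirable.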
The present invention filters out words, links, and other content irrelevant to the topic of a web page and identifies the article body text contained in the page with high accuracy and high robustness. The automatic summarization flow uses an automatic summarization technique based on text structure analysis, so the generated summaries have high coverage and read fairly smoothly.
For a Web document, given a compression ratio specified by the user, the present invention needs only the URL of the web page to be summarized as input and, within a few seconds, produces a fairly accurate and fluent summary that covers the meaning of the original text, helping users find information on the Internet quickly and accurately.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of web page preprocessing in the present invention;
Fig. 3 is the flowchart of automatic summarization in the present invention.
Embodiment
The invention discloses a search-engine-oriented method for generating Web document summaries, which automatically analyzes a web page and generates a text summary reflecting the topic of the page.
The present invention comprises a web page body-text extraction stage that combines visual features and text features, and an automatic text summarization stage based on sub-topic division through text structure analysis.
The present invention takes a URL as input and, through the two stages of body-text extraction and automatic summarization, finally generates a text summary.
The specific algorithms of the two stages are described further below, using the summarization of a news web page as an example:
Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, comprising the web page preprocessing flow and the automatic summarization flow.
Specifically, in this embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the web page preprocessing flow locates the body part of the page more accurately and is more robust than other methods. It also takes into account other features, such as text features, text relatedness analysis, HTML tag features, and semantic features, to further improve the accuracy of body-text extraction.
The page rendering step reads the web page corresponding to the input URL; in this embodiment, the IE11 browser engine processes the HTML tags and renders the page. On the rendered page, the visual tree analysis step applies the VIPS algorithm to obtain the position and area of each block. In this embodiment, this step divides the news web page to be summarized into 6 blocks: a top block, a bottom block, a navigation block, an advertisement block, and two blocks containing body text. The word segmentation step segments the text of each block. The text feature analysis step then analyzes text features on the segmentation results. Finally, the comprehensive analysis step combines the visual-tree features and the text features of each block and outputs the body text.
In this embodiment, $P(x_1, y_1, x_2, y_2)$ is calculated with the following formula:
$$P(x_1, y_1, x_2, y_2) = (x_2 - x_1)(y_2 - y_1) - x_1 y_1$$
where $(x_1, y_1)$ are the coordinates of the block's upper-left corner and $(x_2, y_2)$ are the coordinates of its lower-right corner. The V(s) value of each block is then calculated:
The V(s) values of the 6 blocks above, from largest to smallest, are $3.7 \times 10^6$, $2.3 \times 10^6$, $7.5 \times 10^5$, $5.4 \times 10^6$, $3.7 \times 10^5$, $1.6 \times 10^5$, and $1.2 \times 10^4$.
In this embodiment, the threshold is $10^6$, so the blocks whose V(s) is greater than $10^6$ are selected, i.e., the two blocks with the largest V(s) values. In this embodiment, these two blocks are exactly the two blocks containing body text, so the body text is extracted correctly.
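The position term and the threshold-based selection of this embodiment can be sketched in Python as follows. This is only a sketch under stated assumptions: the V(s) formula itself is not reproduced in this text, so each block's V(s) is taken as an already computed input; the function names and the sample coordinates are hypothetical; and joining the selected blocks with a newline is one arbitrary way of connecting them in order.

```python
def position_score(x1: float, y1: float, x2: float, y2: float) -> float:
    """P(x1, y1, x2, y2) = (x2 - x1) * (y2 - y1) - x1 * y1.

    (x1, y1) is the upper-left corner and (x2, y2) the lower-right corner,
    so the first term is the block area and the second term penalizes blocks
    whose upper-left corner lies far from the page origin.
    """
    return (x2 - x1) * (y2 - y1) - x1 * y1


def select_body_text(blocks, threshold: float = 1e6) -> str:
    """Concatenate, in page order, the text of blocks whose V(s) exceeds the threshold.

    blocks: list of (text, v_score) pairs in page order; v_score is assumed
    to have been computed elsewhere.
    """
    return "\n".join(text for text, v_score in blocks if v_score > threshold)


# Hypothetical usage with made-up block texts and scores.
page_blocks = [
    ("navigation", 7.5e5),
    ("body part 1", 3.7e6),
    ("advertisement", 3.7e5),
    ("body part 2", 2.3e6),
]
body = select_body_text(page_blocks)
```

Keeping the blocks in page order before filtering matters here, because step 2.6) connects the selected text in sequence rather than by score.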
After the body text is extracted, the automatic summarization flow (see Fig. 3) is carried out. It comprises the following steps: text preprocessing, word relatedness calculation, sentence relatedness calculation, semantic section segmentation, and summary generation.
The text preprocessing step identifies the basic structure of the text: the article title, sentence completion, and paragraph splitting. In this embodiment, the body text contains 8 paragraphs and 23 sentences in total.
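One possible reading of the preprocessing step is sketched below in Python. The source does not say how sentences are delimited or what "sentence completion" involves, so the sketch assumes that sentences end at Chinese or Western full-width and half-width end marks, that paragraphs are separated by line breaks, and that a trailing fragment without an end mark is completed by appending a default mark. The function names are hypothetical.

```python
import re

END_MARKS = "。！？!?"


def split_paragraphs(text: str) -> list:
    """Treat every non-empty line as one paragraph (an assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_sentences(paragraph: str, default_end: str = "。") -> list:
    """Split a paragraph into sentences, keeping each end mark with its sentence."""
    # A run of non-end-mark characters followed by any end marks that close it.
    parts = re.findall(r"[^。！？!?]+[。！？!?]*", paragraph)
    sentences = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part[-1] not in END_MARKS:
            # Assumed form of "sentence completion": add the missing end mark.
            part += default_end
        sentences.append(part)
    return sentences
```

For example, `split_sentences("今天下雨。明天晴")` yields two sentences, the second completed with a full stop, which matches the background observation that web sentences may lack an end mark.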
The word relatedness calculation step is based on the computational semantics knowledge provided by HowNet: it obtains the relatedness of two words by calculating the similarity of their sememes. The formula is:
$$R(w_1, w_2) = \max_{C_i \in w_1,\ C_j \in w_2} \mathrm{Rele}(C_i, C_j)$$
where $R(w_1, w_2)$ is the semantic relatedness between the two words and $\mathrm{Rele}(C_i, C_j)$ is the relatedness of two sememes; the maximum over all sememe pairs is taken as the semantic relatedness of the two words.
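The maximum over sememe pairs can be illustrated with the short Python sketch below. It does not use any real HowNet interface: the sememe lists are passed in directly, and `toy_rele` with its small lookup table is a hypothetical stand-in for the sememe relatedness Rele. Returning 0.0 for a word without sememes is also an assumption.

```python
from itertools import product


def word_relatedness(sememes1, sememes2, rele) -> float:
    """R(w1, w2): the largest relatedness over all sememe pairs of the two words."""
    if not sememes1 or not sememes2:
        return 0.0  # assumption: a word with no known sememes is unrelated
    return max(rele(c1, c2) for c1, c2 in product(sememes1, sememes2))


# Hypothetical sememe relatedness table, for illustration only.
TOY_TABLE = {
    frozenset(("human", "occupation")): 0.6,
    frozenset(("human", "animal")): 0.3,
}


def toy_rele(c1: str, c2: str) -> float:
    """Toy stand-in for Rele(Ci, Cj): 1.0 for identical sememes, else a table lookup."""
    if c1 == c2:
        return 1.0
    return TOY_TABLE.get(frozenset((c1, c2)), 0.0)
```

Taking the maximum means two words count as related as soon as any one pair of their sememes is related, even if the remaining sememes have nothing in common.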
The sentence relatedness step obtains the relatedness of two sentences by analyzing the relatedness between the words in the two sentences.
where $R(s_1, s_2)$ is the relatedness between the two sentences: for each word in sentence 1, find the word in sentence 2 with which it has the greatest relatedness and calculate the relatedness between these two words. Finally, these maxima are summed to give the relatedness between the two sentences.
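The verbal description above (a best match in sentence 2 for each word of sentence 1, then a sum of those maxima) can be sketched in Python as follows. The sketch follows the prose only; any normalization that the original formula may contain is not reproduced, and the exact-match word relatedness used in the example is a hypothetical placeholder for the sememe-based measure.

```python
def sentence_relatedness(words1, words2, word_rel) -> float:
    """Sum, over the words of sentence 1, of the best relatedness to any word of sentence 2."""
    if not words2:
        return 0.0  # assumption: nothing to match against
    return sum(max(word_rel(w1, w2) for w2 in words2) for w1 in words1)


def exact_match(w1: str, w2: str) -> float:
    """Hypothetical word relatedness: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if w1 == w2 else 0.0
```

As written, the measure is not symmetric: swapping the two sentences can change the result when they differ in length or contain repeated words, so the argument order should be kept consistent wherever it is used.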
The semantic section segmentation step performs text structure analysis following the document "Research on Text Structure Analysis Methods Based on Content Relatedness Computation". A cut point between semantic sections is characterized by the first sentence after the cut point having very low relatedness to the several sentences before it and higher relatedness to the several sentences after it. In this embodiment, the following formula is used to score the 22 cut points between the 23 sentences and to find the local maxima of the function $Q(p_i)$:
In this embodiment, $Q(p_i)$ has 2 local maxima, and the news article is divided at these two points into 3 semantic sections. Each semantic section contains one sub-topic of the news article: in this embodiment, the first semantic section is an overview of the news event, and the last two semantic sections are the respective comments of two parties on the event.
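Once the cut-point scores are available, selecting the boundaries and splitting the text can be sketched in Python as follows. The scoring formula Q itself is not reproduced in this text, so the scores are taken as given inputs. The sketch also assumes that the first and last cut points, which have only one neighbour, are never selected, and the sample scores are made up for illustration.

```python
def find_cut_points(q, threshold: float) -> list:
    """Return the indices k where q[k] exceeds the threshold and both neighbours.

    q[k] is the score of the boundary between sentence k and sentence k + 1
    (0-based), so n sentences give n - 1 scores.
    """
    return [
        k
        for k in range(1, len(q) - 1)
        if q[k] > threshold and q[k] > q[k - 1] and q[k] > q[k + 1]
    ]


def split_into_sections(sentences, cuts) -> list:
    """Split the sentence list after each selected boundary index."""
    sections, start = [], 0
    for k in cuts:
        sections.append(sentences[start:k + 1])
        start = k + 1
    sections.append(sentences[start:])
    return sections


# Hypothetical scores for 8 sentences (7 candidate cut points).
scores = [0.1, 0.2, 0.9, 0.3, 0.2, 0.8, 0.1]
sentences = [f"s{i}" for i in range(8)]
sections = split_into_sections(sentences, find_cut_points(scores, threshold=0.5))
```

Two local maxima above the threshold give three sections, mirroring the 2 maxima and 3 semantic sections of the news example.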
The summary generation step extracts a summary from the plain-text body at a given ratio, according to the user's requirement.
In this embodiment, the summary generation step uses the sentence relatedness calculation step to compute, for each sub-topic, the sum of the relatedness between its sentences and the word sequence of the article title, which determines the value of each sub-topic. The number of sentences extracted from a sub-topic is proportional to the relatedness between that sub-topic and the article title.
In this embodiment, the user-specified ratio is 0.2, i.e., 5 of the 23 sentences are extracted to form the summary. From the calculated values of the 3 sub-topics, it is determined that 2, 1, and 1 sentences are extracted from the 3 semantic sections respectively. Finally, the summary generation step concatenates the 5 selected summary sentences in order to form and output the summary.
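How a summary length and per-section quotas might be derived is sketched below in Python. The source states only that the quota is proportional to each sub-topic's relatedness to the title; the rounding of the summary length and the largest-remainder rule used to turn proportional shares into whole sentences are assumptions, as are the function names and the sample sub-topic values.

```python
import math


def summary_length(n_sentences: int, ratio: float) -> int:
    """Number of sentences to extract; rounding to the nearest integer is an assumption."""
    return max(1, round(n_sentences * ratio))


def allocate_quotas(values, total: int) -> list:
    """Distribute `total` sentences over sections in proportion to their values.

    Uses the largest-remainder rule (an assumption): give each section the
    floor of its proportional share, then hand the leftover sentences to the
    sections with the largest fractional parts.
    """
    value_sum = sum(values)
    if value_sum <= 0:
        raise ValueError("at least one section must have a positive value")
    shares = [total * v / value_sum for v in values]
    quotas = [math.floor(share) for share in shares]
    leftover = total - sum(quotas)
    by_fraction = sorted(
        range(len(values)), key=lambda i: shares[i] - quotas[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        quotas[i] += 1
    return quotas
```

For instance, `summary_length(23, 0.2)` gives 5, and hypothetical sub-topic values of 6.0, 2.5, and 1.5 would be allocated 3, 1, and 1 sentences. A fuller version would also cap each quota at the number of sentences the section actually contains.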
Claims (6)
1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:
1) Input the URL of the web page to be summarized;
2) Extract the body text from the web page to be summarized based on visual analysis, specifically:
2.1) Parse and render the Web document with a browser engine;
2.2) Partition the web page into blocks with the visual tree algorithm, obtaining the position and area of each block;
2.3) Perform word segmentation on each block;
2.4) Analyze the text features of each block;
2.5) Score each block on whether it contains body text;
2.6) Concatenate, in order, the text of the blocks whose score is above a certain threshold;
2.7) Output the body text of the Web document;
3) Perform automatic summarization based on text structure analysis on the extracted body text, specifically:
3.1) Obtain the web page body text from step 2);
3.2) Perform word segmentation and part-of-speech tagging on the body text;
3.3) Preprocess the text: identify the basic structure of the text, namely the article title, sentence completion, and paragraph splitting;
3.4) Divide the body text into semantic sections: identify, through text structure analysis, the positions where the semantics change, and use them as the boundaries between semantic sections;
3.5) For each semantic section, use an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extract several sentences that represent the topic of that semantic section;
3.6) Concatenate the extracted sentences in order and output the summary.
2. The method according to claim 1, characterized in that the text features in step 2.4) are the word count, the font size, the number of declarative sentences, the number of non-declarative sentences, and the number of text paragraphs.
3. The method according to claim 1, characterized in that in step 2.5) each block is scored on whether it contains body text, the score being calculated with the following formula:
where S is the number of declarative sentences, N is the number of non-declarative sentences, P is a value calculated from the size and position of the block, $(x_1, y_1)$ are the coordinates of the block's upper-left corner, and $(x_2, y_2)$ are the coordinates of its lower-right corner.
4. The method according to claim 1, characterized in that the positions where the semantics change in step 3.4) are identified as follows:
1) Split document D into sentences; the boundary between every two adjacent sentences is a candidate cut point;
2) Score each candidate cut point with the following formula:
where $R(s_i, s_j)$ is the semantic relatedness between sentences $s_i$ and $s_j$; $p_i$ denotes the cut point between sentences $s_{i-1}$ and $s_i$; if $Q(p_i) > Q(p_{i-1})$ and $Q(p_i) > Q(p_{i+1})$, then $p_i$ is a local maximum of the cut-point weight, so $p_i$ is a cut point between semantic sections in the text; $a$ is an adjustable empirical parameter that sets the scope of the semantic analysis when identifying cut points: $a$ sentences on each side of the cut point are considered;
3) If the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, then this cut point is a boundary between semantic sections, i.e., a position where the semantics change as described in step 3.4).
5. The method according to claim 4, characterized in that the semantic relatedness between sentences in step 2) is calculated as follows:
1) Split each sentence into a set of words;
2) Calculate the semantic relatedness between sentences with the following formula:
where $R(w_i, w_j)$ is the semantic relatedness between words $w_i$ and $w_j$.
6. The method according to claim 1, characterized in that in step 3.5) the importance of each sentence within its semantic section is calculated with the following formula:
$$V(S_1) = \sum_{w \in S_1} \mathrm{TFIDF}(w)$$
where, when calculating $\mathrm{TFIDF}(w)$, each paragraph is treated as an independent document and the paragraphs of the whole article are treated as the document collection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410090200.0A CN103853834B (en) | 2014-03-12 | 2014-03-12 | Text structure analysis-based Web document abstract generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410090200.0A CN103853834B (en) | 2014-03-12 | 2014-03-12 | Text structure analysis-based Web document abstract generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103853834A true CN103853834A (en) | 2014-06-11 |
CN103853834B CN103853834B (en) | 2017-02-08 |
Family
ID=50861489
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410090200.0A Expired - Fee Related CN103853834B (en) | 2014-03-12 | 2014-03-12 | Text structure analysis-based Web document abstract generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103853834B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677764A (en) * | 2015-12-30 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
CN106484768A (en) * | 2016-09-09 | 2017-03-08 | 天津海量信息技术股份有限公司 | The local feature abstracting method of content of text salient region and system |
CN106844340A (en) * | 2017-01-10 | 2017-06-13 | 北京百度网讯科技有限公司 | News in brief generation and display methods, apparatus and system based on artificial intelligence |
CN107346335A (en) * | 2017-06-28 | 2017-11-14 | 浙江大学 | A kind of Web page subject block identifying method based on assemblage characteristic |
CN107622046A (en) * | 2017-09-01 | 2018-01-23 | 广州慧睿思通信息科技有限公司 | A kind of algorithm according to keyword abstraction text snippet |
CN107766325A (en) * | 2017-09-27 | 2018-03-06 | 百度在线网络技术(北京)有限公司 | Text joining method and its device |
CN108427761A (en) * | 2018-03-21 | 2018-08-21 | 腾讯科技(深圳)有限公司 | A kind of method, terminal, server and the storage medium of media event processing |
CN110889280A (en) * | 2018-09-06 | 2020-03-17 | 上海智臻智能网络科技股份有限公司 | Knowledge base construction method and device based on document splitting |
CN110968752A (en) * | 2018-09-28 | 2020-04-07 | 珠海格力电器股份有限公司 | Data acquisition method and device, storage medium and electronic equipment |
US10929452B2 (en) | 2017-05-23 | 2021-02-23 | Huawei Technologies Co., Ltd. | Multi-document summary generation method and apparatus, and terminal |
CN113515627A (en) * | 2021-05-19 | 2021-10-19 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
CN114417808A (en) * | 2022-02-25 | 2022-04-29 | 北京百度网讯科技有限公司 | Article generation method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1536483A (en) * | 2003-04-04 | 2004-10-13 | 陈文中 | Method for extracting and processing network information and its system |
US20090210381A1 (en) * | 2008-02-15 | 2009-08-20 | Yahoo! Inc. | Search result abstract quality using community metadata |
CN102446191A (en) * | 2010-10-13 | 2012-05-09 | 北京创新方舟科技有限公司 | Method for generating webpage content abstracts and equipment and system adopting same |
Non-Patent Citations (3)
Title |
---|
何媛媛: ""基于潜在语义分析的多网页自动文摘研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
钟茂生: ""基于内容相关度计算的文本结构分析方法研究"", 《中国博士学位论文全文数据库信息科技辑》 * |
黄文蓓 等: ""基于分块的网页正文信息提取算法研究"", 《计算机应用》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677764A (en) * | 2015-12-30 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
CN105677764B (en) * | 2015-12-30 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
CN106484768A (en) * | 2016-09-09 | 2017-03-08 | 天津海量信息技术股份有限公司 | The local feature abstracting method of content of text salient region and system |
CN106484768B (en) * | 2016-09-09 | 2019-12-31 | 天津海量信息技术股份有限公司 | Local feature extraction method and system for text content saliency region |
CN106844340B (en) * | 2017-01-10 | 2020-04-07 | 北京百度网讯科技有限公司 | News abstract generating and displaying method, device and system based on artificial intelligence |
CN106844340A (en) * | 2017-01-10 | 2017-06-13 | 北京百度网讯科技有限公司 | News in brief generation and display methods, apparatus and system based on artificial intelligence |
US10929452B2 (en) | 2017-05-23 | 2021-02-23 | Huawei Technologies Co., Ltd. | Multi-document summary generation method and apparatus, and terminal |
CN107346335A (en) * | 2017-06-28 | 2017-11-14 | 浙江大学 | A kind of Web page subject block identifying method based on assemblage characteristic |
CN107346335B (en) * | 2017-06-28 | 2020-04-14 | 浙江大学 | Webpage theme block identification method based on combination characteristics |
CN107622046A (en) * | 2017-09-01 | 2018-01-23 | 广州慧睿思通信息科技有限公司 | A kind of algorithm according to keyword abstraction text snippet |
CN107766325A (en) * | 2017-09-27 | 2018-03-06 | 百度在线网络技术(北京)有限公司 | Text joining method and its device |
CN108427761A (en) * | 2018-03-21 | 2018-08-21 | 腾讯科技(深圳)有限公司 | A kind of method, terminal, server and the storage medium of media event processing |
CN108427761B (en) * | 2018-03-21 | 2022-01-14 | 腾讯科技(深圳)有限公司 | News event processing method, terminal, server and storage medium |
CN110889280A (en) * | 2018-09-06 | 2020-03-17 | 上海智臻智能网络科技股份有限公司 | Knowledge base construction method and device based on document splitting |
CN110889280B (en) * | 2018-09-06 | 2023-09-26 | 上海智臻智能网络科技股份有限公司 | Knowledge base construction method and device based on document splitting |
CN110968752A (en) * | 2018-09-28 | 2020-04-07 | 珠海格力电器股份有限公司 | Data acquisition method and device, storage medium and electronic equipment |
CN113515627A (en) * | 2021-05-19 | 2021-10-19 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
CN113515627B (en) * | 2021-05-19 | 2023-07-25 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
CN114417808A (en) * | 2022-02-25 | 2022-04-29 | 北京百度网讯科技有限公司 | Article generation method and device, electronic equipment and storage medium |
CN114417808B (en) * | 2022-02-25 | 2023-04-07 | 北京百度网讯科技有限公司 | Article generation method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103853834B (en) | 2017-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103853834B (en) | Text structure analysis-based Web document abstract generation method | |
CN104933027B (en) | A kind of open Chinese entity relation extraction method of utilization dependency analysis | |
US8463786B2 (en) | Extracting topically related keywords from related documents | |
Peters et al. | Content extraction using diverse feature sets | |
TWI695277B (en) | Automatic website data collection method | |
CN103927397B (en) | Recognition method for Web page link blocks based on block tree | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
US20230229714A1 (en) | Identifying Information Using Referenced Text | |
Piperski et al. | Big and diverse is beautiful: A large corpus of Russian to study linguistic variation | |
Asadi et al. | Pseudo test collections for learning web search ranking functions | |
CN104035972B (en) | A kind of knowledge recommendation method and system based on microblogging | |
CN103294781A (en) | Method and equipment used for processing page data | |
CN103838796A (en) | Webpage structured information extraction method | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
CN102750390A (en) | Automatic news webpage element extracting method | |
CN101887443A (en) | Method and device for classifying texts | |
CN103559234A (en) | System and method for automated semantic annotation of RESTful Web services | |
CN107479879A (en) | The API and its use recommendation method that a kind of software-oriented function is safeguarded | |
JP5427694B2 (en) | Related content presentation apparatus and program | |
Nethra et al. | WEB CONTENT EXTRACTION USING HYBRID APPROACH. | |
CN106168947A (en) | A kind of related entities method for digging and system | |
CN103377207B (en) | Microblog users relation acquisition method based on script engine | |
Lin et al. | Combining a segmentation-like approach and a density-based approach in content extraction | |
CN109933791A (en) | Material recommended method, device, computer equipment and computer readable storage medium | |
Conde et al. | Inferring user intent in web search by exploiting social annotations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | | Granted publication date: 20170208 Termination date: 20200312 |