CN110069622A - A kind of personal share bulletin abstract intelligent extract method - Google Patents

A kind of personal share bulletin abstract intelligent extract method

Info

Publication number
CN110069622A
Authority
CN
China
Prior art keywords
paragraph
abstract
bulletin
template
personal share
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710646956.2A
Other languages
Chinese (zh)
Inventor
方明
陈平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Ding Ting Information Technology Co Ltd
Original Assignee
Wuhan Ding Ting Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Ding Ting Information Technology Co Ltd
Priority to CN201710646956.2A
Publication of CN110069622A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Abstract

This disclosure describes a method that extracts the summary of an individual-stock announcement (rendered "personal share bulletin" in the title) by combining table extraction with text-paragraph similarity. It follows a separate-then-merge strategy: the tables and the plain text of the announcement are separated, the tables are converted to structured form, and the plain text is divided into paragraphs. Then, using a predefined summary template (a keyword template), the key indicator data are extracted from the structured tables and filled into the template. Among the divided paragraphs, the top N most similar to the template are taken as candidate summary paragraphs; if a keyword cannot be matched in the structured tables, the most similar paragraph among the candidates is used as a sub-summary. According to the applicant, the method greatly improves summary accuracy and the efficiency of human editors, and extraction accuracy is raised through continuous feedback until the process is fully automated.

Description

A kind of personal share bulletin abstract intelligent extract method
Technical field
The present invention relates to the field of computer software, and in particular to extracting summary information from individual-stock announcements published by listed companies.
Background
At present, individual-stock announcements come in many types. Each type emphasizes different events, and each type comprises a large number of announcements. For investors, acting in their own interest, keeping up with the announcements disclosed by listed companies in a timely manner has become pressing. However, announcements of every type are numerous and lengthy. Investors only want to understand the core events and data they contain (the summary), rather than spend a great deal of time and energy downloading and browsing each announcement.
The technical approach currently used for this problem is event information extraction based on event frames, which matches specific events and data in an announcement with a full set of regular expressions (expert rules). This approach has a low ceiling: the expert rules are tedious to formulate, cannot cover every case, and are error-prone, so matching performance is poor. As a result, many companies still extract announcement summaries mainly by hand, which is inefficient.
Study of individual-stock announcements shows that their content consists mainly of tabular data and text, and that the table entries are presented in a highly standardized way with very similar structure. We therefore propose an intelligent extraction method that specifically extracts the table data of an announcement and the core paragraphs or sentences of the remaining text (with tables removed), and then organizes them into a summary according to a template (specified by the domain product staff).
Summary of the invention
The purpose of the method is to address the shortcomings of current approaches, namely high cost and low efficiency, by providing a way to generate a customized summary directly, quickly, and effectively.
To solve the above problems, the method adopts the following technical solution:
First, the content of the individual-stock announcement is converted to HTML format by a suitable conversion technique;
Then, the table tags in the HTML are identified, and the entries and data of each table are extracted by further splitting the row (tr) and cell (td) tags inside the table tag;
Next, the remaining text of the HTML (with HTML tags removed) is extracted and split into sentences at punctuation marks; each sentence is segmented into keywords, and the BM25 algorithm is used to extract the several sentences most similar to the given template;
Finally, the extracted sentences and the table entry data are organized into a summary.
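The separation step above can be sketched as follows. This is a minimal illustration that uses Python's standard `html.parser` module, not the authors' implementation; the names `TableTextSplitter`, `split_sentences`, and `separate`, as well as the particular set of sentence-ending punctuation marks, are assumptions made for the example. Nested tables are not treated specially.

```python
import re
from html.parser import HTMLParser


class TableTextSplitter(HTMLParser):
    """Collect <table> contents as rows of cells and all other text as plain text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # list of tables; each table is a list of rows
        self.text_parts = []  # text found outside any table
        self._depth = 0       # nesting depth of <table> tags
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append([])
        elif self._depth and tag == "tr":
            self._row = []
        elif self._depth and tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth:
            self._depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)       # text inside a table cell
        elif self._depth == 0:
            self.text_parts.append(data)  # text outside every table


def split_sentences(text, delimiters="。！？；!?;"):
    """Split text into sentences at the given punctuation marks (an assumed set)."""
    pieces = re.split("[" + re.escape(delimiters) + "]", text)
    return [p.strip() for p in pieces if p.strip()]


def separate(html):
    """Return (tables, sentences) for one HTML announcement."""
    parser = TableTextSplitter()
    parser.feed(html)
    parser.close()
    return parser.tables, split_sentences("".join(parser.text_parts))
```

Calling `separate(html)` on a converted announcement yields the table rows and the sentence list that the later steps consume. The ASCII period is left out of the delimiter set on purpose, since it would also split decimal numbers.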
By adopting the above approach, the method has the following beneficial effects:
(1) The method extracts the tables of an announcement and can therefore obtain detailed entry data, with high accuracy, high speed, and good extensibility;
(2) Sentences similar to the specified template are found with a text similarity algorithm, so no complicated rules need to be formulated;
(3) The specified template (specified by the domain product staff) only needs to contain keywords; no expert rules are required.
Brief description of the drawings
Fig. 1 is the system framework diagram of the method.
Fig. 2 is the detailed implementation flowchart of the method.
Fig. 3 is a diagram of the automated product-side implementation of the method.
Fig. 4 shows an example of a periodic announcement.
Detailed description of the embodiments
The system architecture of the method is shown in Fig. 1. The function of each module is as follows:
1: Configure the crawl source URLs and the crawl rules;
2: Crawl announcements according to the configured source URLs and crawl rules;
3: Convert the crawled announcements to HTML format using the open-source PDF2HTML library;
4: Clean redundant tags, styles, and similar markup from the HTML;
5: Extract the table tags from the HTML and store them in a list, tableList;
6: Extract the plain text of the HTML, split it at the configured punctuation marks, and store the result in a list, sentenceList;
7: Format each table: extract the entries and their data, and store them as <Key, Value> pairs;
8: Using the preset summary keyword template, extract data from tableList by keyword and fill the template. Where no data can be extracted, find the most similar sentence in sentenceList as a replacement.
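Modules 7 and 8 can be illustrated with the short sketch below. It is written under stated assumptions rather than taken from the patent: each table row is assumed to hold the entry name in its first non-empty cell and the value in the second, a keyword is matched to an entry by substring, and `char_overlap` is only a placeholder for the similarity measure (the patent uses BM25). All function names are hypothetical.

```python
def table_to_pairs(table):
    """Turn rows of cells into {entry: value} pairs.

    Assumption: the first non-empty cell is the entry name and the
    second non-empty cell is its value.
    """
    pairs = {}
    for row in table:
        cells = [c for c in row if c]
        if len(cells) >= 2:
            pairs[cells[0]] = cells[1]
    return pairs


def fill_template(keywords, table_list, sentence_list, similarity):
    """Fill each template keyword from the tables, else from the closest sentence."""
    pairs = {}
    for table in table_list:
        pairs.update(table_to_pairs(table))

    summary = {}
    for keyword in keywords:
        # Assumption: a keyword matches an entry when it is a substring of it.
        match = next((k for k in pairs if keyword in k), None)
        if match is not None:
            summary[keyword] = pairs[match]
        elif sentence_list:
            # Fallback described in module 8: use the most similar sentence.
            summary[keyword] = max(sentence_list, key=lambda s: similarity(keyword, s))
        else:
            summary[keyword] = None
    return summary


def char_overlap(keyword, sentence):
    """Placeholder similarity: fraction of keyword characters present in the sentence."""
    return sum(ch in sentence for ch in keyword) / max(len(keyword), 1)
```

Passing the similarity function as a parameter keeps the filling logic independent of the scoring method, so a BM25 or embedding-based scorer can be substituted without changing `fill_template`.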
The method is currently used in a product's information announcement-summary editing platform, where it serves as a reference for human editors reviewing announcement summaries. The product-side workflow is as follows:
First, the human editor enters the announcement-summary editing system and queries the announcements of a given type;
Second, the editor clicks an announcement title to edit that announcement, and the system recommends a summary for it. The editor may accept or reject the summary, and through this step-by-step feedback we gradually improve the method's extraction accuracy;
Finally, with further feedback from the editors, the process is optimized step by step until summary extraction is fully automated (see Fig. 3).
The processing flow of the method is shown in Fig. 2. For announcements that contain tables, we focus on extracting the data in the tables and its meaning; for announcements without tables, we focus on extracting the paragraphs most similar to the template.
The method has two main innovations.
First innovation: the core data of an announcement is obtained by extracting its tables. This captures in particular the data that investors care about most and substantially strengthens the data support of the summary. Our study of individual-stock announcements found that, for announcement types such as periodic reports and flash reports, the probability of containing a table is 99%. A periodic announcement is shown in Fig. 4. Almost all such announcements express their core data in tabular form, in similar ways, and are highly structured.
Second innovation: natural language processing techniques (paragraph segmentation, word segmentation, and the BM25 similarity algorithm) are used to find the sentences that best match the template. Each announcement category is configured with one template. The template contains only keywords and needs no expert rules to guide its configuration, which greatly reduces manual effort.
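As an illustration of the second innovation, the sketch below scores tokenized sentences against the template keywords with the standard Okapi BM25 formula and keeps the top N. The parameter values `k1=1.5` and `b=0.75` are common defaults, not values given in the patent, and the function names and the `tokenize` argument are assumptions; for Chinese text a word segmenter would take the place of the whitespace split used in the usage note.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # BM25 only rewards terms that actually occur
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(query_terms, sentences, tokenize, n=3):
    """Return up to n sentences with the highest non-zero BM25 score."""
    docs = [tokenize(s) for s in sentences]
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    return [s for score, s in ranked[:n] if score > 0]
```

For example, `top_n(["net", "profit"], sentences, str.split, n=3)` returns the candidate summary sentences for a template whose keywords are "net" and "profit". Sentences that share no term with the template score zero and are dropped.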
The BM25 text similarity algorithm used by the method only considers whether a keyword is present and ignores words with similar meaning. We therefore also compute the semantic similarity between each sentence and the summary template keywords using word vectors (word embeddings), which further improves extraction accuracy.
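The patent does not say how the word vectors are combined into a sentence-level score. One common option, shown here purely as an assumption, is to average the vectors of the tokens and compare the averages by cosine similarity. The `vectors` lookup table stands in for a pretrained embedding model, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; None when no token has a vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def semantic_similarity(sentence_tokens, keyword_tokens, vectors):
    """Cosine similarity between the averaged sentence and keyword vectors."""
    s = mean_vector(sentence_tokens, vectors)
    k = mean_vector(keyword_tokens, vectors)
    if s is None or k is None:
        return 0.0
    return cosine(s, k)
```

With such a score, a sentence that mentions "earnings" can still be matched to a template keyword "profit" even though BM25 would give it no credit, provided the two words have nearby vectors in the embedding model.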

Claims (5)

1. An intelligent summary extraction method for individual-stock announcements, based on the observation that such announcements consist mostly of tables and plain text and that announcements of the same type have similar table formats, wherein
a split-then-merge strategy is applied to the announcement: the tables and the plain text are first separated,
each is then processed independently, and the processed results are merged.
2. The method according to claim 1, wherein the processing of a table comprises extracting the value corresponding to each entry in the table and storing it in structured form.
3. The method according to claim 1, wherein the processing of the plain text comprises first dividing it into paragraphs, defining a summary template, computing the similarity between each paragraph and the summary template with a text similarity method, and taking the top N paragraphs as candidate summary paragraphs.
4. The method according to claims 2 and 3, characterized in that the corresponding data is looked up in the structured table according to the keywords of the summary template and filled into the template.
5. The method according to claim 2, wherein, for a keyword that is not found in the table, the most similar paragraph among the candidate summary paragraphs is selected.
CN201710646956.2A 2017-08-01 2017-08-01 A kind of personal share bulletin abstract intelligent extract method Pending CN110069622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710646956.2A CN110069622A (en) 2017-08-01 2017-08-01 A kind of personal share bulletin abstract intelligent extract method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710646956.2A CN110069622A (en) 2017-08-01 2017-08-01 A kind of personal share bulletin abstract intelligent extract method

Publications (1)

Publication Number Publication Date
CN110069622A true CN110069622A (en) 2019-07-30

Family

ID=67364540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710646956.2A Pending CN110069622A (en) 2017-08-01 2017-08-01 A kind of personal share bulletin abstract intelligent extract method

Country Status (1)

Country Link
CN (1) CN110069622A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334382A (en) * 2003-05-02 2004-11-25 Ricoh Co Ltd Structured document summarizing apparatus, program, and recording medium
CN104572849A (en) * 2014-12-17 2015-04-29 西安美林数据技术股份有限公司 Automatic standardized filing method based on text semantic mining
CN105389338A (en) * 2015-10-20 2016-03-09 北京用友政务软件有限公司 Analysis method of procurement bid wining data
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡可云等 (Hu Keyun et al.): "《数据挖掘理论与应用》" (Data Mining: Theory and Applications), 30 April 2008 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956041A (en) * 2019-11-27 2020-04-03 重庆邮电大学 Depth learning-based co-purchase recombination bulletin summarization method
CN113836941A (en) * 2021-09-27 2021-12-24 上海合合信息科技股份有限公司 Contract navigation method and device
CN113836941B (en) * 2021-09-27 2023-11-14 上海合合信息科技股份有限公司 Contract navigation method and device
CN113918708A (en) * 2021-12-15 2022-01-11 深圳市迪博企业风险管理技术有限公司 Abstract extraction method
CN117216245A (en) * 2023-11-09 2023-12-12 华南理工大学 Table abstract generation method based on deep learning
CN117216245B (en) * 2023-11-09 2024-01-26 华南理工大学 Table abstract generation method based on deep learning

Similar Documents

Publication Publication Date Title
CN110069622A (en) A kind of personal share bulletin abstract intelligent extract method
US7953601B2 (en) Method and apparatus for preparing a document to be read by text-to-speech reader
CN100423004C (en) Video search dispatching system based on content
CN102207948B (en) Method for generating incident statement sentence material base
CN104021198B (en) The relational database information search method and device indexed based on Ontology
US20080052262A1 (en) Method for personalized named entity recognition
CN104281702A (en) Power keyword segmentation based data retrieval method and device
CN103544266B (en) A kind of method and device for searching for suggestion word generation
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN103164471A (en) Recommendation method and system of video text labels
CN106354860A (en) Method for automatically labelling and pushing information resource based on label sets
CN101794308B (en) Method for extracting repeated strings facing meaningful string mining and device
CN104536830A (en) KNN text classification method based on MapReduce
Sarmento et al. Automatic extraction of quotes and topics from news feeds
Kumar et al. Paragraph summarization based on word frequency using NLP techniques
Kocayusufoglu et al. Riser: Learning better representations for richly structured emails
CN110309355A (en) Generation method, device, equipment and the storage medium of content tab
CN110020056A (en) A kind of stock information intelligent extract method
CN110414680A (en) Knowledge system of processing based on crowdsourcing mark
CN103593690A (en) User intelligent tagging system
AlFarasani et al. ATAM: arabic traffic analysis model for Twitter
CN113688233A (en) Text understanding method for semantic search of knowledge graph
CN102207947A (en) Direct speech material library generation method
Yanai et al. Debating artificial intelligence
Saeedi et al. Creating quranic question taxonomy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190730