CN110069622A - A kind of personal share bulletin abstract intelligent extract method - Google Patents
- Publication number
- CN110069622A (application CN201710646956.2A)
- Authority
- CN
- China
- Prior art keywords
- paragraph
- abstract
- bulletin
- template
- personal share
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/08—Insurance
Abstract
This method extracts the abstract of a personal share bulletin by combining table extraction with text-paragraph similarity matching. It follows a separate-then-merge strategy: the tables and the plain text of the bulletin are separated, the tables are converted into structured data, and the plain text is divided into paragraphs. Then, using a predefined abstract template (a keyword template), key indicator data are extracted from the structured tables and filled into the template. From the divided paragraphs, the top N paragraphs most similar to the template are selected as summary candidate paragraphs; if a keyword cannot be matched in the structured tables, the most similar paragraph among the candidates is used as a sub-abstract. The method improves the accuracy of the abstract and the efficiency of human editors, and continuous editor feedback raises the extraction accuracy until extraction can finally be automated.
Description
Technical field
The present invention relates to the field of computer software, and in particular to extracting summary information from the personal share bulletins published by listed companies.
Background art
Currently, personal share bulletins come in many types, each type focuses on different events, and each type contains a large number of bulletins. For investors, whose own interests are at stake, understanding the bulletins disclosed by listed companies in a timely manner is urgent. However, the bulletins of every type are numerous and lengthy. Investors only want to understand the core events and data in them (the abstract), rather than spend a great deal of time and energy downloading and browsing each bulletin.
The existing technical approach to this problem is event information extraction based on event frames, which relies on a full set of regular expressions (expert rules) to match specific events and data in a bulletin. However, this approach has a low ceiling: the expert rules are complicated to write, cannot cover all cases, are error-prone when matching, and match slowly. As a result, many companies still extract bulletin abstracts mainly by hand, which is inefficient.
By studying personal share bulletins, we observed that their content consists mainly of table data and text, that the table entries are presented in a highly standardized way, and that the table structures are very similar. We therefore propose an intelligent extraction method that specifically extracts the table data of a bulletin and the core paragraphs or sentences of the remaining text (with tables removed), and then organizes them into an abstract according to a given template (specified by the domain product team).
Summary of the invention
The purpose of this method be in order to solve the technological deficiency in current method, it is at high cost, the problem of low efficiency, design
A kind of method for the abstract that quickly, can effectively directly generate customization.
To solve the above problems, the method adopts the following technical solution:
First, the content of the personal share bulletin is converted into HTML format;
Then, the table tags in the HTML are identified, and the entries and data of each table are extracted by further splitting the row (tr), column (td), and similar tags inside the table tag;
Next, the remaining text of the HTML (with HTML tags removed) is extracted and split into sentences at punctuation marks, each sentence is segmented into keywords, and the BM25 algorithm is used to extract the several sentences most similar to the given template;
Finally, the extracted sentences and the table entry data are organized into an abstract.
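The table step above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes flat (non-nested) tables, uses only Python's standard `html.parser`, and the names `TableExtractor` and `table_to_pairs` are hypothetical. Treating the first cell of a row as the entry name is also an assumption about the table layout.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # all tables seen so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_pairs(rows):
    """Assumed layout: first cell is the entry name, remaining cells are its values."""
    return {row[0]: row[1:] for row in rows if len(row) >= 2}


parser = TableExtractor()
parser.feed("<table><tr><td>Revenue</td><td>100</td></tr></table>")
pairs = table_to_pairs(parser.tables[0])
```

Real bulletin tables often have merged cells or multi-row headers, which this sketch does not handle.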
By adopting the above approach, the method has the following beneficial effects:
(1) Extraction targets the tables in a bulletin, so detailed entry data can be extracted with high accuracy, high speed, and strong scalability;
(2) Sentences similar to the specified template are found with a text similarity algorithm, so no complicated rules need to be written;
(3) The specified template (specified by the domain product team) only needs to contain keywords and requires no expert rules.
Description of the drawings
Fig. 1 is the system framework diagram of this method.
Fig. 2 is the implementation flow chart of this method.
Fig. 3 shows the automated workflow of this method as implemented on the product side.
Fig. 4 is an example of a periodic bulletin.
Specific embodiment
The system architecture of this method is shown in Fig. 1. The functions of the modules are as follows:
1: Configure the crawl source URLs and crawl rules;
2: Crawl bulletins according to the configured source URLs and crawl rules;
3: Convert the crawled bulletins to HTML format using the open-source PDF2HTML library;
4: Clean redundant tags, styles, and the like from the HTML;
5: Extract the table tags from the HTML and store them in a list, tableList;
6: Extract the plain text of the HTML, split it at the configured punctuation marks, and store the result in a list, sentenceList;
7: Format each table: extract the entries and their data, and store them as <Key, Value> pairs;
8: Following the preset abstract keyword template, extract the data in tableList by keyword and fill the template. For keywords that cannot be extracted, find the most similar sentence in sentenceList as a replacement.
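Modules 6 and 8 can be sketched as below. The function names and the default delimiter set are illustrative assumptions, and the fallback here is a plain substring match used as a stand-in; the source instead selects the most similar sentence with a similarity algorithm.

```python
import re


def split_sentences(text, delimiters="。！？；!?;"):
    """Split plain text into sentences at the configured punctuation marks.

    The ASCII full stop is left out of the default set so that decimals
    such as 0.5 are not cut in half (an assumption, not from the source).
    """
    parts = re.split("[" + re.escape(delimiters) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def fill_template(keywords, table_pairs, sentences):
    """Fill each template keyword from the table, else fall back to a sentence."""
    summary = {}
    for kw in keywords:
        if kw in table_pairs:
            summary[kw] = table_pairs[kw]
        else:
            # Stand-in fallback: first sentence that mentions the keyword,
            # or None when no sentence does.
            summary[kw] = next((s for s in sentences if kw in s), None)
    return summary


sentence_list = split_sentences(
    "The board approved the plan; Dividend is 0.5 yuan per share!"
)
summary = fill_template(["Revenue", "Dividend"], {"Revenue": "100"}, sentence_list)
```

Keeping the table lookup first reflects the order described in module 8: the text is consulted only for keywords the tables cannot supply.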
The method is currently applied in the bulletin abstract editing platform of a premium information product, where it serves as an audit reference for the human editors of bulletin abstracts. The product-side embodiment is as follows:
First, a human editor enters the bulletin abstract editing system and queries the bulletins of a certain type;
Second, the editor clicks a bulletin title to edit that bulletin, and the system recommends an abstract for it. The editor can accept or reject the abstract, and this feedback is used to gradually improve the extraction accuracy of the method;
Finally, with further editor feedback the process is optimized step by step until abstract extraction is automated (see Fig. 3).
The processing flow of this method is shown in Fig. 2. For bulletins that contain tables, we focus on extracting the data in the tables and their meaning; for bulletins without tables, we focus on extracting the paragraphs most similar to the template.
This method has two main innovations.
First innovation: the core data of a bulletin, in particular the data investors care about most, are extracted from its tables, which substantially strengthens the data support of the abstract. Our study of personal share bulletins found that, for bulletin types such as periodic reports and flash reports, the probability of containing a table is 99%. As the periodic bulletin in Fig. 4 shows, almost all such bulletins express their core data in tables, in a similar manner and with a strongly structured character.
Second innovation: natural language processing techniques, namely paragraph splitting, word segmentation, and the BM25 similarity algorithm, are used to find the sentences that best match the template. One template is configured per bulletin category; the template contains only keywords and requires no expert rules to guide its configuration, which greatly reduces manual effort.
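The matching step can be illustrated with a small BM25 scorer. The source names BM25 but gives no parameters, so the values `k1=1.5` and `b=0.75` and the smoothed IDF variant used here are assumptions; the function names are hypothetical, and sentences are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(query_terms, docs, n=3):
    """Return the indices of the n highest-scoring documents, best first."""
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


sentences = [
    ["net", "profit", "rose"],
    ["board", "meeting", "notice"],
    ["profit", "distribution", "plan", "dividend"],
]
best = top_n(["profit", "dividend"], sentences, n=2)
```

Here the template keywords play the role of the query and each sentence (or paragraph) is a document, so a sentence sharing no keyword with the template scores zero.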
The BM25 text similarity algorithm used by this method considers only whether keywords are present and ignores semantically similar words. We therefore use word vectors (word embeddings) to compute the semantic similarity between a sentence and the abstract template keywords, which further improves extraction accuracy.
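One common way to compute such a semantic similarity is to average the word vectors of each side and take the cosine of the two averages. The source does not say how the vectors are combined, so the averaging is an assumption, and the two-dimensional embeddings below are toy values invented purely for illustration.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding; None if none do."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_similarity(sentence_tokens, template_keywords, embeddings):
    """Cosine between the mean sentence vector and the mean keyword vector."""
    s = mean_vector(sentence_tokens, embeddings)
    t = mean_vector(template_keywords, embeddings)
    if s is None or t is None:
        return 0.0
    return cosine(s, t)


# Toy embeddings (hypothetical values, not from any trained model).
toy = {"profit": [1.0, 0.0], "earnings": [0.9, 0.1], "meeting": [0.0, 1.0]}
close = semantic_similarity(["earnings"], ["profit"], toy)
far = semantic_similarity(["meeting"], ["profit"], toy)
```

With these toy values a sentence about "earnings" scores close to a "profit" keyword even though the two share no token, which is the case plain keyword matching misses.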
Claims (5)
1. An intelligent abstract extraction method for personal share bulletins, wherein observation of personal share bulletins shows that they mostly consist of tables and plain text and that bulletins of the same type have similar table formats, characterized in that a separate-then-merge strategy is applied to the bulletin: the tables and the plain text are first separated, each is processed independently, and the processed results are then merged.
2. The method according to claim 1, wherein the processing of a table comprises extracting the value corresponding to each entry in the table and storing it in structured form.
3. The method according to claim 1, wherein the processing of the plain text comprises first dividing it into paragraphs, defining an abstract template, computing the similarity between each paragraph and the abstract template with a text similarity method, and taking the top N paragraphs as summary candidate paragraphs.
4. The method according to claims 2 and 3, characterized in that the corresponding data are looked up in the structured table according to the abstract template keywords and filled into the template.
5. The method according to claim 2, wherein, for a keyword not found in the table, the most similar paragraph is selected from the summary candidate paragraphs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710646956.2A CN110069622A (en) | 2017-08-01 | 2017-08-01 | A kind of personal share bulletin abstract intelligent extract method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710646956.2A CN110069622A (en) | 2017-08-01 | 2017-08-01 | A kind of personal share bulletin abstract intelligent extract method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110069622A (en) | 2019-07-30 |
Family
ID=67364540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710646956.2A Pending CN110069622A (en) | 2017-08-01 | 2017-08-01 | A kind of personal share bulletin abstract intelligent extract method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069622A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004334382A (en) * | 2003-05-02 | 2004-11-25 | Ricoh Co Ltd | Structured document summarizing apparatus, program, and recording medium |
CN104572849A (en) * | 2014-12-17 | 2015-04-29 | 西安美林数据技术股份有限公司 | Automatic standardized filing method based on text semantic mining |
CN105389338A (en) * | 2015-10-20 | 2016-03-09 | 北京用友政务软件有限公司 | Analysis method of procurement bid wining data |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
-
2017
- 2017-08-01 CN CN201710646956.2A patent/CN110069622A/en active Pending
Non-Patent Citations (1)
Title |
---|
胡可云等: "《数据挖掘理论与应用》", 30 April 2008 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110956041A (en) * | 2019-11-27 | 2020-04-03 | 重庆邮电大学 | Depth learning-based co-purchase recombination bulletin summarization method |
CN113836941A (en) * | 2021-09-27 | 2021-12-24 | 上海合合信息科技股份有限公司 | Contract navigation method and device |
CN113836941B (en) * | 2021-09-27 | 2023-11-14 | 上海合合信息科技股份有限公司 | Contract navigation method and device |
CN113918708A (en) * | 2021-12-15 | 2022-01-11 | 深圳市迪博企业风险管理技术有限公司 | Abstract extraction method |
CN117216245A (en) * | 2023-11-09 | 2023-12-12 | 华南理工大学 | Table abstract generation method based on deep learning |
CN117216245B (en) * | 2023-11-09 | 2024-01-26 | 华南理工大学 | Table abstract generation method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110069622A (en) | A kind of personal share bulletin abstract intelligent extract method | |
US7953601B2 (en) | Method and apparatus for preparing a document to be read by text-to-speech reader | |
CN100423004C (en) | Video search dispatching system based on content | |
CN102207948B (en) | Method for generating incident statement sentence material base | |
CN104021198B (en) | The relational database information search method and device indexed based on Ontology | |
US20080052262A1 (en) | Method for personalized named entity recognition | |
CN104281702A (en) | Power keyword segmentation based data retrieval method and device | |
CN103544266B (en) | A kind of method and device for searching for suggestion word generation | |
CN106250513A (en) | A kind of event personalization sorting technique based on event modeling and system | |
CN103164471A (en) | Recommendation method and system of video text labels | |
CN106354860A (en) | Method for automatically labelling and pushing information resource based on label sets | |
CN101794308B (en) | Method for extracting repeated strings facing meaningful string mining and device | |
CN104536830A (en) | KNN text classification method based on MapReduce | |
Sarmento et al. | Automatic extraction of quotes and topics from news feeds | |
Kumar et al. | Paragraph summarization based on word frequency using NLP techniques | |
Kocayusufoglu et al. | Riser: Learning better representations for richly structured emails | |
CN110309355A (en) | Generation method, device, equipment and the storage medium of content tab | |
CN110020056A (en) | A kind of stock information intelligent extract method | |
CN110414680A (en) | Knowledge system of processing based on crowdsourcing mark | |
CN103593690A (en) | User intelligent tagging system | |
AlFarasani et al. | ATAM: arabic traffic analysis model for Twitter | |
CN113688233A (en) | Text understanding method for semantic search of knowledge graph | |
CN102207947A (en) | Direct speech material library generation method | |
Yanai et al. | Debating artificial intelligence | |
Saeedi et al. | Creating quranic question taxonomy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190730 |