CN109684473A

CN109684473A - Automatic bulletin generation method and system

Info

Publication number: CN109684473A
Application number: CN201811616494.0A
Authority: CN
Inventors: 李博; 李庆顺; 冯泽群; 李道远; 卢鑫; 刘赢
Original assignee: Danhan Intelligent Technology (shanghai) Co Ltd
Current assignee: Danhan Intelligent Technology (shanghai) Co Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-04-26

Abstract

The present invention relates to an automatic bulletin generation method and system. The method comprises: obtaining information data; performing feature extraction on the obtained information data; using the extracted features to match corpus entries in a knowledge corpus database and a logic corpus database; retrieving a preset dynamic template and adding the matched corpus entries to it to generate sentence-level bulletin templates; and organizing the generated bulletin templates to form a bulletin. Compared with the prior art, the application extracts features from structured data, dynamically generates templates from the knowledge corpus database and the logic corpus database, and fills the templates with data to generate the bulletin. This reduces the workload of writing bulletins manually, improves writing efficiency, and ensures the timeliness of the bulletin.

Description

Automatic bulletin generation method and system

Technical field

The present invention relates to the field of information processing, and in particular to an automatic bulletin generation method.

Background art

A bulletin, also called a "news brief", is a short report conveying information on a particular subject. In some fields, such as finance, the data involved are voluminous and complex and the timeliness requirements are very high, which makes the bulletin author's work cumbersome and stressful. The prior art therefore lacks a method and system capable of generating bulletins automatically.

Summary of the invention

The purpose of the present invention is to provide a method and system that can automatically generate a bulletin from information data, so as to solve the technical problem that bulletin authors face cumbersome work and heavy pressure.

To achieve the above object, the present invention provides an automatic bulletin generation method, comprising:

obtaining information data;

performing feature extraction on the obtained information data;

using the extracted features to match corpus entries in a knowledge corpus database and a logic corpus database, wherein the knowledge corpus database stores corpus entries of knowledge relevant to bulletin generation, and the logic corpus database stores corpus entries used for logical description;

retrieving a preset dynamic template and adding the matched corpus entries to the preset dynamic template to generate sentence-level bulletin templates;

organizing the generated bulletin templates and combining them with the obtained information data to form the bulletin.

In the above technical solution, further, performing feature extraction on the obtained information data specifically comprises:

using a natural language processing feature extraction algorithm to perform word-frequency analysis or semantic analysis on the information data, so as to extract features;

or extracting features using a preset feature extraction algorithm.

In the above technical solution, further, extracting features using the natural language processing feature extraction algorithm specifically comprises:

splitting the information data into several feature terms;

evaluating each feature term with an evaluation function to obtain the weight of each feature term;

the feature terms and their weights being the extracted features.

In the above technical solution, further, evaluating each feature term with an evaluation function to obtain the weight of each feature term means: the weight of each feature term is expressed as the product of term frequency (TF) and inverse document frequency (IDF);

wherein TF-IDF = TF × IDF.

In the above technical solution, further, using the extracted features to match the text content in the knowledge corpus database and the logic corpus database means:

vectorizing the text content using the feature terms and their weights;

matching the closest corpus entries from the knowledge corpus database and the logic corpus database according to the cosine similarity algorithm.

In the above technical solution, further, retrieving the preset dynamic template comprises:

tagging all dynamic templates, extracting the relevant dynamic templates using a specific query statement, and sorting the query results by matching degree in case a later replacement is needed.
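As an illustration of tagged template retrieval, the following minimal Python sketch ranks templates by how many query tags they share. The tag names, the `TEMPLATES` table and the overlap-count score are assumptions made for this example; the source does not specify the query statement or how the matching degree is measured.

```python
# Hypothetical tagged templates; the tag vocabulary is an assumption.
TEMPLATES = {
    "single_stock_summary": {"stock", "summary"},
    "morning_slump": {"stock", "morning", "fall"},
    "morning_rise": {"stock", "morning", "rise"},
}


def query_templates(query_tags: set[str]) -> list[str]:
    """Return names of templates sharing at least one tag, best match first."""
    scored = [(len(tags & query_tags), name) for name, tags in TEMPLATES.items()]
    # Sort by descending tag overlap, then by name so ties are deterministic.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked]
```

Keeping the full ranked list, rather than only the top hit, leaves lower-ranked templates available as replacements.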

In the above technical solution, further, the information data is preprocessed before feature extraction is performed on it; the preprocessing includes word segmentation and stop-word filtering.

The application also provides an automatic bulletin generation system, comprising:

an information data acquisition module, used to obtain information data;

a data processing module, used to analyze the obtained information data so as to extract the features of the information data;

a knowledge corpus database, used to store corpus entries of knowledge relevant to bulletin generation;

a logic corpus database, used to store corpus entries used for logical description;

a logic processing module, used to match the extracted features against the corpus entries in the knowledge corpus database and the logic corpus database;

a template generation module, used to retrieve a preset dynamic template and add the matched corpus entries to the preset dynamic template to generate sentence-level bulletin templates;

a bulletin generation module, used to organize the generated bulletin templates to form the bulletin.

In the above technical solution, further, the bulletin generation module comprises:

a coupling analysis submodule, used to analyze the degree of coupling between the contents of the bulletin templates;

a framework organization submodule, used to select a reasonable structure for presenting the bulletin templates, so as to form the bulletin.

In the above technical solution, further, the bulletin generation module also comprises:

a correlation analysis and backtracking prediction submodule, used to perform correlation analysis on the bulletin generated by the framework organization submodule, so as to ensure the accuracy of the bulletin;

a risk control submodule, used to perform a compliance check on the generated bulletin.

Compared with the prior art, the automatic bulletin generation system provided by the application extracts features from structured data, dynamically generates templates from the knowledge corpus database and the logic corpus database, and fills the templates with data to generate the bulletin. This reduces the workload of writing bulletins manually, improves writing efficiency, and ensures the timeliness of the bulletin.

Brief description of the drawings

To illustrate the specific embodiments of the application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the automatic bulletin generation system provided according to some embodiments of the application;

Fig. 2 is a schematic diagram of the bulletin generation module of the automatic bulletin generation system shown in Fig. 1;

Fig. 3 is an exemplary flowchart of the automatic bulletin generation method provided according to some embodiments of the application.

Detailed description of the embodiments

Various embodiments of the application, as defined by the claims and their equivalents, are described below with reference to the drawings to aid overall understanding. These embodiments include various specific details to assist understanding, but the details are to be regarded as merely exemplary. Accordingly, those skilled in the art will appreciate that variations and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. In addition, descriptions of well-known functions and structures are omitted for brevity and clarity.

The terms and phrases used in the following description and claims are not limited to their literal meanings, but are used merely to enable a clear and consistent understanding of the application. It should therefore be understood by those skilled in the art that the description of the various embodiments is provided for illustration only and not to limit the application as defined by the appended claims and their equivalents.

The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in some embodiments of the application. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the application without creative effort fall within the protection scope of the application.

It should be noted that the terms used in the embodiments of the application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "an", "said" and "the" used in the embodiments and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. Expressions such as "first" and "second" modify the respective elements without regard to order or importance and are used only to distinguish one element from another, without limiting the elements.

The following description takes the generation of financial news bulletins as a concrete example, but the automatic bulletin generation system and method are not limited to financial news and can be applied to bulletins of any field and form.

Fig. 1 is a schematic diagram of the automatic bulletin generation system provided according to some embodiments of the application.

As shown in Fig. 1, the automatic bulletin generation system provided by the application comprises an information data acquisition module 110, a data processing module 120, a knowledge corpus database 130, a logic corpus database 140, a logic processing module 150, a template generation module 160 and a bulletin generation module 170.

Information data acquisition module 110

The information data acquisition module 110 is used to obtain information data and is the data entry point of the bulletin generation system. In some embodiments, the data source may be structured data collected by the big-data laboratory of a user (for example, a bank). The information data acquisition module can read data from a database or data server according to the client's requirements and then hand the data to the bulletin generation system, which generates the bulletin automatically.

Data processing module 120

The data processing module 120 mainly performs feature analysis and feature extraction on the data. There are two kinds of feature analysis and extraction: one uses natural language processing to perform word-frequency analysis or semantic analysis on the data; the other is preset feature extraction.

For the first feature extraction method, different feature analysis methods can be customized according to the usage scenario of the data. The natural language processing feature extraction algorithm is introduced below at three levels:

1. Text vectorization

The classical vector space model (VSM) was successfully applied in the well-known SMART text retrieval system. The VSM concept is simple: it reduces the processing of text content to vector operations in a vector space and expresses semantic similarity as spatial similarity, which is intuitive and easy to understand. Once documents are represented as vectors in document space, the similarity between documents can be measured by computing the similarity between vectors. The most commonly used similarity measure in text processing is the cosine distance. A text mining system using the vector space model represents the target information with feature terms (T1, T2, ..., Tn) and their weights Wi; when matching information, these feature terms are used to evaluate the degree of correlation between an unknown text and the target sample.

The selection of feature terms and their weights is called feature extraction of the target sample; the quality of the feature extraction algorithm directly affects the performance of the system.

2. Statistics-based feature extraction algorithms

This type of algorithm constructs an evaluation function, evaluates each feature in the feature set and assigns it a score, so that each word obtains an evaluation value, also called a weight. All features are then sorted by weight, and a predetermined number of the best features are extracted as the resulting feature subset. Clearly, for this type of algorithm the main factor determining the quality of text feature extraction is the quality of the evaluation function.
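The score, sort and truncate procedure can be sketched in Python as follows. The function names and the relative-frequency evaluation function are assumptions chosen for illustration; any evaluation function with the same signature could be substituted.

```python
from collections import Counter
from typing import Callable


def select_features(
    terms: list[str],
    score: Callable[[str, Counter], float],
    k: int,
) -> list[tuple[str, float]]:
    """Score every distinct term, sort by weight, and keep the top k."""
    counts = Counter(terms)
    scored = [(term, score(term, counts)) for term in counts]
    # Highest weight first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def relative_frequency(term: str, counts: Counter) -> float:
    """A deliberately simple evaluation function: the share of all tokens."""
    return counts[term] / sum(counts.values())
```

Because the evaluation function is passed in as a parameter, it can be exchanged without touching the selection logic.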

3. TF-IDF

The most effective way to implement word weights is TF-IDF. TF is the term frequency, which measures how well a word describes the document content; IDF is the inverse document frequency, which measures how well the word distinguishes documents. The guiding idea of TF-IDF rests on a basic assumption: a word that occurs many times in one text will also occur many times in another text of the same class, and vice versa. Therefore, taking the term frequency TF as the measure for the feature space coordinate system captures the characteristics of similar texts. The ability of a word to distinguish between classes must also be considered: TF-IDF holds that the lower the document frequency of a word, the greater its ability to distinguish between classes, so the inverse document frequency IDF is introduced and the product of TF and IDF is used as the value measure of the feature space coordinate system. Computing the weight of a feature word with TF-IDF expresses the following: the more often a word occurs in the current document and the less often it occurs in other documents, the stronger its ability to distinguish the current document, and hence the larger its weight should be.

Specifically:

TF-IDF = TF × IDF.
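A minimal Python sketch of this weighting follows. The text gives only the product TF × IDF, so the concrete definitions used here are assumptions: TF is the term count divided by the document length, and IDF is a common smoothed variant, ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df the number of documents containing the term.

```python
import math
from collections import Counter


def tf_idf(doc: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Weight each term of `doc` by TF x IDF relative to `corpus`."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)  # assumed TF: relative frequency in the document
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # assumed smoothed IDF
        weights[term] = tf * idf
    return weights
```

With this definition, a term that appears in every document receives an IDF of 1, while rarer terms are weighted more heavily.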

Use case:

After TF-IDF feature extraction is performed on the text data, its TF-IDF features are matched for similarity against the data in the knowledge corpus database 130; provided the other constraints are all satisfied, the most similar knowledge data is the background knowledge most relevant to the text data.

The second method, preset feature extraction, matches feature words against the data according to specific logic, so as to associate the data with specific knowledge corpus and logic corpus entries. After feature extraction, both algorithms need to extract and save the important, meaningful data and provide it to the bulletin generation module 170.

In some embodiments, the system uses a decision tree model to classify and process data with preset features. A decision tree is a model built from attributes of the text; an attribute here is, for example, whether the text contains a particular keyword. The decision tree is traversed top-down: each node in the tree represents an attribute, each subtree of the node represents one definite value of that attribute, and the decision values are placed on the leaf nodes. To judge a target with the decision tree, the traversal starts from the root and descends into subtrees according to the known attributes until it reaches a leaf node, which gives the result of the judgment. The path from the root node to each leaf node is therefore a decision rule. The judgment result here means that the text data has been successfully matched to features and a generation template suitable for expressing the text data has been found, so that the article can be generated in combination with the subsequent modules.

Use case:

For example, the system uses preset features such as "limit-up" and "opening" to determine that the text data is news about a stock hitting limit-up, and then uses the relevant data template to integrate and generate the article.
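The keyword-based decision tree can be sketched as a small hand-built tree in Python. The `Node` class, the keywords and the leaf labels (template names) are hypothetical; the source does not state how the tree is built or which templates the leaves select.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """Internal node: tests whether a keyword is present in the text."""
    keyword: str
    present: Union["Node", str, None]  # subtree or leaf if keyword is found
    absent: Union["Node", str, None]   # subtree or leaf if keyword is missing


def classify(tokens: set[str], node: Union[Node, str, None]) -> Optional[str]:
    """Walk from the root to a leaf; the leaf names the template to use."""
    while isinstance(node, Node):
        node = node.present if node.keyword in tokens else node.absent
    return node


# Hypothetical tree: "limit-up" is tested first, then "opening".
TREE = Node(
    "limit-up",
    present=Node("opening", present="limit_up_at_open", absent="limit_up_intraday"),
    absent=None,  # no preset feature matched
)
```

Each root-to-leaf path corresponds to one decision rule, for example "contains limit-up and opening, so use the limit-up-at-open template".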

Both of the above feature extraction methods require the text data to be preprocessed. Preprocessing mainly consists of two parts: word segmentation and stop-word filtering.

For word segmentation, very mature segmentation dictionaries have been accumulated, which allow articles to be segmented into words accurately.

Stop words are function words that cannot reflect the topic. They not only fail to reflect the topic of the document but also interfere with keyword extraction, so they must be filtered out. Examples are auxiliary words such as "的", "地" and "得". In general, all function words and punctuation marks are treated as stop words.
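The two preprocessing steps can be sketched in Python as follows. The segmenter uses forward maximum matching against a dictionary, which is a technique assumed here for illustration because the source does not name one. The tiny dictionary and stop-word list are likewise hypothetical.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Drop function words and punctuation listed as stop words."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical resources for the example.
DICTIONARY = {"股票", "开盘", "涨停"}
STOP_WORDS = {"的", "地", "得", "。", "，"}
```

For instance, `remove_stop_words(segment("股票的开盘涨停。", DICTIONARY), STOP_WORDS)` keeps only the three content words.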

Logic processing module 150

The logic processing module 150 mainly uses a specific algorithm to match the information features against the text content in the knowledge corpus database 130 and the logic corpus database 140. For example, when an article background needs to be generated, similarity matching is performed in the knowledge corpus database 130 according to the features of the input information, and the most relevant background information found is returned to the logic processing module 150 (the search and matching process is described in detail below under knowledge corpus database 130). As another example, when the input corpus data requires describing the relationship between features, such as "limit-up", "limit-down", "surge" or "plunge", the logic processing module 150 matches the most relevant logic corpus entry in the logic corpus database 140 according to the features and needs of the input corpus.

By accessing the knowledge corpus database 130 and the logic corpus database 140, the logic processing module 150 obtains the relevant corpus data and hands it to the template generation module 160 for template generation.

Knowledge corpus data library 130

The knowledge corpus database 130 mainly stores background knowledge relevant to bulletin generation, and the most similar corpus data is found by comparison with the input corpus. Specifically, the input corpus is first segmented and word frequencies are computed; an article is then represented as a vector according to its word frequencies. In this way both the input article and all articles in the knowledge corpus database 130 can be represented as vectors.

The problem of "finding the most relevant background information" thus becomes the problem of finding the vector in the database most relevant to the input vector, for which the classic cosine similarity algorithm can be used. More specifically, the input corpus and a record in the knowledge corpus database 130 can be pictured as two line segments in a multidimensional space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle: an angle of 0 degrees means the directions are identical and the segments coincide; an angle of 90 degrees means they form a right angle and the directions are completely dissimilar; an angle of 180 degrees means the directions are exactly opposite. The similarity of the vectors can therefore be judged by the size of the angle: the smaller the angle, the more similar they are.

Taking two-dimensional space as an example, let a = (x₁, y₁) and b = (x₂, y₂) be two vectors in the coordinate system. By the law of cosines, the angle θ between them can be obtained from the following formula:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

This way of computing the cosine also holds for n-dimensional vectors. Suppose A and B are two n-dimensional vectors, A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn]; then the cosine of the angle θ between A and B is:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; this is "cosine similarity". The article in the knowledge corpus database 130 most similar to the input corpus can thus be found.
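The cosine-similarity lookup can be sketched in Python as follows. Vectors are stored as sparse dictionaries mapping terms to weights, a representation assumed here for convenience; the function names and sample data are illustrative only.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query: dict[str, float], corpus: dict[str, dict[str, float]]) -> str:
    """Return the key of the corpus entry whose vector is closest to `query`."""
    return max(corpus, key=lambda name: cosine(query, corpus[name]))
```

Because the cosine depends only on direction, a long article and a short one with the same relative term weights are treated as identical.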

Logic corpus data library 140

The logic corpus database 140 mainly stores corpus entries used for logical description and is an important source of descriptive phrases and clauses during bulletin generation. Although the logic processing module 150 matches corpus entries according to the features of the triggering event and calls them at random, descriptions that are synonymous but differ in attributes are labeled and subdivided in the database. For example, "posts a long bullish candlestick", "surges in a blowout" and "gaps up to limit-up" can all be used to describe a stock performing well, but "posts a long bullish candlestick" only describes the share price rising sharply and lacks the implication about trading volume carried by "surges in a blowout", while "gaps up to limit-up" further conveys a different form of rise. The corpus can also be continuously enriched in actual use, manually and by machine learning. Manual intervention means entering more descriptive material into the database by hand; machine learning means using algorithms to find, in a large number of news reports, descriptions of an event that are similar but not identical, which are then labeled and stored. The algorithms used here to extract corpus material are similar to the statistics-based feature extraction algorithms described above, such as TF-IDF and the decision tree model.
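The combination of attribute labels and random selection can be sketched in Python as follows. The attribute names and the `LOGIC_CORPUS` structure are assumptions for illustration; the source does not describe the labeling scheme.

```python
import random

# Hypothetical labeled entries: each phrase carries the attributes it implies.
LOGIC_CORPUS = [
    {"phrase": "posts a long bullish candlestick", "attributes": {"price_surge"}},
    {"phrase": "surges in a blowout", "attributes": {"price_surge", "high_volume"}},
    {"phrase": "gaps up to limit-up", "attributes": {"price_surge", "gap", "limit_up"}},
]


def pick_phrase(required: set[str], rng: random.Random) -> str:
    """Randomly pick one phrase whose labels cover all required attributes."""
    candidates = [
        entry["phrase"]
        for entry in LOGIC_CORPUS
        if required <= entry["attributes"]
    ]
    if not candidates:
        raise LookupError(f"no phrase labeled with {sorted(required)}")
    return rng.choice(candidates)
```

Filtering by labels before the random draw keeps the wording varied without attaching an implication, such as high trading volume, that the event does not support.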

Template generation module 160

The template generation module 160 is responsible for retrieving preset dynamic templates and adding the matched corpus entries to generate bulletin templates; the granularity of generation is the sentence level. This module can be divided into two sub-steps:

Dynamic template query: a dynamic template is defined here as a sentence with a fixed sentence pattern, comprising a template part (the known part of the sentence) and a dynamic filling part (the unknown information).

For example: "X1" stock code "X2" "X4" "X5" in "X3". This template can be used to describe summary market information for a single stock within a specific period.

Meanwhile in order to improve the dynamic and feasibility of module generation, it is dynamic that a template can be created multiple sons by succession Morphotype version, such as:

" X1 " stock code " X2 " slumps " X5 " in morning quotation

" X1 " stock code " X2 " rises slightly " X5 " in morning quotation

To improve the accuracy of template generation, this module tags all templates, uses specific query statements to extract the relevant dynamic templates efficiently, and sorts the query results by matching degree in case a later replacement is needed.

Corpus matching and addition: because a dynamic template may contain a large amount of unknown information, the corpus matching and addition step first matches the unknown information in the dynamic template by data type and description information, for example time type matching, degree adverb matching and numeric type matching. Once the unknown fields have been matched they can be filled in. During filling, the content also needs to be processed, including numerical precision adjustment (for example, 123,450,000 yuan is adjusted to 1.23 hundred million yuan), numerical regularization (for example, 5.0002% is adjusted to 5%) and numerical replacement (for example, "rose 10%" is adjusted to "limit-up").
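The slot filling and value adjustments can be sketched in Python with the standard library's `string.Template`. The slot names, the mapping of X1 to X5 onto them, the `derive` helper and the 10% daily limit threshold are all assumptions made for this illustration.

```python
from string import Template

# Assumed mapping of the slots: X1=name, X2=code, X3=period, X4=movement, X5=change.
BASE = Template('"$name" (stock code $code) $movement $change in $period')


def derive(base: Template, **fixed: str) -> Template:
    """Create a child template by fixing some slots of the parent."""
    return Template(base.safe_substitute(**fixed))


MORNING_SLUMP = derive(BASE, movement="slumped", period="morning trading")


def format_amount(yuan: float) -> str:
    """Precision adjustment: large amounts are given in units of 100 million."""
    if abs(yuan) >= 100_000_000:
        return f"{yuan / 100_000_000:.2f} hundred million yuan"
    return f"{yuan:,.0f} yuan"


def format_percent(value: float) -> str:
    """Regularization: round to two decimals and drop trailing zeros."""
    return f"{round(value, 2):g}%"


def describe_rise(percent: float, limit: float = 10.0) -> str:
    """Replacement: a rise reaching the assumed daily limit becomes 'limit-up'."""
    if round(percent, 2) >= limit:
        return "hit limit-up"
    return f"rose {format_percent(percent)}"
```

A child template is filled like any other, for example `MORNING_SLUMP.substitute(name="Acme", code="600000", change=format_percent(5.0002))`, where the company name and code are made up.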

Through the two steps above, the system produces templates that each contain the complete information for a particular detail. To ensure the accuracy of the final bulletin, all generated templates still need to be restructured and organized, which is the purpose of the bulletin generation module 170.

Bulletin generation module 170

The bulletin generation module 170 is responsible for integrating and organizing the templates generated in the previous step into the information finally presented, and for analyzing and verifying the generated information to ensure that it is qualified and compliant.

For the specific structure and functions of the bulletin generation module 170, refer to Fig. 2.

Fig. 2 is a schematic diagram of the bulletin generation module of the automatic bulletin generation system shown in Fig. 1.

As shown in Fig. 2, the bulletin generation module comprises a coupling analysis submodule 210, a framework organization submodule 220, a correlation analysis and backtracking prediction submodule 230 and a risk control submodule 240.

Coupling analysis submodule 210: this submodule mainly analyzes the degree of coupling between the template contents output by the previous module (the template generation module). The degree of coupling of templates is defined as whether dependencies exist between different template contents and how strong they are. Templates may be loosely coupled (low coupling) or tightly coupled (high coupling). Analyzing the degree of coupling of the generated templates helps organize the final information structure more efficiently and makes the generated information more readable.

Framework organization submodule 220: this submodule is responsible for selecting a reasonable structure for the final presentation, including organizing and associating the templates generated in the previous step, dividing sentences and paragraphs, and verifying continuity. Based on the output of the coupling analysis submodule 210, this submodule builds a framework organization graph in memory and uses an optimized graph traversal algorithm to generate a suitable organizational structure, which determines how the template results from the previous step are connected. Different templates also need to be joined by smooth conjunctions or separated, for example by paragraph breaks, to achieve the best overall organizational structure.
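One way to turn dependencies between templates into an ordering is a topological sort, sketched below in Python with the standard library's `graphlib`. The source speaks only of an "optimized graph traversal algorithm", so the choice of topological sorting, the dependency format and the template names are assumptions.

```python
from graphlib import TopologicalSorter


def order_templates(depends_on: dict[str, set[str]]) -> list[str]:
    """Order templates so that each one follows the templates it depends on.

    `depends_on` maps a template name to the names it is coupled to and
    must come after. Raises graphlib.CycleError on circular dependencies.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical coupling result: the detail sentence needs the headline,
# and the outlook sentence needs the detail sentence.
DEPENDENCIES = {
    "detail": {"headline"},
    "outlook": {"detail"},
}
```

Loosely coupled templates impose no ordering constraints on one another, which leaves room for other criteria, such as paragraph division, to decide their placement.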

Correlation analysis and backtracking prediction submodule 230: the information generated by the framework organization submodule must undergo correlation analysis, that is, the generated information must correspond to the relevant raw data as closely as possible, to ensure its accuracy. Backtracking analysis of the information also maps each sub-element of the information to the corresponding raw data and preserves the correspondence, order and causal relationships among the data. Information that passes this submodule is regarded as accurate and unbiased.
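A minimal Python sketch of tracing filled values back to raw data follows. It assumes that every filled slot records which raw record and field it came from; this bookkeeping, the function name and the sample data are hypothetical and not described in the source.

```python
def untraceable_slots(
    slots: dict[str, str],
    sources: dict[str, tuple[str, str]],
    raw_records: dict[str, dict[str, str]],
) -> list[str]:
    """Return the slots whose value cannot be traced to the raw data.

    `sources` maps a slot name to (record id, field name); a slot fails
    if it has no source or if the raw field differs from the slot value.
    """
    problems = []
    for slot, value in slots.items():
        record_id, field = sources.get(slot, (None, None))
        if raw_records.get(record_id, {}).get(field) != value:
            problems.append(slot)
    return problems


# Hypothetical raw record and filled slots for the example.
RAW = {"rec1": {"name": "Acme", "code": "600000"}}
SOURCES = {"name": ("rec1", "name"), "code": ("rec1", "code")}
```

An empty result means every filled value corresponds to a raw field; any listed slot would need to be corrected before the bulletin is released.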

Risk control submodule 240: given the ambiguity of Chinese text, some machine-generated text may contain sensitive information; in particular, after the generated templates have been reordered and conjunctions added or paragraphs divided, the text may become ambiguous. The risk control submodule 240 performs a further compliance check on the generated information to ensure that the bulletin finally generated by the system is accurate and compliant.
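The simplest form of such a check is a scan for listed sensitive terms, sketched below in Python. The source does not describe how compliance is checked, so the term-list approach, the function name and the example terms are assumptions.

```python
def compliance_issues(text: str, sensitive_terms: set[str]) -> list[str]:
    """Return the sensitive terms found in the text, in sorted order."""
    return sorted(term for term in sensitive_terms if term in text)


# Hypothetical sensitive terms for a financial bulletin.
SENSITIVE_TERMS = {"guaranteed return", "insider"}
```

Running the check on the assembled bulletin, rather than on individual templates, catches problems that only appear after sentences are reordered and joined.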

As shown in Fig. 3, the application also provides an automatic bulletin generation method, comprising:

S310: obtaining information data;

S320: preprocessing the information data;

S330: performing feature extraction on the information data;

In some embodiments, a natural language processing feature extraction algorithm is used to perform word-frequency statistical analysis or semantic analysis on the information data, so as to extract features. Specifically, this comprises:

splitting the information data into several feature terms;

the feature terms and their weights being the extracted features.

In some embodiments, features are extracted using a preset feature extraction algorithm.

S340: using the extracted features to match corpus entries in the knowledge corpus database and the logic corpus database;

S350: retrieving a preset dynamic template and adding the matched corpus entries to the preset dynamic template to generate sentence-level bulletin templates;

S360: organizing the generated bulletin templates to form the bulletin.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. An automatic bulletin generation method, characterized by comprising:

obtaining information data;

performing feature extraction on the obtained information data;

using the extracted features to match corpus entries in a knowledge corpus database and a logic corpus database, wherein the knowledge corpus database stores corpus entries of knowledge relevant to bulletin generation, and the logic corpus database stores corpus entries used for logical description;

retrieving a preset dynamic template and adding the matched corpus entries to the preset dynamic template to generate sentence-level bulletin templates;

organizing the generated bulletin templates to form the bulletin.

2. The automatic bulletin generation method according to claim 1, characterized in that performing feature extraction on the obtained information data specifically comprises:

using a natural language processing feature extraction algorithm to perform word-frequency analysis or semantic analysis on the information data, so as to extract features;

or extracting features using a preset feature extraction algorithm.

3. The automatic bulletin generation method according to claim 2, characterized in that extracting features using the natural language processing feature extraction algorithm specifically comprises:

splitting the information data into several feature terms;

the feature terms and their weights being the extracted features.

4. The automatic bulletin generation method according to claim 3, characterized in that evaluating each feature term with an evaluation function to obtain the weight of each feature term means: the weight of each feature term is expressed as the product of term frequency (TF) and inverse document frequency (IDF);

wherein TF-IDF = TF × IDF.

5. The automatic bulletin generation method according to claim 3 or 4, characterized in that using the extracted features to match the text content in the knowledge corpus database and the logic corpus database means:

vectorizing the text content using the feature terms and their weights;

matching the closest corpus entries from the knowledge corpus database and the logic corpus database according to the cosine similarity algorithm.

6. The automatic bulletin generation method according to claim 1, characterized in that retrieving the preset dynamic template comprises:

tagging all dynamic templates, extracting the relevant dynamic templates using a specific query statement, and sorting the query results by matching degree in case a later replacement is needed.

7. The automatic bulletin generation method according to claim 1, characterized in that the information data is preprocessed before feature extraction is performed on the obtained information data; the preprocessing includes word segmentation and stop-word filtering.

8. An automatic bulletin generation system, characterized by comprising:

a data processing module, used to analyze the obtained information data so as to extract the features of the information data;

a logic processing module, used to match the extracted features against the corpus entries in the knowledge corpus database and the logic corpus database;

a template generation module, used to retrieve a preset dynamic template and add the matched corpus entries to the preset dynamic template to generate sentence-level bulletin templates;

9. The automatic bulletin generation system according to claim 8, characterized in that the bulletin generation module comprises:

a framework organization submodule, used to select a reasonable structure for presenting the bulletin templates, so as to form the bulletin.

10. The automatic bulletin generation system according to claim 9, characterized in that the bulletin generation module further comprises:

a correlation analysis and backtracking prediction submodule, used to perform correlation analysis on the bulletin generated by the framework organization submodule, so as to ensure the accuracy of the bulletin;