CN108563620A

CN108563620A - Automatic text writing method and system

Info

Publication number: CN108563620A
Application number: CN201810331488.4A
Authority: CN
Inventors: 王娜; 胡滨洋
Original assignee: Shanghai Yi Cai Fan Tai Media Technology Co Ltd
Current assignee: Shanghai Yi Cai Fan Tai Media Technology Co Ltd
Priority date: 2018-04-13
Filing date: 2018-04-13
Publication date: 2018-09-21

Abstract

The present invention provides an automatic text writing method comprising an information collection process, a text parsing process, a content generation process, a product presentation process and a reader behavior analysis process. The reader behavior analysis process includes: obtaining reader behavior information from one or more internet platforms, and analyzing the reader behavior information, wherein the information collection process, the text parsing process and the content generation process are adjusted according to the reader behavior information.

Description

Automatic text writing method and system

Technical field

The present invention relates mainly to the computer field, and in particular to an automatic text writing method and system.

Background

With the rapid development of the internet, more and more first-hand information is published online. This information is rich in type, huge in volume and varied in form. Content creators, and media workers in particular, want to monitor and obtain this mass of information in time, to manage effectively the large amount of writing material collected through various channels, and to screen, process and turn that material into content efficiently and quickly.

Some automatic text writing methods have already been proposed, most of which are based on structured information. Structured information can be decomposed, after analysis, into multiple interrelated components with a definite hierarchy among them; its use and maintenance are managed through a database and follow certain operating specifications. By contrast, much of the content of unstructured information is unpredictable. Writing automatically from unstructured information is a huge challenge.

Summary of the invention

The technical problem to be solved by the present invention is to provide an automatic text writing method and system that facilitate automatic writing from unstructured information.

To solve the above technical problem, the present invention provides an automatic text writing method comprising the following steps: an information collection process, including: collecting information from the internet, performing format conversion on the information, performing noise cleaning on the information, and performing preliminary data screening on the information to obtain a text, wherein the text includes an unstructured part; a text parsing process, including: classifying the text, identifying named entities in the text according to the category of the text, extracting entity relationships between the named entities in the text according to the category of the text, and extracting, according to the category of the text, event morphemes that reflect the events in the text; a content generation process, including: pre-configuring one or more writing scenes, pre-configuring one or more logic templates, generating paragraphs from the named entities, the entity relationships and the event morphemes by applying the writing scenes and the logic templates, and identifying associated paragraphs and aggregating them into an article; a product presentation process, including: distributing the article to one or more internet platforms; and a reader behavior analysis process, including: obtaining reader behavior information from the one or more internet platforms and analyzing the reader behavior information, wherein the information collection process, the text parsing process and the content generation process are adjusted according to the reader behavior information.

In an embodiment of the present invention, the text parsing process further includes: extracting pre-refined keywords from the text.

In an embodiment of the present invention, the text parsing process further includes: extracting key information from the text.

In an embodiment of the present invention, the text parsing process further includes: extracting from the text the sentences used to form a document summary.

In an embodiment of the present invention, the text parsing process further includes: analyzing the sentiment polarity of the text.

In an embodiment of the present invention, the above method further includes a data analysis process, which includes: performing numerical computation and statistics on the data in the text, and monitoring whether abnormal values appear in the data in the text.

In an embodiment of the present invention, the step of classifying the text includes classifying according to pre-established categories, wherein the method of pre-establishing the categories includes: obtaining one or more set categories; assigning a first part of a plurality of training texts to the one or more categories; dividing a second part of the plurality of training texts, which cannot be assigned to the one or more categories, into one or more clusters; and receiving classification labels established for the one or more clusters.

In an embodiment of the present invention, each of the one or more logic templates includes one or more candidate sentences, and each candidate sentence includes one or more candidate named entities, morphemes and sentence patterns.

In an embodiment of the present invention, the step of generating an article from the named entities, the entity relationships and the event morphemes by applying the writing scenes and logic templates includes: automatically generating a paragraph from input parameters using a deep learning method, the paragraph being included in the logic template.

The present invention also proposes an automatic text writing system, including: a memory for storing instructions executable by a processor; and a processor for executing the instructions to implement the method described above.

The automatic text generation method and system of the embodiments of the present invention combine key modules such as information collection, data analysis, text editing, content publishing and data feedback, realize an integrated automatic writing process, and improve the efficiency and timeliness of content production.

Brief description of the drawings

Fig. 1 is a schematic diagram of an automatic text writing system according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of an automatic text writing system according to another embodiment of the present invention.

Fig. 3 is a schematic diagram of an automatic text writing method according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of information collection according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of text classification according to an embodiment of the present invention.

Fig. 6 is a cluster dendrogram of unknown-category announcements according to an embodiment of the present invention.

Fig. 7 is an example of a text category system according to an embodiment of the present invention.

Fig. 8 is an example of named entity recognition according to an embodiment of the present invention.

Fig. 9 is an example of keyword extraction results according to an embodiment of the present invention.

Fig. 10 is an example of event extraction results according to an embodiment of the present invention.

Fig. 11 is an example of target key information according to an embodiment of the present invention.

Detailed description

To make the above objects, features and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention, but the present invention can also be implemented in ways other than those described here, and is therefore not limited by the specific embodiments disclosed below.

As used in this application and the claims, unless the context clearly indicates otherwise, words such as "a", "an", "one" and/or "the" do not refer specifically to the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the clearly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also contain other steps or elements.

The embodiments of the present invention describe an automatic text writing method and system, which facilitate automatic writing from unstructured information.

Fig. 1 is a block diagram of an automatic text writing system according to an embodiment of the present invention. Referring to Fig. 1, the automatic text writing system 100, implemented here as a computer, may include an internal communication bus 101, a processor 102, a read-only memory (ROM) 103, a random access memory (RAM) 104, a communication port 105, an input/output component 106, a hard disk 107 and a user interface 108. The internal communication bus 101 enables data communication among the components of the computer 100. The processor 102 can make judgments and issue prompts. In some embodiments, the processor 102 may consist of one or more processors. The communication port 105 enables data communication between the computer 100 and other components (not shown). In some embodiments, the computer 100 can send and receive information and data over a network through the communication port 105. The input/output component 106 supports input/output data streams between the computer 100 and other components. The user interface 108 enables interaction and information exchange between the computer 100 and a user. The computer 100 may also include various forms of program storage units and data storage units, such as the hard disk 107, the read-only memory (ROM) 103 and the random access memory (RAM) 104, capable of storing the various data files used in computer processing and/or communication, as well as the program instructions possibly executed by the processor 102.

As an example, the input/output component 106 may include one or more of the following components: a mouse, a trackball, a keyboard, a touch component, a sound receiver, and the like.

For example, the automatic text writing method of this application may be implemented as a computer program, stored in the hard disk 107, and loaded into the processor 102 for execution, so as to implement the method of this application.

It is understood that the automatic text writing system of this application is not limited to implementation by a single computer, but may be implemented cooperatively by multiple networked computers. The networked computers may be connected and communicate through a local area network or a wide area network.

For example, the automatic text writing system of the embodiment of the present invention may be automatic text writing software stored in a hard disk.

When the automatic text writing system is implemented as software, it may also be stored in a computer-readable storage medium as an article of manufacture. For example, computer-readable storage media may include, but are not limited to, magnetic storage devices (for example, hard disks, floppy disks, magnetic strips), optical discs (for example, compact discs (CD), digital versatile discs (DVD)), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), cards, sticks, key drives). In addition, the various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, wireless channels and various other media (and/or storage media) capable of storing, containing and/or carrying code and/or instructions and/or data.

An example automatic text writing system of the embodiments of the present invention may also be implemented in the form of Software as a Service (SaaS). Fig. 2 is a block diagram of an automatic text writing system according to another embodiment of the present invention. Referring to Fig. 2, the system may include a client 210 and a server 220, connected by a network 210. The network 210 may be any of various known wired or wireless networks and is not elaborated here. The server 220 and the client 210 cooperate to implement the method described in the previous embodiment or its variations. The client 210 may be provided with a user interface, a communication port and an input component. The user interface can present various interfaces to the user, and the input component can receive the user's input. The server 220 may be configured with a communication port (not shown), a memory 221 and a processor (not shown); the memory 221 stores computer instructions, and the processor executes these instructions to implement the main part of the method. The results of the processor's processing are transmitted to the client 210 through the communication port and displayed on the user interface of the client 210.

It is understood that the automatic text writing system of this application is not limited to implementation by a single server, but may be implemented cooperatively by multiple networked servers. The networked servers may be connected and communicate through a local area network or a wide area network.

It should be understood that the embodiments described above are only illustrative. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode or any combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), processors, controllers, microcontrollers, microprocessors and/or other electronic units designed to perform the functions described herein, or a combination thereof.

Fig. 3 is a schematic diagram of an automatic text writing method according to an embodiment of the present invention. The method of this embodiment can be implemented in the automatic text writing system of Fig. 1, Fig. 2 or a variation thereof. Referring to Fig. 3, the automatic text writing method of this embodiment may include an information collection process 310, a text parsing process 320, a content generation process 330, a product presentation process 340 and a reader behavior analysis process 350. Optionally, the automatic text writing method may include a diagnosis process 360.

The information collection process 310 may include a step 311 of collecting information from the internet, a step 312 of performing format conversion on the information, a step 313 of performing noise cleaning on the information, and a step 314 of performing preliminary data screening on the information. This process yields a text, which may include an unstructured part. Of course, the text may also include a structured part and/or a semi-structured part. The processing described below is directed mainly at the unstructured part.

The text parsing process 320 may include: a step 321 of classifying the text, a step 322 of identifying named entities in the text according to the category of the text, a step 324 of extracting entity relationships between the named entities in the text according to the category of the text, and a step 325 of extracting, according to the category of the text, event morphemes that reflect the events in the text. Optionally, the text parsing process 320 may also include: a step 323 of extracting pre-refined keywords from the text, a step 326 of extracting from the text the sentences used to form a document summary, a step 327 of extracting key information from the text, and a step 328 of analyzing the sentiment polarity of the text.

The content generation process 330 may include: a paragraph generation step 331, a step 332 of identifying, selecting and combining associated paragraphs, and a manuscript generation step 333. Here, one or more writing scenes and one or more logic templates are configured in advance; in the paragraph generation step 331, paragraphs are generated from the named entities, entity relationships and event morphemes obtained by the text parsing process 320 by applying the writing scenes and logic templates.
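As an illustration only, the paragraph generation step can be sketched as filling a candidate sentence of a logic template with extracted values. The template library, the scene name "acquisition" and the slot names (`acquirer`, `target`, `time`, `price`) are hypothetical and do not come from the embodiment; the sketch picks the first candidate sentence whose slots are all available.

```python
from string import Template

# Hypothetical logic template library: each writing scene maps to an ordered
# list of candidate sentences with placeholders for extracted values.
LOGIC_TEMPLATES = {
    "acquisition": [
        Template("$acquirer announced on $time that it will acquire $target for $price."),
        Template("$acquirer plans to acquire $target."),
    ],
}


def generate_paragraph(scene: str, slots: dict) -> str:
    """Return the first candidate sentence whose placeholders can all be filled."""
    for template in LOGIC_TEMPLATES.get(scene, []):
        try:
            return template.substitute(slots)
        except KeyError:
            # A required slot was not extracted; try the next candidate sentence.
            continue
    return ""


print(generate_paragraph("acquisition", {"acquirer": "Company A", "target": "Company B"}))
```

Ordering the candidates from most to least detailed lets the sketch fall back to a shorter sentence when text parsing did not yield every value.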

The product presentation process 340 may include a step 341 of distributing the article to one or more internet platforms.

The reader behavior analysis process 350 may include a step 351 of obtaining reader behavior information from the one or more internet platforms and a step 352 of analyzing the reader behavior information. The information obtained by the reader behavior analysis process 350 can be passed to the diagnosis process 360.

The diagnosis process 360 may include adjusting the information collection process 310, the text parsing process 320 and the content generation process 330 according to the reader behavior information. The reader behavior information can also serve as a reference for distribution platforms and content creators when selecting articles and modifying content.

Optionally, a manual content review and correction step 334 may be added after the content generation process. The diagnosis process 360 can collect feedback from step 334 in step 361 and, through diagnosis, perform error statistics and error analysis for each of the processes 310-340, helping the system iterate and optimize continuously according to actual conditions and thereby further improving system efficiency.

In this embodiment, the data obtained by the information collection process 310 can be put into the original content library 32 of the database 30. The text parsing process 320 can use the domain knowledge base 33 of the database 30, and the data it obtains can be put into the original content library 31 of the database 30. The content generation process 330 can use the writing scene and logic template library 34 of the database 30, and the data it obtains can be put into the machine manuscript library 35 of the database 30.

The following detailed description of each process.

Information collection

Distributed crawler

Fig. 4 is a schematic diagram of an information collection system according to an embodiment of the present invention. Referring to Fig. 4, a distributed crawler system can be provided to obtain massive amounts of data from the internet, including all kinds of websites and social platforms. The architecture of the distributed crawler system guarantees dynamic scalability in two dimensions: performance and data sources. To this end, the crawler architecture is decoupled into two major modules: a platform module (including a central scheduler 41 and plug-in containers 42) and plug-in modules 43. The platform module guarantees the dynamic performance scaling of the system; it mainly provides the crawler central scheduler 41 and abstracts hardware resources, and contains no business logic. The central scheduler 41 can assign crawler tasks 44, in a load-balanced manner, to the crawler servers 42 for execution. Adding crawler servers immediately scales the throughput of the crawler system linearly. The plug-in modules 43 guarantee the dynamic scalability of the crawler system at the data source level. Because each website has different crawling logic and a different data structure, which cannot be unified, the crawling logic for each data source is encapsulated in its own plug-in module 43. Plug-in modules 43 can be hot-plugged and executed on the crawler platform. Once the platform is deployed, developing new plug-ins is all that is needed to scale the crawler's data sources linearly.
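The separation between platform and plug-ins can be sketched roughly as follows. This is a minimal single-process illustration under stated assumptions, not the embodiment's implementation: the registry `PLUGINS`, the decorator `register_plugin`, the class `CentralScheduler` and the "fewest pending tasks" balancing policy are all hypothetical, and the plug-in performs no real network access.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Plug-in registry: one crawl function per data source (hypothetical names).
PLUGINS: Dict[str, Callable[[str], str]] = {}


def register_plugin(source: str):
    """Decorator that adds a source-specific crawl function to the registry."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[source] = func
        return func
    return wrap


@register_plugin("news_site")
def crawl_news(url: str) -> str:
    return f"news page fetched from {url}"  # placeholder, no real network I/O


class CentralScheduler:
    """Assigns each task to the crawler server with the fewest assigned tasks."""

    def __init__(self, servers: List[str]):
        # Min-heap of (assigned task count, server name).
        self._heap: List[Tuple[int, str]] = [(0, s) for s in servers]
        heapq.heapify(self._heap)

    def assign(self, source: str, url: str) -> Tuple[str, str]:
        load, server = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (load + 1, server))
        # The platform knows nothing about the site; the plug-in does the work.
        return server, PLUGINS[source](url)
```

In this sketch, supporting a new data source means registering one more function, and adding capacity means passing one more server name to the scheduler; neither change touches the other module.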

Here, the information sources collected may be authoritative news websites, information announcement channels, social media, structured data interfaces, and so on. The collected content is put into the original content library.

Format processing and noise cleaning

Because the collected information sources are of many kinds, and the formats used by unstructured data in particular differ widely, the collected data needs preliminary processing. The format processing techniques in this embodiment mainly include a PDF format conversion technique and an HTML cleaning technique.

The PDF format conversion technique is mainly used to convert acquired PDF files into HTML files. For example, in the finance field, the announcements issued by listed companies are in PDF format and contain important information in various forms such as text and charts. To extract and process the data in these announcements, this embodiment first converts the PDF files into HTML format after data collection. The technique is characterized by high accuracy and can preserve the chart information in the original document without causing data omission or loss.

The HTML cleaning technique is mainly used to clean the collected web page data, retaining only the body text of the page and screening out "noise" such as navigation, advertisements and videos. For most web pages, a general web page cleaning technique can be used to obtain the body text. For some web pages with more complex structures, targeted cleaning rules can be preset and applied to ensure the coverage and accuracy of web page cleaning.
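A general-purpose cleaning pass of the kind described can be sketched with Python's standard `html.parser` module. The list of "noise" tags is an assumption chosen for illustration; a real cleaner would need site-specific rules for advertisements and complex layouts, as the text notes.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects visible text while skipping tags that usually hold noise."""

    # Assumed noise tags; not taken from the embodiment.
    NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "video"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the page text with noise elements removed, one chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```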

Preliminary data screening

Preliminary data screening means that, during data collection, information such as the web page's domain name, column, title and publication time is considered together to filter out non-target data, retaining only target data from the designated information sources. Data collection mainly obtains data from major websites, each of which may contain external links in various forms. During collection, while target data is obtained, there is a certain probability that these externally linked pages are collected as well; the purpose of preliminary data screening is to filter such data out.

Text parsing

Text classification

After the text data is obtained, the text first needs to be assigned to a specific category; subsequent information extraction and parsing are then carried out according to the characteristics and requirements of each category of text. Text classification is thus a very basic and crucial step, and its quality directly affects the subsequent steps.

When the collected text data involves multiple fields, with diverse sources, numerous types, differing formats and genres, and complex content, the text classification task is all the more challenging. In an embodiment, a classification system can be established and automatic classification realized by combining human experience with machine learning methods.

Take "listed company announcements" as an example. There are currently about 1000 listed companies, which issue thousands of announcements every day; at peak times the volume reaches 4000 per day, and the average is 350,000 per year, with a wide variety of types and highly complex content. Based on the experience of finance editors and reporters, announcements can be divided into more than 90 categories. However, sampling and manual annotation show that only about 40% of announcements can be accurately assigned to these 90-plus categories, while the remaining roughly 60% cannot be matched to a specific category.

According to an embodiment of the present invention, a text classification method as shown in Fig. 5 is proposed. According to this method, one or more set categories are first obtained. Then, in step 52, when a first part of the multiple training text documents 51 is judged to meet existing classification criteria, this first part is assigned to the one or more set categories. When it is judged in step 53 that a second part of the training text documents 51 cannot be assigned to the one or more set categories, then in step 54 this second part is divided into one or more clusters. The form of the clusters is shown at 55. In step 56 it can be judged whether a cluster is important; the judgment can be made manually or by machine. In step 57, classification labels established for the important clusters are received and, together with the existing set categories, form a new classification system 58.
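The first half of this workflow (assign what fits the set categories, cluster the rest) can be sketched as follows. Everything here is a simplifying assumption: the set categories are matched by a keyword lookup (`KNOWN_CATEGORIES`), documents are compared by Jaccard distance on lowercase word sets, and clustering is single-linkage agglomerative with a distance threshold. Labeling the resulting clusters remains a manual step, as in steps 56 and 57.

```python
KNOWN_CATEGORIES = {"dividend": "dividend", "acquisition": "acquisition"}  # keyword -> category


def split_by_known_categories(docs):
    """Assign documents to set categories; return (classified, unknown)."""
    classified, unknown = {}, []
    for doc in docs:
        for keyword, category in KNOWN_CATEGORIES.items():
            if keyword in doc.lower():
                classified.setdefault(category, []).append(doc)
                break
        else:
            unknown.append(doc)
    return classified, unknown


def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster_unknown(docs, threshold):
    """Single-linkage agglomerative clustering; returns lists of document indices."""
    tokens = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(jaccard_distance(tokens[i], tokens[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # the closest pair is too far apart; stop merging
        _, x, y = best
        clusters[x] += clusters.pop(y)
    return clusters
```

For example, with the four titles "Dividend plan for 2017", "Notice of shareholder meeting", "Shareholder meeting notice update" and "Change of auditor", the first is assigned to "dividend", and clustering the remaining three with a threshold of 0.5 groups the two meeting notices together and leaves the auditor notice on its own.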

To screen out important announcements comprehensively and effectively and to establish an effective classification standard, in an embodiment the above "unknown category announcements" are analyzed by hierarchical clustering, as shown in Fig. 6. Analysis of the clustering results shows that k = 22 clusters is a relatively reasonable value. These 22 clusters are manually sampled and the category of each is summarized.

After the announcements in the 22 categories are manually sampled, summarized and judged, the categories of high importance are added to the announcement classification system as separate categories, and the categories of low importance are uniformly classified as "others", which determines the final classification standard for announcements. Fig. 7 is an example of a text category system according to an embodiment of the present invention.

Based on the classification system established above, and in combination with the domain knowledge base 33 built from domain knowledge (see Fig. 3), a model is trained and optimized using a combination of machine learning and rules, with the source, original web page tags, title, text content, key feature words and so on as features, to realize automatic classification of text.

Named entity recognition

The purpose of the named entity recognition algorithm is to identify person names, place names, organization names and the like in a sentence and handle them as entities or associated vocabulary. Entity information such as person names, place names and organization names appears frequently in all kinds of text, and such information often plays an important role in text recognition, classification and information extraction.

In an embodiment, based on a hidden Markov model (HMM) and combined with a large collection of entity information such as person names, place names, government agencies and listed company names from fields such as politics, economics, technology and culture, a named entity recognition model with high accuracy can be obtained, realizing the recognition and labeling of entity information in text. Fig. 8 is an example of named entity recognition according to an embodiment of the present invention. As shown in Fig. 8, in this example "Wang Shi" is recognized as a person name, "Vanke Co., Ltd" as an organization, and "Shenzhen" and "Liuzhou" as place names.
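How an HMM assigns entity tags can be illustrated with a toy Viterbi decoder. The tag set, the start, transition and emission probabilities and the smoothing value are hand-set assumptions for this example only; the embodiment's model is estimated from a large entity collection, which this sketch does not attempt to reproduce.

```python
import math

STATES = ["O", "PER", "ORG", "LOC"]
START = {s: 0.25 for s in STATES}
# Toy transitions: staying in the same tag is twice as likely as switching.
TRANS = {s: {t: (0.4 if s == t else 0.2) for t in STATES} for s in STATES}
# Toy emission table standing in for probabilities learned from entity lists.
EMIT = {
    "O": {"founded": 0.45, "in": 0.45},
    "PER": {"wang": 0.5, "shi": 0.4},
    "ORG": {"vanke": 0.9},
    "LOC": {"shenzhen": 0.9},
}
UNSEEN = 1e-4  # smoothing probability for words a state never emitted


def viterbi(tokens):
    """Return the most probable tag sequence for a non-empty token list."""
    def emit(state, word):
        return math.log(EMIT[state].get(word.lower(), UNSEEN))

    scores = [{s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}]
    back = []
    for word in tokens[1:]:
        prev, row, pointers = scores[-1], {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: prev[s] + math.log(TRANS[s][t]))
            row[t] = prev[best] + math.log(TRANS[best][t]) + emit(t, word)
            pointers[t] = best
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    tag = max(STATES, key=lambda s: scores[-1][s])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


print(viterbi("Wang Shi founded Vanke in Shenzhen".split()))
# ['PER', 'PER', 'O', 'ORG', 'O', 'LOC']
```

Log probabilities are used so that long sentences do not underflow to zero.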

Keyword extraction

The goal of keyword extraction is to select several representative words from an article to indicate the main idea of the full text. When processing massive amounts of text, summarizing the keywords of each article not only helps users understand the text quickly, but also greatly improves the efficiency of retrieval, management and reading, and is of great significance for tasks such as article topic extraction, hot word discovery, automatic document tagging and information indexing.

In an embodiment of the present invention, a high-accuracy keyword extraction model is trained on more than 120,000 texts carrying manually assigned keywords, using part of speech, position of occurrence, TF-IDF, TextRank score, Word2Vec vectors and so on as features. In actual tests, the coverage of the model is generally higher than that of currently public keyword extraction tools. Fig. 9 is an example of keyword extraction results according to an embodiment of the present invention. As shown in Fig. 9, for a text about Evergrande Life's short-term speculation in the A-share market drawing regulatory attention, the keyword extraction algorithm recommends four keywords, "insurance capital", "regulation", "Evergrande" and "speculation", which indicate the important information of the article well.
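Of the features listed, TF-IDF is the simplest to show in isolation. The sketch below ranks the words of one document by a smoothed TF-IDF score; it covers only that single feature, not the trained model, and the function name, the smoothing formula and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=4):
    """Rank the words of one tokenized document by smoothed TF-IDF."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for other in corpus if word in other)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0      # smoothed IDF
        scores[word] = count / len(doc_tokens) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


corpus = [
    ["market", "rose", "today"],
    ["regulator", "warned", "insurer", "regulator"],
    ["insurer", "sold", "shares"],
]
print(tfidf_keywords(corpus[1], corpus, top_k=2))  # ['regulator', 'warned']
```

"insurer" ranks below "warned" here because it also appears in another document, which lowers its inverse document frequency.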

Entity relation extraction

Entity relation extraction means that, after the named entities in the text have been recognized, the type of relationship between these entities is further determined, where the entity relationship types are predefined.

For example, in the text "... [Vanke|ORG] founder [Wang Shi|PER] ...", "Vanke" and "Wang Shi" are named entities, and the two in turn form an affiliation relationship (Org-Aff.Founder). Entity relation extraction provides important technical support for numerous natural language processing tasks such as mass text processing and retrieval, automatic knowledge base construction, text association, machine translation and document summarization.

In an embodiment of the present invention, using semi-supervised learning with a limited set of high-quality annotated documents as seeds, a high-precision conditional random field (CRF) model is trained with entity text, type, context, syntax tree distance, special sentence patterns and so on as features, and is used to extract entity relations from text. During iterative model training, the output results are checked by rules to further improve accuracy, and high-confidence results are continually fed back into the model for training.
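The CRF model itself is beyond a short example, but the kind of rule that can check or seed its output is easy to sketch. The code below parses the bracketed annotation format from the example above and applies one hand-written pattern; the `TRIGGERS` table and the function name are hypothetical, and this rule lookup is a stand-in for, not an implementation of, the trained model.

```python
import re

# Matches annotations such as "[Vanke|ORG]" (spaces around "|" are tolerated).
ENTITY = re.compile(r"\[\s*([^|\]]+?)\s*\|\s*([A-Z]+)\s*\]")

# Hypothetical rule table: (left type, words in between, right type) -> relation.
TRIGGERS = {("ORG", "founder", "PER"): "Org-Aff.Founder"}


def extract_relations(text):
    """Return (left entity, relation, right entity) for adjacent entity pairs."""
    entities = list(ENTITY.finditer(text))
    relations = []
    for left, right in zip(entities, entities[1:]):
        between = text[left.end():right.start()].strip().lower()
        label = TRIGGERS.get((left.group(2), between, right.group(2)))
        if label:
            relations.append((left.group(1), label, right.group(1)))
    return relations


print(extract_relations("[Vanke|ORG] founder [Wang Shi|PER] spoke in [Shenzhen|LOC]"))
# [('Vanke', 'Org-Aff.Founder', 'Wang Shi')]
```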

Event extraction

The role of event extraction is to redefine events expressed in natural language in the text in a standard structured form. In news text especially, correctly identifying and extracting the events that occur in the text is vital for understanding the text content from a semantic perspective and for deeper text mining.

In an embodiment of the present invention, event extraction includes many steps, such as preprocessing (word segmentation, sentence splitting, dependency parsing, entity recognition, relation recognition, etc.), trigger word recognition, candidate event sentence recognition, event sentence judgment, event type judgment and event element recognition, with methods based on syntax, pattern matching and machine learning used in different steps. Fig. 10 is an example of event extraction results according to an embodiment of the present invention. From the text shown in Fig. 10, this embodiment identifies an "acquisition" event together with the corresponding acquirer, acquisition time, acquisition target and transaction price.

Document summary extraction

A high-quality document summary can greatly improve the efficiency of reading text, letting users or readers quickly understand the content and judge the usefulness and research value of the text. The summary itself can also be used as high-quality material in the content creation process.

In an embodiment of the present invention, document summarization methods based on TextRank and on machine learning are implemented, and both methods achieve good results. The TextRank-based method treats the whole text as a network and each sentence in the text as a node in the network. After the correlation between sentences is calculated from features such as word-sense distance and semantic distance, the importance score of each sentence can be calculated following the PageRank method; the higher a sentence's score, the more likely it is to appear in the summary. The machine learning-based method uses part of speech, word vectors, named entities, relevance to the title and so on as features to judge whether a sentence should appear in the summary.
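The TextRank idea can be sketched in a few lines. As a simplification, sentence similarity here is just the number of shared lowercase words, rather than the word-sense and semantic distances the embodiment mentions; the damping factor 0.85 and the fixed iteration count are conventional PageRank choices, not values from the source.

```python
def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank over a word-overlap similarity graph."""
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight between two sentences: number of shared words.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = float(len(tokens[i] & tokens[j]))
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other keeps only the base score of 1 - damping, so it is the first to be dropped from the summary.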

Key information extraction

Information extraction is one of the important steps in manuscript generation. Only by extracting key information in a targeted and accurate way, converting unstructured data into structured data, can the obtained information be used for subsequent analysis, such as information association and content summarization, and thus to generate manuscripts with accurate data, reliable content, and rich information. According to an embodiment of the invention, a method combining rules with machine learning is used to extract key information from the collected text.

Taking a "listing announcement" as an example, key data such as named entities (place names, company names, personal names, etc.), time information, stock information (stock code, share capital, issue price, etc.), and company information (registered capital, main business, etc.) need to be extracted in a targeted way (see Figure 11). During extraction, data with a fixed format and a high accuracy requirement, such as stock code, share capital, and registered capital, are extracted by DT original texts king mainly with rule-based methods, to ensure the correctness of the data. Data with varied wording and no fixed format, such as named entities and time information, are identified and extracted mainly with machine learning methods.
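The rule-based side of this extraction can be illustrated with regular expressions. The sketch below is hypothetical: the field labels, the English announcement wording, and the sample values are invented for illustration, and real announcements would need patterns written for their actual layout.

```python
import re

# Hypothetical rules for fixed-format fields; labels are assumptions.
RULES = {
    "stock_code": re.compile(r"Stock code:\s*(\d{6})"),
    "share_capital": re.compile(r"Total share capital:\s*([\d,]+)\s*shares"),
    "registered_capital": re.compile(r"Registered capital:\s*RMB\s*([\d,]+)"),
    "issue_price": re.compile(r"Issue price:\s*RMB\s*(\d+(?:\.\d+)?)"),
}


def extract_fixed_fields(text):
    """Apply each rule once and return the fields that matched."""
    result = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            # Drop thousands separators so values are easy to convert later.
            result[name] = match.group(1).replace(",", "")
    return result
```

Fields that no rule matches are simply absent from the result, which lets a later machine-learning step handle the free-form items separately.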

Sentiment polarity analysis

Sentiment polarity analysis is applied to judge the emotional tone of sentences, paragraphs, and whole documents. Performing sentiment analysis on the sentences and documents helps to summarize the viewpoints contained in the text and the author's subjective attitude; the results can be used in scenarios such as text retrieval, relevance calculation, content aggregation, and content recommendation.

According to an embodiment of the invention, sentiment analysis of sentences and documents is treated as a multi-class classification problem: the input text is classified as negative (derogatory), positive (commendatory), or neutral. During sentiment classification, the evaluation words and combined evaluation units appearing in the text, word position features, n-gram word features, part-of-speech features, the sentiment categories of the preceding and following sentences, and so on can all be considered and used to train a machine learning model. Evaluation words and combined evaluation units can also be used to build rules, which are combined with the result of the machine learning model to obtain the final sentiment classification result.
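One possible way to combine rules with a model's output is sketched below. The text does not specify the combination scheme, so everything here is an assumption: the tiny word lists, the treatment of a negator plus an evaluation word as a "combined evaluation unit", the additive weighting, and the `model_probs` dictionary standing in for the output of a trained classifier.

```python
# Illustrative lexicons only; a real system would use much larger resources.
POSITIVE = {"strong", "growth", "profit", "record"}
NEGATIVE = {"loss", "decline", "weak", "default"}
NEGATORS = {"not", "no", "never"}


def rule_score(tokens):
    """+1 per positive word, -1 per negative word; a preceding negator flips the sign."""
    score = 0
    for i, token in enumerate(tokens):
        polarity = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


def classify(tokens, model_probs, rule_weight=0.2):
    """Shift the model's class probabilities by the rule score and pick the best class.

    model_probs: dict with keys "negative", "neutral", "positive", assumed to
    come from a separately trained classifier.
    """
    adjusted = dict(model_probs)
    score = rule_score(tokens)
    if score > 0:
        adjusted["positive"] += rule_weight * score
    elif score < 0:
        adjusted["negative"] += rule_weight * -score
    return max(adjusted, key=adjusted.get)
```

When no rule fires, the model's own prediction is returned unchanged.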

Data analysis technologies

Common numerical calculation and statistical methods

After structured data is obtained through data acquisition or text parsing, the data must still be analyzed and calculated before valuable results can be obtained. For example, the finance field often needs to calculate year-on-year and period-on-period rises and falls, the sports field often needs to calculate the usage rate and success rate of different techniques, and the e-commerce field often needs to calculate top search terms, hot-selling items, sales trends, and so on.
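As a small worked example of the financial calculations mentioned above, the sketch below computes year-on-year and month-on-month percentage changes from a monthly series. The `"YYYY-MM"` key format and the function names are assumptions made for illustration.

```python
def pct_change(current, base):
    """Percentage change of current relative to base."""
    if base == 0:
        raise ValueError("base value is zero; percentage change is undefined")
    return (current - base) / abs(base) * 100.0


def same_month_last_year(period):
    year, month = period.split("-")
    return f"{int(year) - 1}-{month}"


def previous_month(period):
    year, month = map(int, period.split("-"))
    return f"{year - 1}-12" if month == 1 else f"{year}-{month - 1:02d}"


def growth_rates(series, period):
    """series maps 'YYYY-MM' to a value; returns both growth rates in percent."""
    current = series[period]
    return {
        "year_on_year": pct_change(current, series[same_month_last_year(period)]),
        "month_on_month": pct_change(current, series[previous_month(period)]),
    }
```

For a series with 100 in 2017-03, 120 in 2018-02, and 126 in 2018-03, the year-on-year change for 2018-03 is 26% and the month-on-month change is 5%.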

In one embodiment of the invention, efficient, targeted data analysis can be performed according to the needs of different scenarios. Preferably, data is analyzed in real time as it is obtained, and the analysis results are used for content generation in real time, ensuring high timeliness of the whole workflow.

Anomaly detection

In special scenarios, when data is monitored in real time, abnormal values need to be found and corresponding content generated in time for reporting, for example: abnormal commodity transaction volume, a surge or plunge in stock market data, or abnormal seismic monitoring data.

In one embodiment of the invention, normal values are modeled according to domain-specific knowledge, and outlier detection is implemented for different scenarios using a variety of methods such as hypothesis testing, pattern discovery, and machine learning. The results of different methods can be compared and cross-checked against each other, improving the credibility of the results and reducing the false alarm rate. Uninterrupted monitoring of the data greatly reduces the cost of manual monitoring and avoids missed or omitted reports caused by human factors.
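The idea of cross-checking two detectors to reduce false alarms can be sketched as follows. The choice of detectors (a z-score test and a median-absolute-deviation test) and the thresholds are assumptions for this example; the text names only the general families of methods.

```python
from statistics import mean, median, stdev


def zscore_outlier(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def mad_outlier(history, value, threshold=3.5):
    """Flag value using the modified z-score based on the median absolute deviation."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    return 0.6745 * abs(value - med) / mad > threshold


def is_anomaly(history, value):
    """Report an anomaly only when both detectors agree."""
    return zscore_outlier(history, value) and mad_outlier(history, value)
```

Requiring agreement trades some sensitivity for fewer false alarms; a monitoring system could instead report disagreements separately for manual review.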

Content generation

Writing scenes and logic templates

Across different fields, types, and subject matter there can be many writing scenes, and different writing scenes correspond to different writing requirements in many respects, including required data, common wording, common style, article length, and writing logic, of which the writing logic is the most critical and important. In order to store different writing scenes and their corresponding writing requirements in a structured way, according to an embodiment of the invention, various kinds of features can be used to describe a writing scene qualitatively. According to an embodiment of the invention, a framework that is readable, reusable, shareable, and easy to modify and maintain is also preconfigured for describing different writing requirements. In the context of this application, this framework is referred to as a "logic template".

A logic template has the writing logic at its core and also includes information such as required data, wording preferences, length limits, and sentiment polarity preferences. A logic template uses the sentence as its basic structure. Each logic template may include one or more candidate sentences, and each candidate sentence includes one or more candidate named entities, morphemes, and clauses. More specifically, every sentence includes replaceable expressions (entities, phrases, words, clauses, etc.), which are marked with special symbols. Filling a sentence with different entities, phrases, or words changes the expression and the semantics of the sentence; therefore, even from a single logic template, manuscripts with more varied expression can be created. The complexity of each logic template depends on the precision and complexity of the writing logic. The more precise and complex the writing logic (for example, financial reports and data analysis reports), the more complex the logic template; conversely (for example, news briefs), the simpler the logic template.
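A logic template of this kind can be illustrated with a small data structure. The sketch below is hypothetical: the `{slot}` marker syntax, the dictionary layout, and the rule of skipping sentences whose data is missing are assumptions for this example, since the text says only that replaceable expressions are marked with special symbols.

```python
import random
import re

# Replaceable expressions are marked as {slot}; this marker is an assumption.
SLOT = re.compile(r"\{(\w+)\}")

TEMPLATE = {
    "scene": "acquisition news brief",
    "sentences": [
        # Each inner list holds candidate sentences for one position.
        [
            "{acquirer} acquired {target} for {price}.",
            "{target} was acquired by {acquirer} for {price}.",
        ],
        ["The deal closed on {date}."],
    ],
}


def render(template, data, rng=None):
    """Pick one usable candidate per position and fill its slots from data."""
    rng = rng or random.Random(0)
    lines = []
    for candidates in template["sentences"]:
        usable = [c for c in candidates if all(s in data for s in SLOT.findall(c))]
        if usable:
            lines.append(rng.choice(usable).format(**data))
    return " ".join(lines)
```

Because several candidates exist for the first position, the same template and data can yield differently worded output, which mirrors the variety of expression described above.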

A logic template represents one line of writing that may be adopted under the corresponding writing scene, and a writing scene can have multiple logic templates. A logic template is a concrete embodiment of writing logic and requirements, and of writing experience and knowledge: it turns abstract experience into concrete text that can be read, modified, and shared across platforms. This is of great significance and value for passing on, learning, and improving content creation knowledge.

In practical applications, once all the data required by a logic template has been obtained through information collection, text parsing, and data analysis, one embodiment of the invention automatically selects expressions that match the sentiment polarity (for example, in the sports field: winning completely, winning by a narrow margin, regrettably losing, being defeated) and automatically generates the corresponding article according to the logic template.

Deep learning

According to an embodiment of the invention, a deep learning method is used to automatically generate a passage of text from input parameters. This passage is incorporated into the logic template as a text segment and finally becomes part of the whole article. Taking the generation of a product description as an example, when the specified product is clothing, the input parameters include dozens of features such as type, color, garment length, target customers, time to market, placket, material, place of origin, the common people, style, brand, and price.

The step of generating an article according to the named entities, entity relationships, and event morphemes and by applying the writing scene and logic template may include: automatically generating a paragraph from input parameters using a deep learning method. This paragraph can be incorporated into the logic template.

Based on the results of continued experiments and model iteration, and in order to ensure the coherence and correctness of the text, the deep learning method is mainly used to generate shorter text fragments.

Content association and aggregation

When generating content, it is usually necessary to aggregate several kinds of information to form a complete article. For example, when analyzing macroeconomic data, the basic data announced by officials must be associated with the analysis results of statisticians and the relevant comments of domain experts. Expert analysis often covers a wide range and varied content: it may analyze not only the factors that cause the data to change (for example, food, services, and education are mentioned when analyzing CPI) but also the influence of the data changes on macroeconomic control policy. Therefore, common text similarity algorithms are not suitable in such a scenario. According to an embodiment of the invention, a domain knowledge graph is established, and the entities, events, and sentiment polarities mentioned in the texts are analyzed and compared to calculate the degree of association between texts.

Taking CPI data as an example, after the official basic data is announced, the embodiment of the invention parses the collected expert views to judge whether the content of an expert view is related to CPI, whether it is consistent with the official data, whether its sentiment polarity is consistent with the data change, and so on. Views that are unrelated to the topic or that contain errors are screened out; the remaining expert views are further ranked and automatically selected, and then added to the article content as text segments.
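The screening and ranking of expert views can be sketched as a filter followed by a sort. All structures below are hypothetical: the view fields (`entities`, `direction`, `polarity`), the toy knowledge graph, and the use of shared-entity counts as the association score are assumptions, and the views are assumed to have already been parsed by the text parsing step.

```python
# Toy knowledge graph: topic -> related entities (illustrative only).
KNOWLEDGE_GRAPH = {"CPI": {"food", "services", "education", "inflation"}}


def association_score(view, topic):
    """Count how many entities in the view are related to the topic in the graph."""
    return len(view["entities"] & KNOWLEDGE_GRAPH.get(topic, set()))


def select_expert_views(views, topic, direction, acceptable_polarities, top_k=2):
    """Drop off-topic or inconsistent views, then keep the top_k most associated."""
    kept = [
        v
        for v in views
        if topic in v["entities"]                 # related to the topic
        and v["direction"] == direction           # consistent with official data
        and v["polarity"] in acceptable_polarities  # polarity fits the data change
    ]
    kept.sort(key=lambda v: association_score(v, topic), reverse=True)
    return kept[:top_k]
```

Which polarities count as consistent with a given data change is left to the caller here, since that mapping depends on the domain.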

Manuscript distribution

The intelligent writing platform supports docking with third-party data platforms, helping content creators publish their work in a timely and efficient manner. Through personalized configuration, manuscripts can be sent to platforms such as Weibo, WeChat, and enterprise CMS; this is implemented mainly through data interfaces.

Reader behavior analysis

In this step, click and reading information for articles published on the platforms can be obtained, for example: the number of reads, reading duration, number of reposts, number of likes, number of comments, comment content, and basic information about readers (age, occupation), etc. Based on these data, combined with information such as the article's topic, keywords, and sentiment polarity, technologies such as user profiling and big data analysis can be used to analyze the degree of preference for different topics and articles among readers of different ages, genders, occupations, and regions.
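A very small version of this preference analysis is an aggregation of reading events by reader group and topic. The event fields (`age_group`, `topic`, `read_seconds`) and the use of average reading duration as the preference measure are assumptions made for this sketch.

```python
from collections import defaultdict


def preference_by_group(events, group_key="age_group"):
    """Average reading duration per (reader group, topic) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for event in events:
        key = (event[group_key], event["topic"])
        totals[key][0] += event["read_seconds"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def favourite_topic(events, group, group_key="age_group"):
    """Topic with the highest average reading duration for one reader group."""
    prefs = preference_by_group(events, group_key)
    candidates = {topic: value for (g, topic), value in prefs.items() if g == group}
    return max(candidates, key=candidates.get) if candidates else None
```

Passing a different `group_key`, such as an occupation or region field, gives the same breakdown along another reader attribute.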

Diagnosis

The input of diagnosis is two types of information: (1) the errors found by editors, and the modifications they make to the manuscript, during content review and revision; (2) the analysis results of the reader behavior analysis module. Based on the first type of information, error statistics and error analysis can be performed for each step of the system, helping the system to iterate and optimize continuously according to actual conditions and thereby further improve system efficiency. The second type of information serves as a reference for distribution platforms and content creators when selecting articles and modifying content.

The automatic text writing method and system of the embodiments of the invention can automatically collect hundreds of information sources in fields such as finance and e-commerce, covering many authoritative information publishers such as national ministries and commissions, the secondary market, and experts' social media accounts. After the structured and unstructured (text) data therein are analyzed and processed, manuscripts in various languages (such as Chinese and English) can be generated dynamically.

In the above method, the various technologies and steps of the information collection, text parsing, and content generation parts can be omitted, added, or replaced according to specific needs. If, in practice, information such as the keywords, summary, or sentiment polarity of the text is not needed, the corresponding steps can be omitted. If data beyond the information extracted by the above method is needed, corresponding text parsing modules can be added, such as hot topic discovery and topic extraction. Likewise, different technical means can be chosen to achieve the same parsing purpose, for example, replacing rules with a machine learning model.

Although the invention has been described with reference to the present specific embodiments, those of ordinary skill in the art should recognize that the above embodiments are intended only to illustrate the invention, and that various equivalent changes or substitutions can be made without departing from the spirit of the invention. Therefore, any changes and modifications to the above embodiments within the spirit of the invention will fall within the scope of the following claims.

Claims

1. An automatic writing method for text, comprising the following steps:

an information gathering process, including: acquiring information from the internet, performing format conversion on the information, performing noise cleaning on the information, and performing preliminary data screening on the information to obtain a text, wherein the text includes an unstructured part;

a text parsing process, including: classifying the text, identifying named entities in the text according to the classification of the text, extracting entity relationships between the named entities in the text according to the classification of the text, and extracting, according to the classification of the text, event morphemes that can reflect events in the text;

a content generating process, including: preconfiguring one or more writing scenes, preconfiguring one or more logic templates, generating paragraphs according to the named entities, the entity relationships, and the event morphemes and by applying the writing scenes and logic templates, and identifying associated paragraphs and aggregating them into an article;

a product presentation process, including: distributing the article to one or more internet platforms;

a reader behavior analysis process, including: obtaining reader behavior information from the one or more internet platforms, and analyzing the reader behavior information,

wherein the information gathering process, the text parsing process, and the content generating process are adjusted according to the reader behavior information.

2. The automatic writing method for text according to claim 1, characterized in that the text parsing process further includes: extracting keywords refined in advance from the text.

3. The automatic writing method for text according to claim 1, characterized in that the text parsing process further includes: extracting key information from the text.

4. The automatic writing method for text according to claim 1, characterized in that the text parsing process further includes: extracting sentences that constitute a document summary from the text.

5. The automatic writing method for text according to claim 1, characterized in that the text parsing process further includes: analyzing the sentiment polarity of the text.

6. The automatic writing method for text according to claim 1, characterized by further including a data analysis process, the data analysis process including: performing numerical calculation and statistics on the data in the text, and monitoring whether abnormal values occur in the data in the text.

7. The automatic writing method for text according to claim 1, characterized in that the step of classifying the text includes classifying according to pre-established categories, wherein the method of pre-establishing the categories includes:

obtaining one or more set categories;

assigning a first part of a plurality of training texts to the one or more categories;

dividing a second part of the plurality of training texts, which cannot be assigned to the one or more categories, into one or more clusters;

receiving classification labels established for the one or more clusters.

8. The automatic writing method for text according to claim 1, characterized in that each logic template of the one or more logic templates includes one or more candidate sentences, and each candidate sentence includes one or more candidate named entities, morphemes, and clauses.

9. The automatic writing method for text according to claim 1, characterized in that the step of generating an article according to the named entities, the entity relationships, and the event morphemes and by applying the writing scenes and logic templates includes: automatically generating a paragraph from input parameters using a deep learning method, and incorporating the paragraph into the logic template.

10. An automatic writing system for text, including:

a memory for storing instructions executable by a processor;

a processor for executing the instructions to implement the method of any one of claims 1-9.