CN111859887A

CN111859887A - Scientific and technological news automatic writing system based on deep learning

Info

Publication number: CN111859887A
Application number: CN202010707063.6A
Authority: CN
Inventors: 刘超; 刘霖雯
Original assignee: Beijing Beidou Tianxun Technology Co Ltd
Current assignee: Zhengzhou chaos Information Technology Co.,Ltd.
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-10-30

Abstract

The invention discloses a deep-learning-based automatic writing system for science and technology news, in the technical field of news writing. The system comprises a web crawler module; a science and technology news preprocessing module; a science and technology news classification and clustering module; a science and technology news deep learning generation training module; an automatic news generation module; and a generated-news display module. The system enables rapid generation of science and technology news and can produce news in different styles to match, for example, the styles of different websites.

Description

Scientific and technological news automatic writing system based on deep learning

Technical Field

The invention relates to the technical field of news writing, and in particular to a deep-learning-based automatic writing system for science and technology news, used for information processing and manuscript writing for science and technology news.

Background

News falls into many categories, such as civil life, current affairs and military affairs. The columns and sections once seen only in print newspapers increasingly appear online, and news websites have proliferated.

Science and technology news reports recent facts in science and technology. Most of it is conference news, drawn from conference manuscripts and related reports and only rarely from dedicated interviews, so the source material is very important. Science and technology news also requires reporters to think rationally about science and technology. With the development of the internet and the growing number of scientific figures and events, the number of science-related reports increases every day, and the cost of producing them rises accordingly.

To reduce the cost of science and technology reporting, the inventors noted a recent research result from DeepMind that substantially improves the performance of the recurrent neural network (RNN), a deep learning network widely used in speech recognition, image recognition, semantic understanding and other fields. That research mainly augments a temporal generative model with external memory, and it offers some inspiration for research in the field of deep learning.

Based on a summary and analysis of science news content written by human authors, the invention discloses a writing system realized by machine learning methods.

Disclosure of Invention

The object of the invention is as follows: to generate science and technology news rapidly and to produce news in different styles according to, for example, the styles of different websites, the invention provides an automatic generation and writing system for science and technology news based on deep learning trained at large scale on massive data.

To achieve this object, the invention adopts the following technical solution: a deep-learning-based automatic writing system for science and technology news, characterized by comprising the following modules:

A web crawler module: collects science and technology news from the science and technology channels of general websites and the related content of dedicated science and technology websites, extracts the body text from the collected data, and stores the extracted data in a database;

A science and technology news preprocessing module: performs word segmentation, named entity recognition, entity relation extraction, syntactic analysis and semantic analysis on the collected news;

A science and technology news classification and clustering module: further refines the science and technology news content, applies intelligent classification and clustering techniques to classify the news in detail, trains on the news content using a deep learning generative memory model, and finally obtains a news generation model based on that generative memory model;

A science and technology news deep learning generation training module: uses a classification system based on SVM and on the deep learning model TextRNN, and at the same time applies an unsupervised clustering algorithm to news whose category is uncertain, using LDA-based automatic clustering to group content whose class attribution deviates from the classification threshold;

An automatic news generation module: once the user enters the keywords, writing style, time and other elements of the news to be written, the news generation model automatically searches for and writes the news content the user requires and displays it to the user;

A generated-news display module: transmits the news produced by the automatic news generation module to designated forums and news websites over a designated network protocol, where users score it; the quality of the generated news is evaluated and fed back to the fourth module for continuous optimization and improvement, ultimately yielding a version of basically readable news content.
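The hand-off between supervised classification and unsupervised clustering described above can be sketched as a simple confidence threshold. This is a minimal illustration, not the patented implementation: the function name `route_articles`, the score dictionaries and the threshold value 0.6 are all assumptions, and the scores stand in for the output of whatever SVM or TextRNN classifier is used.

```python
def route_articles(scored_articles, threshold=0.6):
    """Split articles into confidently classified ones and ones deferred to clustering.

    scored_articles maps an article id to a dict of {category: score}.
    The threshold is a hypothetical value; the source does not state one.
    """
    classified = {}
    deferred = []
    for article_id, scores in scored_articles.items():
        # Pick the category with the highest classifier score.
        label, confidence = max(scores.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            classified[article_id] = label
        else:
            # Uncertain category: hand over to unsupervised clustering (e.g. LDA).
            deferred.append(article_id)
    return classified, deferred


if __name__ == "__main__":
    demo = {
        "a1": {"ai": 0.9, "space": 0.1},
        "a2": {"ai": 0.5, "space": 0.5},
    }
    print(route_articles(demo))
```

Articles in the deferred list would then be passed to the clustering step, so that only confidently labeled news feeds the per-category training data.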

Further, the science and technology news preprocessing module comprises:

a news content word segmentation submodule: operates mainly on news body text and titles, converts traditional Chinese characters to simplified ones, unifies letter case, deletes invalid characters and the like, segments the processed content into words, and removes stop words to form the candidate data set for processing;

a news named entity recognition module: mainly recognizes the person names, place names, organization names, product names, technical terms, times of occurrence and the like in news;

a news entity relation extraction module: mainly extracts and refines the relations among the recognized entities; entities are recognized with CRF++, and entity relations are then extracted using HowNet and a manually built, annotated entity relation knowledge base, in preparation for the subsequent deep learning training;

a news text content analysis module: mainly performs syntactic analysis of the news content; the syntactic parser is based on the Stanford parser with Chinese support, analyzes the syntactic structure of each sentence and the contextual relations among sentences, and produces a labeled sequence from the parse;

a news text semantic analysis module: analyzes and processes the people, companies, scientific abbreviations, product abbreviations, company abbreviations and people's positions that appear in science and technology news reports, replaces and expands synonyms and near-synonyms using semantic resources, computes semantic relatedness using word2vec, and compiles a list of synonyms, near-synonyms and related words from the crawled text.

Further, the word segmentation submodule comprises a word segmentation system, which is an ansj-based word segmentation system with an embedded CRF++ named entity recognition component.
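The cleanup steps performed before segmentation (case unification, removal of invalid characters, stop-word filtering) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is hypothetical, whitespace splitting stands in for the ansj segmenter, and traditional-to-simplified conversion is omitted because the Python standard library does not provide it.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def normalize(text):
    """Unify character forms and case, and strip punctuation and symbols."""
    text = unicodedata.normalize("NFKC", text)  # fold full-width forms to ASCII
    text = text.lower()                         # unify letter case
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def tokenize(text):
    """Whitespace split as a stand-in for a real segmenter, then stop-word removal."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(tokenize("The LAUNCH of Ｂeidou-3, in 2020!"))
```

Because `\w` is Unicode-aware in Python 3, Chinese characters survive the cleanup and only punctuation is removed.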

Further, CRF++ is implemented in C++ and makes heavy use of STL data structures. After a close reading of the source code, the invention rewrites part of the STL-related code in C. For example, the tagger.cpp source file uses a vector<const char *> structure.

In addition, the original code does not release memory after the features are encoded. The invention replaces std::vector<std::vector<const char *> > TaggerImpl::x_ and forces the memory to be released immediately, which experimental comparison shows reduces memory use by 10%. As for the modification of L-BFGS in CRF++: as its name indicates, L-BFGS is an improvement of the BFGS algorithm, itself a quasi-Newton method. The basic idea of the L-BFGS algorithm is as follows: the algorithm stores and uses only the curvature information of the latest m iterations to construct an approximation of the Hessian matrix. The step length is optimized along the iteration direction and adjusted automatically according to the training results, which keeps the step from becoming too small or too large.
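The L-BFGS idea described above (keep only the latest m curvature pairs, then choose a step length along the resulting direction) can be sketched with the standard two-loop recursion and a backtracking line search. This is a generic textbook sketch in Python, not the modified routine inside CRF++; the function names, the history size m=5 and the Armijo constant 1e-4 are assumptions chosen for illustration.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, history):
    """Two-loop recursion: approximate -(inverse Hessian) * grad from stored (s, y) pairs."""
    q = list(grad)
    alphas = []
    for s, y in reversed(history):                 # newest pair first
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        alphas.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if history:                                    # initial scaling from the newest pair
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), (alpha, rho) in zip(history, reversed(alphas)):  # oldest pair first
        beta = rho * dot(y, q)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def minimize(f, grad_f, x0, m=5, max_iter=100, tol=1e-8):
    """Minimize f with L-BFGS, keeping only the latest m curvature pairs."""
    x, g = list(x0), grad_f(x0)
    history = deque(maxlen=m)                      # older pairs are dropped automatically
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = lbfgs_direction(g, history)
        fx, slope, step = f(x), dot(g, d), 1.0
        for _ in range(40):                        # backtracking (Armijo) line search
            candidate = [xi + step * di for xi, di in zip(x, d)]
            if f(candidate) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5                            # shrink a step that is too large
        g_new = grad_f(candidate)
        s = [a - b for a, b in zip(candidate, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:                      # keep only pairs with positive curvature
            history.append((s, y))
        x, g = candidate, g_new
    return x


if __name__ == "__main__":
    f = lambda v: (v[0] - 3.0) ** 2 + 10.0 * (v[1] + 1.0) ** 2
    grad = lambda v: [2.0 * (v[0] - 3.0), 20.0 * (v[1] + 1.0)]
    print(minimize(f, grad, [0.0, 0.0]))
```

The bounded `deque` is what makes the method "limited memory": storage grows with m times the number of parameters rather than with the square of the number of parameters.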

Furthermore, the news named entity recognition module is trained on corpora and uses a CRF++ model to recognize the person names, place names, organization names, product names, technical terms, times of occurrence and the like in news. Furthermore, the automatic news generation module comprises a user interaction module and a news generation module.

Furthermore, the user interaction module works mainly as follows: the user enters keywords for the science and technology article to be generated; the generated writing model then automatically searches for learned sentences related to those keywords and learns the relations among the keywords, so that the news content transitions smoothly up to the article and section level. The module combines multiple keywords and writing styles, analyzes and decomposes them with a recurrent neural network, and adds new mechanisms that store and protect long-range information in memory.
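The text above does not name the recurrent architecture. One common way for a recurrent network to store and protect long-range information is a gated cell such as an LSTM, and the sketch below shows a single-unit LSTM step purely as an illustration of that idea. The choice of LSTM, the scalar state and all parameter values are assumptions, not details taken from the source.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell with scalar input and scalar state.

    params maps each gate name to a tuple (w_x, w_h, b). All values are hypothetical.
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # cell state carries long-range information
    h = o * math.tanh(c)
    return h, c


if __name__ == "__main__":
    # Forget gate held open and input gate held shut: the cell state is protected.
    params = {
        "forget": (0.0, 0.0, 20.0),
        "input": (0.0, 0.0, -20.0),
        "output": (0.0, 0.0, 0.0),
        "candidate": (1.0, 1.0, 0.0),
    }
    h, c = 0.0, 0.7
    for x in [1.0, -1.0] * 25:
        h, c = lstm_step(x, h, c, params)
    print(h, c)
```

With the forget gate near one and the input gate near zero, the stored value passes through many steps almost unchanged, which is the sense in which a gate "protects" long-range information.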

The invention has the following beneficial effects:

1. The learning sources for writing come from the internet. A highly general-purpose crawler collects science and technology news data, which greatly improves collection speed, and it extracts the title, abstract and body text from the collected content. When a new type of news is found, its data can be collected and stored quickly by configuring a collection source. This addresses the large amount of manpower and material resources that science and technology news writing requires and reduces labor and time costs.

2. In the preprocessing stage, an independently developed intelligent word segmentation system is used, which effectively improves segmentation accuracy and provides a good foundation for data processing.

3. The invention uses the improved CRF++ for named entity recognition and entity relation extraction, which gives higher recognition accuracy. It also performs syntactic and semantic analysis, which segments article content effectively and greatly improves the generalization ability of the final deep learning model.

4. The invention uses intelligent classification and clustering algorithms to aggregate the collected news effectively, which makes it easier for deep learning to learn each writing style.

5. The invention uses deep learning to learn the writing style, writing mode, content characteristics, length, scenario and the like of each collected and classified piece of science and technology news, and thereby generates a writing model.

6. The science and technology news generation system is simple to use: entering a few keywords, a writing style and a news subcategory is enough to produce a draft quickly, so writing is fast and inexpensive.

7. The method applies to science and technology news writing and, as its generalization ability improves, can be extended quickly to other fields, so it is readily extensible.

Drawings

FIG. 1 is a schematic diagram of the overall architecture of the present invention;

FIG. 2 is a content processing flow diagram of the present invention;

FIG. 3 is a diagram of the training model generation for automatically generating news in accordance with the present invention;

FIG. 4 is a flow chart of an implementation of the present invention for automatically generating news.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.

In the description of the embodiments of the present invention, it should be noted that the terms "inside", "outside", "upper", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings or orientations or positional relationships conventionally arranged when products of the present invention are used, and are only used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the devices or elements indicated must have specific orientations, be constructed in specific orientations, and operated, and thus, cannot be construed as limiting the present invention.

Example 1

As shown in FIGS. 1 to 4, a deep-learning-based automatic writing system for science and technology news comprises the following modules:

The science and technology news preprocessing module comprises:

The user interaction module works mainly as follows: the user enters keywords for the science and technology article to be generated; the generated writing model then automatically searches for learned sentences related to those keywords and learns the relations among the keywords, so that the news content transitions smoothly up to the article and section level. The module combines multiple keywords and writing styles, analyzes and decomposes them with a recurrent neural network, and adds new mechanisms that store and protect long-range information in memory.

Example 2

As shown in FIGS. 1 to 4, this embodiment further improves on Example 1 in order to reduce memory usage. Specifically: CRF++ is implemented in C++ and makes heavy use of STL data structures. After a close reading of the source code, part of the STL-related code is rewritten in C. For example, the tagger.cpp source file uses a vector<const char *> structure.

In addition, the original code does not release memory after the features are encoded. The invention replaces std::vector<std::vector<const char *> > TaggerImpl::x_ and forces the memory to be released immediately, which experimental comparison shows reduces memory use by 10%. As for the modification of L-BFGS in CRF++: as its name indicates, L-BFGS is an improvement of the BFGS algorithm, itself a quasi-Newton method. The basic idea of the L-BFGS algorithm is as follows: the algorithm stores and uses only the curvature information of the latest m iterations to construct an approximation of the Hessian matrix. The step length is optimized along the iteration direction and adjusted automatically according to the training results, which keeps the step from becoming too small or too large.

Example 3

As shown in FIGS. 1 to 4, the news named entity recognition module is trained on corpora and uses a CRF++ model to recognize the person names, place names, organization names, product names, technical terms, times of occurrence and the like in news. The automatic news generation module comprises a user interaction module and a news generation module.
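Linear-chain CRF taggers such as CRF++ label a sentence by finding the highest-scoring tag sequence, usually with Viterbi decoding. The sketch below illustrates that decoding step on BIO-style entity tags. It is not CRF++ code: the tag set and the hand-written `emit` and `trans` scoring functions are hypothetical stand-ins for weights that a trained model would supply, and the input is assumed to be non-empty.

```python
def viterbi(observations, tags, emit, trans):
    """Return the highest-scoring tag sequence under additive scores."""
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emit(observations[0], tag), None) for tag in tags}]
    for obs in observations[1:]:
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + trans(p, tag))
            score = best[-1][prev][0] + trans(prev, tag) + emit(obs, tag)
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


TAGS = ["B-ORG", "I-ORG", "O"]


def emit(word, tag):
    """Toy emission score: capitalized words look like organization names."""
    if word[0].isupper():
        return 0.0 if tag == "O" else 1.0
    return 1.0 if tag == "O" else -1.0


def trans(prev_tag, tag):
    """Toy transition score: I-ORG should follow B-ORG, never O."""
    if tag == "I-ORG" and prev_tag == "O":
        return -10.0
    if tag == "I-ORG" and prev_tag == "B-ORG":
        return 0.5
    return 0.0


if __name__ == "__main__":
    words = ["Beidou", "Tianxun", "released", "news"]
    print(list(zip(words, viterbi(words, TAGS, emit, trans))))
```

In a trained CRF the emission and transition scores come from learned feature weights, but the decoding procedure has the same shape.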

Claims

1. A deep-learning-based automatic writing system for science and technology news, characterized by comprising the following modules:

2. The deep-learning-based automatic writing system for science and technology news of claim 1, wherein the science and technology news preprocessing module comprises:

a news text semantic analysis module, which analyzes and processes the people, companies, scientific abbreviations, product abbreviations, company abbreviations and people's positions that appear in science and technology news reports, replaces and expands synonyms and near-synonyms using semantic resources, computes semantic relatedness using word2vec, and compiles a list of synonyms, near-synonyms and related words from the crawled text.

3. The deep-learning-based automatic writing system for science and technology news of claim 2, wherein the word segmentation submodule comprises a word segmentation system, which is an ansj-based word segmentation system with an embedded CRF++ named entity recognition component.

4. The deep-learning-based automatic writing system for science and technology news of claim 3, wherein CRF++ is implemented in C++ and makes heavy use of STL data structures, and part of the STL-related code in the source code is rewritten in C after a close reading of the source code.

5. The deep-learning-based automatic writing system for science and technology news of claim 2, wherein the news named entity recognition module is trained on corpora and uses a CRF++ model to recognize the person names, place names, organization names, product names, technical terms, times of occurrence and the like in news.

6. The deep-learning-based automatic writing system for science and technology news of claim 1, wherein the automatic news generation module comprises a user interaction module and a news generation module.

7. The deep-learning-based automatic writing system for science and technology news of claim 6, wherein, in the user interaction module, the user enters keywords for the science and technology article to be generated; the generated writing model then automatically searches for learned sentences related to those keywords and learns the relations among the keywords, so that the news content transitions smoothly up to the article and section level; and the module combines multiple keywords and writing styles, analyzes and decomposes them with a recurrent neural network, and adds new mechanisms that store and protect long-range information in memory.