CN112115256A

CN112115256A - Method and device for generating news text abstract integrated with Chinese stroke information

Info

Publication number: CN112115256A
Application number: CN202010970430.1A
Authority: CN
Inventors: 周士华; 颜静; 王宾; 吕卉
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2020-12-22

Abstract

The invention provides a method and a device for generating a news text abstract integrated with Chinese stroke information. The method comprises the following steps: acquiring news text data; preprocessing the news text data to obtain a word-segmented text of the news text data; scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form; generating a stroke-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedded vector according to the stroke-based embeddings of its words; and representing the embedded vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the summary output according to the sentence scores. The invention uses a TextRank algorithm that incorporates stroke information: a cw2vec model generates word vectors based on stroke information, and the text abstract of the news is generated through iteration.

Description

Method and device for generating news text abstract integrated with Chinese stroke information

Technical Field

The invention relates to the field of text abstract generation in natural language processing, in particular to a method and a device for generating a news text abstract integrated with Chinese stroke information.

Background

The TextRank algorithm is an effective abstract generation algorithm whose advantages are high speed and the fact that it requires no supervision. The traditional TextRank algorithm has some shortcomings, however; for example, it considers only discrete information such as word frequency. Later improvements fused the TextRank algorithm with word vector representation techniques, which improved the quality of the generated abstracts. However, the currently popular word vector models, such as Word2vec, FastText and BERT, were all designed for Western languages, so they cannot make effective use of the semantic information contained in Chinese characters.

Disclosure of Invention

The invention provides a method and a device for generating a news text abstract integrated with Chinese stroke information. The traditional TextRank algorithm is fused with stroke information: a cw2vec model maps each word in a sentence into a high-dimensional vector space according to the stroke information in the text, forming a sentence vector that incorporates Chinese stroke information, and the TextRank algorithm is then iterated to generate the abstract of the text. The invention solves the problem that existing methods cannot make effective use of the semantic information contained in Chinese characters.

The technical means adopted by the invention are as follows:

A method for generating a news text abstract integrated with Chinese stroke information comprises the following steps:

acquiring news text data, wherein the news text data comprises news titles and texts;

preprocessing the news text data to obtain word segmentation texts of the news text data;

scanning each word in the word segmentation text to obtain a stroke dictionary of the word segmentation text, and converting the stroke dictionary into a vector form;

generating a stroke-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedded vector according to the stroke-based embeddings of its words;

and representing the embedded vector as a directed graph, iterating the directed graph by using a TextRank algorithm to obtain the score of each sentence, and generating a summary output according to the score of each sentence.

Further, preprocessing the news text data to obtain a word segmentation text of the news text data, including:

dividing the news text data into a plurality of sentences according to punctuations of Chinese;

sequentially carrying out data cleaning on each sentence, and deleting repeated data and invalid data;

and performing a word segmentation operation on the cleaned sentences, separating the words with pause marks, thereby obtaining the word-segmented text of the news text data.

Further, generating a stroke-based embedding of each word from the vector-form stroke dictionary comprises:

calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary;

based on the similarity, probabilistically modeling the context word given the current word, and generating the stroke-based embedding of the word through the model.

An apparatus for generating news text summary merged with Chinese stroke information, comprising:

an acquisition unit for acquiring news text data, the news text data comprising news titles and body texts;

the preprocessing unit is used for preprocessing the news text data to obtain word segmentation texts of the news text data;

the stroke dictionary generating unit is used for scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form;

an embedded vector generating unit for generating a stroke-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedded vector according to the stroke-based embeddings of its words;

and the output unit is used for representing the embedded vector as a directed graph, iterating the directed graph by using a TextRank algorithm to obtain the score of each sentence, and generating abstract output according to the score of each sentence.

Further, the preprocessing unit includes:

the sentence dividing module is used for dividing the news text data into a plurality of sentences according to punctuation marks of Chinese;

the cleaning module is used for cleaning data of each sentence in sequence and deleting repeated data and invalid data;

and the word segmentation module is used for performing a word segmentation operation on the cleaned sentences and separating the words with pause marks, thereby obtaining the word-segmented text of the news text data.

Further, the embedded vector generation unit includes:

the similarity calculation module is used for calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary;

and the embedding generation module is used for probabilistically modeling the context word given the current word based on the similarity, and generating the stroke-based embedding of the word through the model.

Compared with the prior art, the invention has the following advantages:

The invention provides a text abstract generating method based on a TextRank algorithm that incorporates stroke information. Word vectors are generated with a cw2vec model, so that the information hidden in Chinese strokes is mined and used. This improves the quality of the text abstracts produced by the traditional TextRank algorithm, and the generated abstract is superior to those of other TextRank-based text abstract generation algorithms in precision, recall and the comprehensive evaluation index F1.

For the above reasons, the present invention can be widely applied to the field of natural language processing.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a summary generation method of the present invention.

FIG. 2 is a diagram of analysis of results generated in the examples.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention provides a method for generating a news text abstract integrated with Chinese stroke information. The traditional TextRank algorithm is fused with stroke information: a cw2vec model maps each word in a sentence into a high-dimensional vector space according to the stroke information in the text, forming a sentence vector that incorporates Chinese stroke information, and the TextRank algorithm is then iterated to generate the abstract of the text. The method specifically comprises the following steps: acquiring news text data, the news text data comprising news titles and body texts; preprocessing the news text data to obtain a word-segmented text of the news text data; scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form; generating a stroke-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedded vector according to the stroke-based embeddings of its words; and representing the embedded vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the summary output according to the sentence scores.

The embodiments of the present invention will be described in detail with reference to the accompanying drawings.

A method for generating a news text summary integrated with Chinese stroke information, as shown in FIG. 1, includes the following steps:

Step 1: Collect news text data (including body text and title) and preprocess it, mainly by deleting repeated data and invalid data and by performing sentence segmentation and word segmentation on the data.

Specifically, a piece of news text is arbitrarily selected as an implementation case, and examples are as follows:

"Clinical observation and preliminary basic research show that the lung-clearing and toxin-expelling decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia, and that it is fast-acting, highly effective and safe. The improvement in clinical symptoms is very obvious, for example fever symptoms disappear within 3 days; after 6 days of treatment, the absorption of lung CT lesions is also very obvious; the nucleic acid negative conversion rate is one hundred percent, and the average time to negative conversion is about 10 days. Wang Wei stated that a group of experts has studied the material basis of the lung-clearing and toxin-expelling decoction, and that the medicine acts on novel coronavirus pneumonia through multiple components and multiple links. In particular, it can effectively inhibit the production of endotoxin and can avoid or delay the occurrence of an inflammatory storm, making it an important means of preventing and treating novel coronavirus pneumonia. The average cost of treating a severe patient is about 150,000 yuan, whereas with this formula one 3-day course of treatment costs about 100 yuan and two courses cost about 200 yuan, which also greatly saves medical insurance costs."

The text is then processed. First, the case news text is divided into 7 sentences according to Chinese punctuation marks; during sentence segmentation, marks such as "。", "；", "？", "……" and "！" serve as sentence delimiters. A word segmentation operation is then performed on the case text, with the words separated by pause marks. The news text after word segmentation is as follows: "clinical, observation, and, preliminary, basic, research, prescription …… medical insurance, expense, aspect, also, substantial, savings".
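The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the delimiter set is only the marks named in the text, "invalid data" is assumed to mean sentences with no word characters, and the default tokenizer is a character-level stand-in for a real Chinese word segmenter, which the text does not name.

```python
import re

# Sentence-ending marks: "……" (matched as one unit), "。", "；", "？", "！"
# and their ASCII counterparts. Written as escapes to avoid encoding issues.
_SENTENCE_END = re.compile("(\u2026\u2026|[\u3002\uff1b;\uff1f?\uff01!])")

PAUSE_MARK = "\u3001"  # "、", assumed here to be the "pause mark" separator


def split_sentences(text):
    """Split text after each sentence-ending mark, keeping the mark."""
    parts = _SENTENCE_END.split(text)  # [text, mark, text, mark, ..., tail]
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences


def clean(sentences):
    """Drop repeated sentences and sentences with no word characters."""
    seen = set()
    kept = []
    for s in sentences:
        if s in seen or not re.search(r"\w", s):
            continue
        seen.add(s)
        kept.append(s)
    return kept


def segment(sentence, tokenizer=None):
    """Join tokens with the pause mark; tokenizer is a hypothetical hook."""
    tokens = tokenizer(sentence) if tokenizer else re.findall(r"\w", sentence)
    return PAUSE_MARK.join(tokens)
```

In practice the `tokenizer` argument would be replaced by a proper Chinese word segmentation function; the character-level default only keeps the sketch self-contained.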

Step 2: Scan each word in the whole corpus to obtain a stroke n-gram dictionary S, let S(w) denote the set of stroke n-grams of a word w, and calculate the similarity function between w and c.

Specifically, each word in the case news text is scanned to obtain the stroke n-gram dictionary S; the dictionary set S is as follows: - ……. The dictionary set is then converted into vector form according to the stroke correspondence information, for example S(w) = [22214 …… 2441121222]. Denoting the current word and the context word by w and c respectively, the similarity function between w and c is as follows:

$$sim(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c} \qquad (1)$$

where q is a stroke n-gram in the set S(w), and $\vec{q}$ and $\vec{c}$ are the vector representations of the stroke n-gram q and the context word c, respectively.

The stroke to number correspondence table is as follows:

For example, the strokes of "clinical" are denoted [2221431425 ……] and the strokes of "observation" are denoted [5425235431 ……]; the specific representation steps are described in Shaosheng Cao, Wei Lu, Jun Zhou, Xiaolong Li, 2018, "cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information", AI Department, Ant Financial Services Group. The calculated sim value between these two words is 247. The strokes of "substantial" are denoted [3325112341 ……] and the strokes of "savings" are denoted [1225223411 ……], and the calculated sim value between these two words is 138.
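As a rough illustration of Step 2, the sketch below extracts stroke n-grams from a word's stroke digit string and sums the dot products between each n-gram vector and the context word vector. The function names, the n-gram length range and the lookup table `ngram_vecs` are assumptions made for illustration; the text does not state the n-gram lengths, and n-grams missing from the table are simply skipped here.

```python
def stroke_ngrams(strokes, n_min=3, n_max=5):
    """Return the distinct contiguous stroke n-grams of length n_min..n_max.

    `strokes` is a digit string such as "2221431425"; the length range is
    an illustrative default, not a value taken from the text.
    """
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            grams.append(strokes[i:i + n])
    return list(dict.fromkeys(grams))  # de-duplicate, keep first-seen order


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sim(word_strokes, context_vec, ngram_vecs, n_min=3, n_max=5):
    """Sum over q in S(w) of (vector of q) . (vector of c)."""
    return sum(
        dot(ngram_vecs[q], context_vec)
        for q in stroke_ngrams(word_strokes, n_min, n_max)
        if q in ngram_vecs  # assumption: unknown n-grams contribute nothing
    )
```

For instance, `stroke_ngrams("12345", 3, 4)` yields the three 3-grams followed by the two 4-grams of that stroke string.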

Step 3: The context word c is predicted from the current word w, with a Softmax function used for the probabilistic modeling:

$$p(c \mid w) = \frac{\exp(sim(w, c))}{\sum_{c' \in V} \exp(sim(w, c'))} \qquad (2)$$

where c′ is a word in the vocabulary V.
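The Softmax modeling in Step 3 can be illustrated with a small sketch that turns similarity scores into a probability distribution over candidate context words. The function name and the dictionary layout are hypothetical; subtracting the maximum score before exponentiating is a common numerical safeguard that leaves the probabilities unchanged.

```python
import math


def context_probabilities(sim_scores):
    """Map {candidate context word: sim(w, c')} to {word: p(c' | w)}."""
    peak = max(sim_scores.values())
    # Shift by the maximum so that exp() cannot overflow for large scores.
    exps = {c: math.exp(s - peak) for c, s in sim_scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

With scores as far apart as the sim values 247 and 138 quoted in the example, almost all probability mass falls on the higher-scoring word.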

An objective function defined over the whole corpus is generated and optimized with a standard gradient-based method. During word embedding, to simplify the calculation, a series of incorrect words, namely negative samples, is repeatedly drawn at random so that the model repeatedly solves a binary classification problem; this raises the quality of the output embeddings and optimizes the model. After the learning process, the embedding of each word is output directly. The objective function is:

$$L = \sum_{w \in D} \sum_{c \in T(w)} \log \sigma(sim(w, c)) + \lambda E_{c' \sim P}[\log \sigma(-sim(w, c'))] \qquad (3)$$

where L is the objective function, λ is the number of negative samples, $E_{c' \sim P}[\cdot]$ is the expectation term in which the selected negative sample c′ follows the distribution P, and the activation function σ is the sigmoid function. T(w) is the set of context words of the current word w, and D is the set of all words in the corpus.
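To make the negative-sampling term concrete, the sketch below evaluates the contribution of a single (current word, context word) pair to the objective. It assumes the usual Monte Carlo reading in which the term λE[·] is approximated by summing over λ sampled negative words; the function names are hypothetical and the sampling itself is left out.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_objective(sim_pos, sim_negs):
    """log sigma(sim(w, c)) plus the sum of log sigma(-sim(w, c')).

    sim_pos  -- sim(w, c) for the observed context word c
    sim_negs -- sim(w, c') for each sampled negative word c'
                (assumption: len(sim_negs) plays the role of lambda)
    """
    return log_sigmoid(sim_pos) + sum(log_sigmoid(-s) for s in sim_negs)
```

Summing `pair_objective` over every word in the corpus and every context word in its window would give an estimate of the full objective, which a gradient-based optimizer would then maximize.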

After learning, each word has an embedding based on stroke information. For example, the embedding of the word "clinical" is [0.683, 0.352, 0.289 …… 0.369, 0.479], and the embedding of the word "observation" is [0.257, 0.326, 0.814 …… 0.423, 0.125].

Step 4: According to the stroke-fused embedding of each word, each sentence is represented as an embedded vector. In this embodiment, the first sentence of the news text, "Clinical observation and preliminary basic research show that the lung-clearing and toxin-expelling decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia, and that it is fast-acting, highly effective and safe", is represented as [0.683, 0.352, 0.289 …… 0.148, 0.452].
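The text does not say how word embeddings are combined into a sentence vector. One common choice, shown here purely as an assumption, is to average the embeddings of the words in the sentence; the function name and the handling of out-of-vocabulary words are likewise illustrative.

```python
def sentence_vector(words, embeddings):
    """Average the embeddings of the words found in `embeddings`.

    Averaging is an assumed pooling rule, not one stated in the text.
    Words without an embedding are ignored; an empty list is returned
    when no word of the sentence has an embedding.
    """
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return []
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
```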

Step 5: The sentences in the case are represented as a directed graph, and iteration is carried out with the TextRank algorithm, whose iteration formula is as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{W_{ji}}{\sum_{V_k \in Out(V_j)} W_{jk}} WS(V_j) \qquad (4)$$

where WS(V_i) is the score of node V_i, and W_ij is the weight between two nodes, expressed as the similarity between the two nodes, i.e. [0.56, 0.25, 0.36, 0.38 …… 0.12] in this case. In(V_i) is the set of all nodes pointing to node V_i, and Out(V_i) is the set of all nodes that node V_i points to; in this case, In(V_i) and Out(V_i) are the 9 sentences in the text. d is the damping coefficient, typically 0.85. The scores of all nodes are initialized to 1 before the algorithm runs, and iteration stops when the error of the nodes is less than 0.0001.

Further, the TextRank algorithm is derived from the conventional web page ranking algorithm PageRank, in which pages that reference one another are nodes pointing to one another. In sentence ranking, a similarity can be calculated for any two sentences, so in this application every sentence is both a node that is pointed to and a node from which edges originate, and the weight of each edge is the similarity value.
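The iteration in Step 5 can be sketched as below. The damping coefficient 0.85, the initial score of 1 and the stopping threshold 0.0001 come from the text; the use of cosine similarity as the edge weight, the clamping of negative similarities to zero, the synchronous update and the `max_iter` cap are assumptions added for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def textrank(vectors, d=0.85, tol=1e-4, max_iter=200):
    """Score sentence vectors on a fully connected weighted graph."""
    n = len(vectors)
    if n == 0:
        return []
    # Edge weights: similarity between sentence vectors, no self-loops.
    # Negative similarities are clamped to 0 (assumption).
    w = [[max(0.0, cosine(vectors[i], vectors[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n  # all node scores start at 1
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:  # stop once the largest score change is below tol
            break
    return scores
```

Because every pair of sentences has a similarity, each node both receives and emits edges, which is why the graph is built as fully connected.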

Step 6: The nine sentences of the case news text are ranked according to their scores, which are, respectively: 0.8616, 0.0956, 0.3301, 0.1258, 0.5611, 0.1251, 0.4425, 0.7422, 0.5211. The two sentences with the highest scores are: (1) "Clinical observation and preliminary basic research show that the lung-clearing and toxin-expelling decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia, and that it is fast-acting, highly effective and safe." (2) "In particular, it can effectively inhibit the production of endotoxin and can avoid or delay the occurrence of an inflammatory storm, making it an important means of preventing and treating novel coronavirus pneumonia." The text summary generated for this case is therefore: "Clinical observation and preliminary basic research show that the lung-clearing and toxin-expelling decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia, and that it is fast-acting, highly effective and safe. In particular, it can effectively inhibit the production of endotoxin and can avoid or delay the occurrence of an inflammatory storm, making it an important means of preventing and treating novel coronavirus pneumonia."
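Selecting the summary from the scores can be sketched as follows. The function name is hypothetical, and emitting the chosen sentences in their original document order is an assumption, although it matches the order of the summary in this case.

```python
def select_summary(sentences, scores, k=2):
    """Return the k highest-scoring sentences in document order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# The nine scores listed in Step 6; placeholder labels stand in for the text.
case_scores = [0.8616, 0.0956, 0.3301, 0.1258, 0.5611,
               0.1251, 0.4425, 0.7422, 0.5211]
case_labels = ["sentence %d" % (i + 1) for i in range(9)]
summary = select_summary(case_labels, case_scores, k=2)
```

With these nine scores, the two highest values are 0.8616 and 0.7422, so the first and eighth entries are selected.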

To illustrate the effect of the invention, in this embodiment the summary generated by the above steps is evaluated with the ROUGE method, and the generated summary is compared with a reference summary to demonstrate the effectiveness of the method. The summary generated by the system is evaluated with the conventional Rouge1, Rouge2 and RougeL measures, each reported as precision p, recall r and the comprehensive evaluation index F1 (denoted F). Precision p is the proportion of the generated summary made up of words that co-occur in the generated summary and the reference summary. Recall r is the proportion of the reference summary made up of those co-occurring words. The comprehensive evaluation index F1 considers precision and recall together. The ROUGE evaluation is carried out with the open source tool ROUGE-1.5.5. The reference summary of the case is: "Observation and experiments show that the lung-clearing and toxin-expelling decoction is an important means of preventing and treating novel coronavirus pneumonia." As shown in FIG. 2, the calculated precision p, recall r and comprehensive evaluation index F1 of Rouge1 for the generated summary are Rouge1-p: 0.227, Rouge1-r: 0.833 and Rouge1-F: 0.357; those of Rouge2 are Rouge2-p: 0.139, Rouge2-r: 0.545 and Rouge2-F: 0.222; and those of RougeL are RougeL-p: 0.270, RougeL-r: 0.833 and RougeL-F: 0.408. The result is superior to the conventional TextRank algorithm and to TextRank algorithms that use other word vector representations in precision, recall and the comprehensive evaluation index F1.
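For orientation, a simplified n-gram overlap calculation in the spirit of Rouge-N is sketched below. It is not the ROUGE-1.5.5 tool and omits that tool's options (stemming, stop-word handling and so on); the function names are hypothetical. Precision divides the overlap by the n-gram count of the generated summary, recall divides it by the n-gram count of the reference summary, and F1 is their harmonic mean.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def rouge_n(generated, reference, n=1):
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    overlap = sum((gen & ref).values())  # per-n-gram minimum of the counts
    p = overlap / sum(gen.values()) if gen else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    return p, r, f1(p, r)
```

Applying `f1` to the reported precision and recall pairs reproduces the reported F values to three decimals, for example `f1(0.227, 0.833)` rounds to 0.357.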

The invention also provides a device for generating a news text abstract integrated with Chinese stroke information, comprising: an acquisition unit, a preprocessing unit, a stroke dictionary generating unit, an embedded vector generating unit, an output unit and an evaluation unit.

The acquisition unit is used for acquiring news text data, and the news text data comprises news titles and texts.

The preprocessing unit is used for preprocessing the news text data to obtain the word-segmented text of the news text data. The preprocessing unit includes: a sentence dividing module for dividing the news text data into a plurality of sentences according to Chinese punctuation marks; a cleaning module for cleaning each sentence in turn and deleting repeated data and invalid data; and a word segmentation module for performing a word segmentation operation on the cleaned sentences and separating the words with pause marks, thereby obtaining the word-segmented text of the news text data.

The embedded vector generating unit is used for generating a stroke-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedded vector according to the stroke-based embeddings of its words. Further, the embedded vector generating unit includes: a similarity calculation module for calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary; and an embedding generation module for probabilistically modeling the context word given the current word based on the similarity, and generating the stroke-based embedding of the word through the model.

And the evaluation unit is used for evaluating the generated abstract.

Since the apparatus embodiment of the present invention corresponds to the method embodiment above, it is described only briefly; for the relevant details, refer to the description of the method embodiment, which is not repeated here.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for generating news text abstract merged with Chinese stroke information is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the step of preprocessing the news text data to obtain a segmented text of the news text data comprises:

3. The method of claim 1, wherein generating stroke information based embedding of words based on a vector-based stroke dictionary comprises:

4. An apparatus for generating a news text summary incorporating stroke information of Chinese, comprising:

5. The apparatus for generating a news text summary blended with Chinese stroke information as claimed in claim 4, wherein said preprocessing unit comprises:

6. The apparatus of claim 4, wherein the embedded vector generation unit comprises: