CN112115256A - Method and device for generating news text abstract integrated with Chinese stroke information - Google Patents

Info

Publication number
CN112115256A
Authority
CN
China
Prior art keywords
word
stroke
generating
news text
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010970430.1A
Other languages
Chinese (zh)
Inventor
周士华
颜静
王宾
吕卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202010970430.1A priority Critical patent/CN112115256A/en
Publication of CN112115256A publication Critical patent/CN112115256A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention provides a method and a device for generating a news text abstract integrated with Chinese stroke information. The method comprises the following steps: acquiring news text data; preprocessing the news text data to obtain a word-segmented text of the news text data; scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form; generating a stroke-information-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedding vector according to the stroke-information-based embeddings of its words; and representing the embedding vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the abstract output according to the sentence scores. The invention uses a TextRank algorithm fused with stroke information: a cw2vec model generates word vectors based on stroke information, and the text abstract of the news is generated through iteration.

Description

Method and device for generating news text abstract integrated with Chinese stroke information
Technical Field
The invention relates to the field of text abstract generation in natural language processing, in particular to a method and a device for generating a news text abstract integrated with Chinese stroke information.
Background
The TextRank algorithm is an effective abstract generation algorithm; it is fast and requires no supervision. The traditional TextRank algorithm nevertheless has some disadvantages, for example that it considers only discrete information such as word frequency. Later improvements fused the TextRank algorithm with word vector representation techniques, which improved the quality of the generated abstracts. However, the currently popular word vector models, such as Word2vec, FastText and BERT, were all designed around Western languages, so they cannot effectively exploit the semantic information contained in Chinese characters.
Disclosure of Invention
The invention provides a method and a device for generating a news text abstract integrated with Chinese stroke information. The traditional TextRank algorithm is fused with stroke information: a cw2vec model maps each word in a sentence into a high-dimensional vector space according to the stroke information in the text, sentence vectors fused with the Chinese stroke information are formed, and the TextRank algorithm is then iterated to generate the abstract of the text. The invention thereby addresses the inability of existing methods to make effective use of the semantic information in Chinese characters.
The technical means adopted by the invention are as follows:
A method for generating a news text abstract integrated with Chinese stroke information comprises the following steps:
acquiring news text data, the news text data comprising news titles and body text;
preprocessing the news text data to obtain a word-segmented text of the news text data;
scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form;
generating a stroke-information-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedding vector according to the stroke-information-based embeddings of its words;
and representing the embedding vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the abstract output according to the sentence scores.
Further, preprocessing the news text data to obtain the word-segmented text of the news text data comprises:
dividing the news text data into a plurality of sentences according to Chinese punctuation marks;
cleaning each sentence in turn, deleting repeated data and invalid data;
and performing word segmentation on the cleaned sentences and separating the words with enumeration commas (、), thereby obtaining the word-segmented text of the news text data.
Further, generating the stroke-information-based embedding of each word from the vector-form stroke dictionary comprises:
calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary;
and, based on the similarity, building a probability model of the context word given the current word, and generating the stroke-information-based embedding of the word through the model.
An apparatus for generating a news text abstract integrated with Chinese stroke information comprises:
an acquisition unit for acquiring news text data, the news text data comprising news titles and body text;
a preprocessing unit for preprocessing the news text data to obtain a word-segmented text of the news text data;
a stroke dictionary generating unit for scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form;
an embedded vector generating unit for generating a stroke-information-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedding vector according to the stroke-information-based embeddings of its words;
and an output unit for representing the embedding vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the abstract output according to the sentence scores.
Further, the preprocessing unit includes:
a sentence dividing module for dividing the news text data into a plurality of sentences according to Chinese punctuation marks;
a cleaning module for cleaning each sentence in turn, deleting repeated data and invalid data;
and a word segmentation module for performing word segmentation on the cleaned sentences and separating the words with enumeration commas (、), thereby obtaining the word-segmented text of the news text data.
Further, the embedded vector generating unit includes:
a similarity calculation module for calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary;
and an embedding generation module for building, based on the similarity, a probability model of the context word given the current word, and generating the stroke-information-based embedding of the word through the model.
Compared with the prior art, the invention has the following advantages:
The invention provides a text abstract generation method based on a TextRank algorithm fused with stroke information. Word vectors are generated with a cw2vec model, so that the information hidden in Chinese strokes is mined and used, and the quality of the abstracts produced by the traditional TextRank algorithm is improved. The generated abstract is superior to those of other TextRank-based abstract generation algorithms in precision, recall and the comprehensive evaluation index F1.
For the above reasons, the present invention can be widely applied to the field of natural language processing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a summary generation method of the present invention.
FIG. 2 is a diagram of analysis of results generated in the examples.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a method for generating a news text abstract integrated with Chinese stroke information. The traditional TextRank algorithm is fused with stroke information: a cw2vec model maps each word in a sentence into a high-dimensional vector space according to the stroke information in the text, sentence vectors fused with the Chinese stroke information are formed, and the TextRank algorithm is then iterated to generate the abstract of the text. The method specifically comprises the following steps: acquiring news text data, the news text data comprising news titles and body text; preprocessing the news text data to obtain a word-segmented text of the news text data; scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form; generating a stroke-information-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedding vector according to the stroke-information-based embeddings of its words; and representing the embedding vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the abstract output according to the sentence scores.
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
A method for generating a news text summary integrated with chinese stroke information, as shown in fig. 1, includes the following steps:
Step 1: collect news text data (including body text and titles) and preprocess it, mainly by deleting repeated data and invalid data and by splitting the data into sentences and words.
Specifically, one piece of news text is chosen arbitrarily as the implementation case:
"Clinical observation and preliminary basic research show that the Lung-Clearing and Toxin-Expelling Decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia (COVID-19), and that it is fast-acting, highly effective and safe. Clinical symptoms improve very markedly; fever, for example, disappears within 3 days; after 6 days of treatment, the absorption of lung CT lesions is also very marked; the nucleic acid negative conversion rate is 100 percent, and the average time to negative conversion is about 10 days. Wangwei said that a group of experts has studied the material basis of the Lung-Clearing and Toxin-Expelling Decoction, and that the medicine exerts a regulatory effect on novel coronavirus pneumonia through multiple components acting at multiple stages. In particular, it can effectively inhibit the production of endotoxin and avoid or delay the onset of the inflammatory storm, which makes it an important means of preventing and treating novel coronavirus pneumonia in this outbreak. The average cost of treating a severe patient is about 150,000 yuan, whereas with this formula a 3-day course of treatment costs about 100 yuan and two courses cost about 200 yuan, which also greatly saves medical insurance costs."
The text is then processed. First, the case news text is divided into 7 sentences according to Chinese punctuation marks. During sentence segmentation, "。", "；", "？", "……", "！" and similar marks are treated as sentence delimiters. Word segmentation is then performed on the case text, with the words separated by enumeration commas (、). The news text after word segmentation is: "clinical, observation, and, preliminary, basic, research … … medical insurance, costs, also, greatly, saves".
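The preprocessing in Step 1 can be sketched roughly as follows. This is a minimal illustration rather than the patent's implementation: the delimiter set, the cleaning rule (drop exact duplicates and sentences with no word characters) and the function names are assumptions, and the default `segmenter=list` is only a character-level stand-in for a real Chinese word segmenter.

```python
import re

# Sentence delimiters named in the text; the exact set is an assumption.
_SENTENCE = re.compile(r"[^。；？！…]+(?:……|[。；？！])?")

def split_sentences(text):
    """Split text into sentences, keeping each closing delimiter."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def clean(sentences):
    """Drop duplicates (keeping first occurrences) and sentences with no word characters."""
    seen, kept = set(), []
    for s in sentences:
        if s in seen or not re.search(r"\w", s):
            continue
        seen.add(s)
        kept.append(s)
    return kept

def segment_text(sentences, segmenter=list):
    """Segment each sentence and join the tokens with the enumeration comma."""
    # `segmenter` is a hypothetical hook; `list` splits into single characters.
    return ["、".join(segmenter(s.rstrip("。；？！…"))) for s in sentences]

sentences = clean(split_sentences("今天下雨。今天下雨。明天晴？"))
print(sentences)
print(segment_text(sentences))
```

In practice the `segmenter` argument would be replaced by a dictionary-based or statistical word segmenter, so that tokens are words rather than single characters.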
Step 2: scan each word in the whole corpus to obtain a stroke n-gram dictionary S, let S(w) denote the stroke n-gram set of a word w, and calculate the similarity function between w and c.
Specifically, each word in the case news text is scanned to obtain a stroke n-gram dictionary S; the dictionary set S is as follows: … …. The dictionary set is converted into vector form according to the stroke-to-number correspondence: S(w) = [22214 … … 2441121222]. With w and c denoting the current word and the context word respectively, the similarity function between w and c is:
$$sim(w,c)=\sum_{q\in S(w)}\vec{q}\cdot\vec{c}\qquad(1)$$
where q is a stroke n-gram in the set S(w), and $\vec{q}$ and $\vec{c}$ are the vector representations of the stroke n-gram q and the context word c, respectively.
The stroke-to-number correspondence table is as follows:
[Figure RE-GDA0002786606720000061: stroke-to-number correspondence table]
For example, the strokes of "clinical" are encoded as [2221431425 … …] and the strokes of "observation" as [5425235431 … …]; the specific encoding steps are described in Shaosheng Cao, Wei Lu, Jun Zhou, Xiaolong Li. 2018. cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information. AI Department, Ant Financial Services Group. The sim value between these two words is calculated as 247. The strokes of "greatly" are encoded as [3325112341 … …] and the strokes of "save" as [1225223411 … …], and the sim value between these two words is calculated as 138.
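To make the stroke n-gram set S(w) and the similarity function concrete, here is a small sketch. The n-gram length range (`min_n`, `max_n`), the vector dimension and the random vectors are illustrative assumptions; only the leading stroke codes quoted above for "clinical" come from the text, and the function names are hypothetical.

```python
import random

def stroke_ngrams(strokes, min_n=3, max_n=5):
    """Return the set of contiguous stroke n-grams of a stroke-code string."""
    return {strokes[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(strokes) - n + 1)}

def sim(word_strokes, context_vec, ngram_vecs):
    """sim(w, c): sum over q in S(w) of the dot product of q's vector and c's vector."""
    return sum(
        sum(a * b for a, b in zip(ngram_vecs[q], context_vec))
        for q in stroke_ngrams(word_strokes)
    )

rng = random.Random(0)
strokes = "2221431425"  # leading stroke codes quoted for "clinical"
grams = stroke_ngrams(strokes)
# Randomly initialised vectors stand in for learned parameters.
ngram_vecs = {q: [rng.uniform(-1, 1) for _ in range(4)] for q in sorted(grams)}
context_vec = [rng.uniform(-1, 1) for _ in range(4)]
print(len(grams), sim(strokes, context_vec, ngram_vecs))
```

During training, the n-gram vectors and context vectors would be the parameters updated by the objective; here they are fixed only to show how the sum is formed.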
Step 3: the context word c is predicted from the current word w, with a Softmax function used for the probability modeling:
$$p(c\mid w)=\frac{\exp(sim(w,c))}{\sum_{c'\in V}\exp(sim(w,c'))}\qquad(2)$$
where c' ranges over the words in the vocabulary V.
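The Softmax step can be illustrated with a few lines of Python. The similarity values and the function name `context_probabilities` are made up for the example; subtracting the maximum before exponentiating is a common numerical-stability measure and not something stated in the source.

```python
import math

def context_probabilities(sims):
    """Softmax over sim(w, c') for every candidate context word c'."""
    peak = max(sims.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - peak) for c, s in sims.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Hypothetical similarity values between one current word and three candidates.
probs = context_probabilities({"observation": 2.0, "research": 1.0, "savings": -1.0})
print(probs)
```

Because the denominator runs over the whole vocabulary, evaluating this exactly is expensive for large vocabularies, which is the motivation for the negative sampling described next.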
An objective function defined over the whole corpus is then constructed and optimized with a standard gradient-based method. To simplify the computation during word embedding, a series of wrong words, i.e. negative samples, is drawn at random, so that the model repeatedly solves a binary classification problem; this improves the quality of the output embeddings and optimizes the model. The objective function is:
$$L=\sum_{w\in D}\sum_{c\in T(w)}\log\sigma(sim(w,c))+\lambda\,\mathbb{E}_{c'\sim P}\left[\log\sigma(-sim(w,c'))\right]\qquad(3)$$
where L is the objective function, λ is the number of negative samples, $\mathbb{E}_{c'\sim P}[\cdot]$ is the expectation term in which the selected negative sample c' follows the distribution P, and the activation function σ is the sigmoid function. T(w) is the set of context words of the current word w, and D is the set of all words in the corpus. After the learning process, the embedding of each word is output directly.
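The contribution of a single (w, c) pair to objective (3) can be sketched as below. The expectation is approximated by averaging over a handful of sampled negatives, which is an assumption about how it would be estimated; the similarity values and function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pair_objective(sim_pos, sim_negs, lam):
    """One pair's term: log s(sim(w,c)) + lam * mean over negatives of log s(-sim(w,c'))."""
    expectation = sum(log_sigmoid(-s) for s in sim_negs) / len(sim_negs)
    return log_sigmoid(sim_pos) + lam * expectation

# A well-separated pair scores higher (closer to 0) than a poorly separated one.
good = pair_objective(4.0, [-3.0, -2.5], lam=2)
bad = pair_objective(-1.0, [2.0, 1.5], lam=2)
print(good, bad)
```

Maximising the sum of these terms over the corpus pushes sim(w, c) up for observed context words and down for sampled negatives.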
After learning, the stroke-information-based embedding of each word is obtained. For example, the embedding of the word "clinical" is [0.683, 0.352, 0.289 … … 0.369, 0.479], and the embedding of the word "observation" is [0.257, 0.326, 0.814 … … 0.423, 0.125].
Step 4: according to the stroke-fused embedding of each word, each sentence is expressed as an embedding vector. In this embodiment, the first sentence of the news text, "Clinical observation and preliminary basic research show that the Lung-Clearing and Toxin-Expelling Decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia (COVID-19), and that it is fast-acting, highly effective and safe", is expressed as [0.683, 0.352, 0.289 … … 0.148, 0.452].
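The text does not say how word embeddings are combined into a sentence vector. One common choice, shown here purely as an assumption, is to average the embeddings of the words in the sentence; the function name and the toy two-dimensional embeddings are hypothetical.

```python
def sentence_vector(words, embeddings):
    """Average the embeddings of the words that have one; None if no word is known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy embeddings, not values from the patent.
toy = {"clinical": [1.0, 3.0], "observation": [3.0, 5.0]}
print(sentence_vector(["clinical", "observation", "unknown"], toy))
```

Other pooling schemes (weighted averages, for instance) would fit the same interface; the averaging here is only a placeholder for whatever combination the embodiment uses.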
Step 5: the sentences in the case are represented as a directed graph, and the TextRank algorithm is iterated over it. The iteration formula is:
$$WS(V_i)=(1-d)+d\sum_{V_j\in In(V_i)}\frac{w_{ji}}{\sum_{V_k\in Out(V_j)}w_{jk}}\,WS(V_j)\qquad(4)$$
where $w_{ji}$ is the weight of the edge between nodes $V_j$ and $V_i$, given by the similarity between the two nodes, in this case [0.56, 0.25, 0.36, 0.38 … … 0.12]. $In(V_i)$ is the set of all nodes pointing to node i, and $Out(V_i)$ is the set of all nodes that node i points to; in this case, $In(V_i)$ and $Out(V_i)$ are drawn from the 9 sentences of the text. d is the damping coefficient, typically 0.85. The scores of all nodes are initialized to 1 before the algorithm starts, and iteration stops when the error of the nodes falls below 0.0001.
Further, the TextRank algorithm is derived from the traditional web page ranking algorithm PageRank, in which pages that reference each other are nodes pointing to each other. In sentence ranking, a similarity can be computed for any two sentences, so in this application every sentence is both a node that is pointed to and a node that emits edges, and the weight of each edge is the value of the similarity.
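The iteration described above can be sketched in plain Python. The damping coefficient 0.85, the initial score 1 and the 0.0001 stopping threshold follow the text; using cosine similarity for the edge weights, clamping negative similarities to zero and the `max_iter` cap are assumptions, as are the function names.

```python
import math

def cosine(u, v):
    """Cosine similarity, clamped at zero so edge weights stay non-negative."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norm) if norm else 0.0

def textrank(vectors, d=0.85, tol=1e-4, max_iter=200):
    """Score sentences on a fully connected graph weighted by pairwise similarity."""
    n = len(vectors)
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]  # total outgoing weight of each node
    scores = [1.0] * n
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

# Three toy sentence vectors: the first two are similar, the third is an outlier.
scores = textrank([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
print(scores)
```

In this toy graph the outlier sentence receives the lowest score because it is only weakly connected to the other two.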
Step 6: the nine sentences of the news text in the case are ranked by score. Their scores are, respectively: 0.8616, 0.0956, 0.3301, 0.1258, 0.5611, 0.1251, 0.4425, 0.7422, 0.5211. The two sentences with the highest scores are: (1) Clinical observation and preliminary basic research show that the Lung-Clearing and Toxin-Expelling Decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia (COVID-19), and that it is fast-acting, highly effective and safe. (2) In particular, it can effectively inhibit the production of endotoxin and avoid or delay the onset of the inflammatory storm, which makes it an important means of preventing and treating novel coronavirus pneumonia in this outbreak. The text abstract generated for this case is therefore: "Clinical observation and preliminary basic research show that the Lung-Clearing and Toxin-Expelling Decoction is a general formula suitable for mild, ordinary and severe novel coronavirus pneumonia (COVID-19), and that it is fast-acting, highly effective and safe. In particular, it can effectively inhibit the production of endotoxin and avoid or delay the onset of the inflammatory storm, which makes it an important means of preventing and treating novel coronavirus pneumonia in this outbreak."
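Selecting the abstract from the scores can be sketched as follows, using the nine scores listed above. Returning the chosen sentences in their original document order is an assumption, and the placeholder sentence strings and the function name are hypothetical.

```python
def select_summary(sentences, scores, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

scores = [0.8616, 0.0956, 0.3301, 0.1258, 0.5611, 0.1251, 0.4425, 0.7422, 0.5211]
sentences = [f"sentence {i + 1}" for i in range(len(scores))]
print(select_summary(sentences, scores))  # → ['sentence 1', 'sentence 8']
```

With these scores the first and eighth entries are selected, since 0.8616 and 0.7422 are the two largest values.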
To illustrate the effect of the invention, in this embodiment the abstract generated in the above steps is evaluated with the ROUGE method: the generated abstract is compared with a reference abstract to demonstrate the effectiveness of the method. The abstract generated by the system is evaluated with ROUGE-1, ROUGE-2 and ROUGE-L, each reported as precision p, recall r and the comprehensive evaluation index F1 (denoted F). Precision p is the proportion of the generated abstract made up of words that co-occur in the generated abstract and the reference abstract; recall r is the proportion of the reference abstract made up of those co-occurring words; F1 combines precision and recall. The ROUGE evaluation is carried out with the open-source tool ROUGE-1.5.5. The reference abstract of the case is: "Observation and experiments show that the Lung-Clearing and Toxin-Expelling Decoction is an important means of preventing and treating novel coronavirus pneumonia." As shown in FIG. 2, the calculated results for the generated abstract are: Rouge1-p: 0.227, Rouge1-r: 0.833, Rouge1-F: 0.357; Rouge2-p: 0.139, Rouge2-r: 0.545, Rouge2-F: 0.222; RougeL-p: 0.270, RougeL-r: 0.833, RougeL-F: 0.408. These results are superior to those of the conventional TextRank algorithm and of TextRank variants that use other word vector representations in terms of precision, recall and the comprehensive evaluation index F1.
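For orientation, a simplified ROUGE-N computation over token lists is sketched below. It is not the ROUGE-1.5.5 tool used in the embodiment (no stemming, stop-word handling or ROUGE-L), and the function names are hypothetical; the `f1` helper is included to show that the reported F values are the harmonic mean of the reported p and r.

```python
from collections import Counter

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) from clipped n-gram overlap of two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped co-occurrence count
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    return p, r, f1(p, r)

# A long candidate against a short reference: low precision, high recall.
print(rouge_n(["a", "b", "c", "d"], ["a", "b"]))
# The reported F values follow from the reported p and r.
print(round(f1(0.227, 0.833), 3), round(f1(0.139, 0.545), 3), round(f1(0.270, 0.833), 3))
```

The pattern in the toy call mirrors the case above: a generated abstract much longer than the reference gives high recall and low precision.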
The invention also provides a device for generating a news text abstract integrated with Chinese stroke information, which comprises: an acquisition unit, a preprocessing unit, a stroke dictionary generating unit, an embedded vector generating unit, an output unit and an evaluation unit.
The acquisition unit is used for acquiring news text data, the news text data comprising news titles and body text.
The preprocessing unit is used for preprocessing the news text data to obtain a word-segmented text of the news text data. The preprocessing unit includes: a sentence dividing module for dividing the news text data into a plurality of sentences according to Chinese punctuation marks; a cleaning module for cleaning each sentence in turn, deleting repeated data and invalid data; and a word segmentation module for performing word segmentation on the cleaned sentences and separating the words with enumeration commas (、), thereby obtaining the word-segmented text of the news text data.
The stroke dictionary generating unit is used for scanning each word in the word-segmented text to obtain a stroke dictionary of the word-segmented text, and converting the stroke dictionary into vector form.
The embedded vector generating unit is used for generating a stroke-information-based embedding of each word from the vector-form stroke dictionary, and representing each sentence as an embedding vector according to the stroke-information-based embeddings of its words. Further, the embedded vector generating unit includes: a similarity calculation module for calculating the similarity between each word in the word-segmented text and its context words based on the vector-form stroke dictionary; and an embedding generation module for building, based on the similarity, a probability model of the context word given the current word, and generating the stroke-information-based embedding of the word through the model.
The output unit is used for representing the embedding vectors as a directed graph, iterating over the directed graph with the TextRank algorithm to obtain a score for each sentence, and generating the abstract output according to the sentence scores.
The evaluation unit is used for evaluating the generated abstract.
Since the apparatus embodiment of the present invention corresponds to the method embodiment above, it is described only briefly; for the related details, refer to the description of the method embodiment, which is not repeated here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for generating news text abstract merged with Chinese stroke information is characterized by comprising the following steps:
acquiring news text data, wherein the news text data comprises news titles and texts;
preprocessing the news text data to obtain word segmentation texts of the news text data;
scanning each word in the word segmentation text to obtain a stroke dictionary of the word segmentation text, and converting the stroke dictionary into a vector form;
generating embedding of each word based on stroke information based on the stroke dictionary in a vector form, and representing the sentence into an embedded vector according to the embedding of each word based on the stroke information;
and representing the embedded vector as a directed graph, iterating the directed graph by using a TextRank algorithm to obtain the score of each sentence, and generating a summary output according to the score of each sentence.
2. The method as claimed in claim 1, wherein the step of preprocessing the news text data to obtain the word segmentation text of the news text data comprises:
dividing the news text data into a plurality of sentences according to Chinese punctuation marks;
sequentially carrying out data cleaning on each sentence, and deleting repeated data and invalid data;
and performing word segmentation operation on the cleaned sentences, and separating words by using pause signs, thereby obtaining word segmentation texts of the news text data.
3. The method as claimed in claim 1, wherein generating the stroke-information-based embedding of each word from the stroke dictionary in vector form comprises:
calculating the similarity between each word in the word segmentation text and its context words based on the stroke dictionary in vector form;
and, based on the similarity, modeling the probability of the context words given the current word, and generating the stroke-information-based embedding of each word through the model.
4. An apparatus for generating a news text abstract integrated with Chinese stroke information, comprising:
an acquisition unit, used for acquiring news text data, wherein the news text data comprises news titles and texts;
a preprocessing unit, used for preprocessing the news text data to obtain the word segmentation text of the news text data;
a stroke dictionary generating unit, used for scanning each word in the word segmentation text to obtain a stroke dictionary of the word segmentation text, and converting the stroke dictionary into a vector form;
an embedded vector generating unit, used for generating, from the stroke dictionary in vector form, a stroke-information-based embedding of each word, and representing each sentence as an embedded vector according to the stroke-information-based embeddings of its words;
and an output unit, used for representing the embedded vectors as a directed graph, iterating over the directed graph by using the TextRank algorithm to obtain a score for each sentence, and generating an abstract output according to the score of each sentence.
5. The apparatus for generating a news text abstract integrated with Chinese stroke information as claimed in claim 4, wherein the preprocessing unit comprises:
a sentence dividing module, used for dividing the news text data into a plurality of sentences according to Chinese punctuation marks;
a cleaning module, used for sequentially carrying out data cleaning on each sentence and deleting repeated data and invalid data;
and a word segmentation module, used for performing word segmentation operation on the cleaned sentences and separating words by using pause signs, thereby obtaining the word segmentation text of the news text data.
6. The apparatus as claimed in claim 4, wherein the embedded vector generating unit comprises:
a similarity calculation module, used for calculating the similarity between each word in the word segmentation text and its context words based on the stroke dictionary in vector form;
and an embedding generation module, used for modeling, based on the similarity, the probability of the context words given the current word, and generating the stroke-information-based embedding of each word through the model.
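
The steps recited in claims 1 to 3 (and mirrored by the units of claims 4 to 6) can be illustrated with a minimal Python sketch. It is not the applicant's implementation and rests on several assumptions: `STROKES` is a hypothetical toy stroke table with placeholder codes; character-level splitting stands in for a real word segmenter; an untrained bag of stroke n-gram counts stands in for the learned stroke-based embedding of claim 3; and cosine similarity with a damping factor of 0.85 is assumed for the TextRank edge weights. All function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical stroke dictionary: character -> string of stroke-class digits.
# The codes are illustrative placeholders, not an authoritative stroke table.
STROKES = {"木": "1234", "林": "12341234", "森": "123412341234",
           "人": "34", "大": "134", "火": "4334"}


def split_sentences(text):
    """Split on Chinese sentence-ending punctuation; drop empty and repeated sentences."""
    seen, sentences = set(), []
    for s in re.split(r"[。！？；]", text):
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            sentences.append(s)
    return sentences


def stroke_ngrams(word, strokes, n_min=3, n_max=5):
    """Count the stroke n-grams of one word; characters without an entry are skipped."""
    seq = "".join(strokes.get(ch, "") for ch in word)
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            grams[seq[i:i + n]] += 1
    return grams


def sentence_vector(words, strokes):
    """Sum the stroke n-gram counts of all words (stand-in for a learned embedding)."""
    vec = Counter()
    for w in words:
        vec.update(stroke_ngrams(w, strokes))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def textrank(weights, d=0.85, iters=50):
    """Weighted TextRank: each node passes its score along its outgoing edge weights."""
    n = len(weights)
    scores = [1.0] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iters):
        scores = [(1 - d) + d * sum(weights[j][i] / out_sum[j] * scores[j]
                                    for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    return scores


def summarize(text, strokes, k=1, segment=list):
    """Return the k highest-scoring sentences in their original order.

    `segment` defaults to character-level splitting, which is an assumption;
    a real system would plug in a proper Chinese word segmenter here.
    """
    sents = split_sentences(text)
    vecs = [sentence_vector(segment(s), strokes) for s in sents]
    n = len(sents)
    weights = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    scores = textrank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sents[i] for i in sorted(top)]


if __name__ == "__main__":
    print(summarize("森林大火。森林。大火。人。森林。", STROKES, k=1))
```

In this sketch the graph is symmetric because cosine similarity is symmetric; a directed variant would only need asymmetric entries in `weights`. Replacing `sentence_vector` with embeddings trained by modeling context words given the current word, as claim 3 recites, would leave the ranking code unchanged.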

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010970430.1A CN112115256A (en) 2020-09-15 2020-09-15 Method and device for generating news text abstract integrated with Chinese stroke information

Publications (1)

Publication Number Publication Date
CN112115256A (en) 2020-12-22

Family

ID=73802034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010970430.1A Pending CN112115256A (en) 2020-09-15 2020-09-15 Method and device for generating news text abstract integrated with Chinese stroke information

Country Status (1)

Country Link
CN (1) CN112115256A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033066A (en) * 2018-06-04 2018-12-18 浪潮软件股份有限公司 A kind of abstract forming method and device
CN110059155A (en) * 2018-12-18 2019-07-26 阿里巴巴集团控股有限公司 The calculating of text similarity, intelligent customer service system implementation method and device
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summarization generation model that extraction-type is combined with production
CN110532554A (en) * 2019-08-26 2019-12-03 南京信息职业技术学院 A kind of Chinese abstraction generating method, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAOSHENG CAO et al.: "cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information", The Thirty-Second AAAI Conference on Artificial Intelligence *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722482A (en) * 2021-08-25 2021-11-30 昆明理工大学 News comment opinion sentence identification method
CN117195877A (en) * 2023-11-06 2023-12-08 中南大学 Word vector generation method, system and equipment for electronic medical record and storage medium
CN117195877B (en) * 2023-11-06 2024-01-30 中南大学 Word vector generation method, system and equipment for electronic medical record and storage medium

Similar Documents

Publication Publication Date Title
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN109190117B (en) Short text semantic similarity calculation method based on word vector
CN110378409B (en) Chinese-Yue news document abstract generation method based on element association attention mechanism
CN110413768B (en) Automatic generation method of article titles
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN107895000B (en) Cross-domain semantic information retrieval method based on convolutional neural network
JP5216063B2 (en) Method and apparatus for determining categories of unregistered words
US11645447B2 (en) Encoding textual information for text analysis
Moravvej et al. Biomedical text summarization using conditional generative adversarial network (CGAN)
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
Ye et al. Improving cross-domain Chinese word segmentation with word embeddings
CN113065349A (en) Named entity recognition method based on conditional random field
Seeha et al. ThaiLMCut: Unsupervised pretraining for Thai word segmentation
CN110929022A (en) Text abstract generation method and system
CN110609997B (en) Method and device for generating abstract of text
CN112182159B (en) Personalized search type dialogue method and system based on semantic representation
Banisakher et al. Improving the identification of the discourse function of news article paragraphs
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
Thant et al. Preprocessing of YouTube Myanmar music comments for sentiment analysis
Zouidine et al. A comparative study of pre-trained word embeddings for Arabic sentiment analysis
CN111881678A (en) Domain word discovery method based on unsupervised learning
Ramani et al. An Explorative Study on Extractive Text Summarization through k-means, LSA, and TextRank
Jayasudha et al. Effective Model for Improving Symptoms-Based Disease Prediction by BiMM-BERT Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination