CN114722836A - Abstract generation method, apparatus, device and medium - Google Patents


Info

Publication number
CN114722836A
Authority
CN
China
Prior art keywords
text
target
semantic
texts
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210516005.4A
Other languages
Chinese (zh)
Other versions
CN114722836B (en)
Inventor
赵菲菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd
Priority to CN202210516005.4A
Publication of CN114722836A
Application granted
Publication of CN114722836B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models

Abstract

The disclosure relates to a summary generation method, apparatus, device, and medium. After a plurality of target texts are obtained, text features of each target text are extracted from a plurality of perspective types related to semantic distance. Because the text features used for topic clustering cover multiple semantic-distance perspective types, the information referred to during clustering is rich: the target texts can be clustered comprehensively from multiple perspectives, the accuracy of topic clustering is improved, and the extracted topic summaries are more accurate and effective.

Description

Abstract generation method, device, equipment and medium
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a method, an apparatus, a device, and a medium for generating an abstract.
Background
Topic extraction is the process of extracting specific event information from a large amount of information. Because it offers high data fidelity, good timeliness, and high coverage, topic extraction is widely applied to news extraction, policy making, public-appeal investigation, and the like.
In the related art, massive information is usually clustered into different topics, and a summary of the information under each clustered topic is then extracted, thereby implementing topic extraction. However, if the information features referred to during topic clustering are relatively single, the clustering result is inaccurate, the extracted summary is in turn inaccurate, and the topic extraction result is therefore inaccurate.
Disclosure of Invention
In order to solve the technical problem, the present disclosure provides a method, an apparatus, a device, and a medium for generating an abstract.
In a first aspect, an embodiment of the present disclosure provides a summary generation method, including:
acquiring a plurality of target texts;
extracting, for each target text, text features of the target text, wherein the text features include features of a plurality of perspective types related to semantic distance;
performing topic clustering on the plurality of target texts based on the text features of each target text to obtain at least one first text set;
and extracting the topic abstract of each first text set to obtain the topic abstract corresponding to each first text set.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a summary, including:
the acquisition module is used for acquiring a plurality of target texts;
the extraction module is used for extracting, for each target text, text features of the target text, wherein the text features include features of a plurality of perspective types related to semantic distance;
the clustering module is used for performing topic clustering on the plurality of target texts based on the text features of each target text to obtain at least one first text set;
and the summary extraction module is used for extracting the topic summary of each first text set to obtain the topic summary corresponding to each first text set.
In a third aspect, an embodiment of the present disclosure provides an abstract generating device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the summary generation method according to the first aspect.
In a fourth aspect, the present disclosure provides a medium on which a computer program is stored, where the computer program, when executed by a processor, implements the summary generation method according to the first aspect.
In a fifth aspect, embodiments of the present disclosure further provide a computer program product including a computer program or instructions which, when executed by a processor, implement the above summary generation method.
The invention relates to a summary generation method, apparatus, device, and medium. After a plurality of target texts are obtained, text features of each target text can be extracted from a plurality of perspectives related to semantic distance; topic clustering is performed on the target texts based on those text features to obtain a plurality of first text sets; and topic summary extraction is then performed on each first text set separately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can evidently derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a summary generation method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another summary generation method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a semantic distance calculation process according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another summary generation method according to an embodiment of the present disclosure;
fig. 5 is an implementation schematic diagram of another summary generation method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a summary generation apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a summary generation device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present disclosure; however, the present disclosure may be practiced in ways other than those described herein. It should be understood that the embodiments disclosed in this specification are only some, not all, of the embodiments of the present disclosure.
Generally, each text contains rich semantic information that can reflect certain topics, but the original text rarely conveys the subject matter under discussion intuitively. A topic summary extracts the key information in the text and removes irrelevant, useless data, which enhances the readability of the topic and helps users quickly and conveniently perform policy making, public-appeal investigation, and the like based on the summary. Extracting topic summaries is therefore particularly important.
In the related art, topics must be clustered before topic summaries are extracted, and the summaries are then extracted from the clustered topics. However, the information features referred to during topic clustering in the related art are relatively single, so the clustering result is inaccurate, and the topic summaries extracted from the clustered topics are in turn inaccurate.
Taking text from the internet as an example: because internet text is multi-source, heterogeneous, and massive, performing topic clustering on it from only a single text feature considers the characteristics of the text incompletely, so the topic clustering result is inaccurate, and the topic summaries subsequently extracted from the clustered topics are in turn inaccurate.
In view of the foregoing problems, embodiments of the present disclosure provide a summary generation method, apparatus, device, and medium. The summary generation method is described first with reference to specific embodiments.
In the embodiments of the present disclosure, the summary generation method may be performed by a summary generation device, which may be an electronic device or a server. Electronic devices include, but are not limited to, smartphones, handheld computers, tablet computers, wearable devices with display screens, desktop computers, notebook computers, all-in-one machines, smart home devices, and the like. The server may be a standalone server or a cluster of servers, and may be deployed locally or in the cloud.
Fig. 1 is a schematic flowchart of a summary generation method according to an embodiment of the present disclosure.
As shown in fig. 1, the summary generation method may specifically include the following steps:
and S110, acquiring a plurality of target texts.
In the embodiments of the present disclosure, a target text may be a sentence, a paragraph, or an article that participates in topic extraction.
Before performing topic extraction, the summary generation device first needs to obtain the texts participating in topic extraction. The texts may come from internet information, such as the text contained in news, microblogs, reports, and papers, or may be text stored locally on the summary generation device; this disclosure does not limit the source.
In some embodiments, after the summary generation device acquires the texts participating in the topic extraction, each acquired text may be directly used as a target text.
In other embodiments, after the summary generation device acquires the text participating in the topic extraction, the acquired text may be preprocessed, and the preprocessed text is used as the target text.
Specifically, the preprocessing may include data cleaning, word segmentation, stop-word filtering, and other processing. Data cleaning may include filtering emoticons, deduplicating data, filtering web page links, filtering special symbols from hypertext markup language, and converting between simplified and traditional Chinese. Word segmentation may include segmenting the text using a Chinese lexical analysis tool. Stop-word filtering may include filtering stop words using a stop-word table.
For example, the summary generation device may first perform data cleaning on the obtained text to obtain cleaned text, then perform word segmentation on the cleaned text to obtain a segmentation result, and finally perform stop-word filtering on the segmentation result to obtain the preprocessed text.
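The cleaning, segmentation, and stop-word steps can be sketched as follows. This is an illustrative stand-in only: the regex cleaning rules, the whitespace tokenizer (standing in for a Chinese segmenter such as jieba), and the tiny stop-word table are all assumptions for demonstration.

```python
import re

# Placeholder stop-word table; a real system would load a full stop-word list.
STOPWORDS = {"the", "a", "of", "in"}

def preprocess(text):
    # Data cleaning: drop web page links and special-symbol/emoticon residue.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Word segmentation: whitespace split stands in for a Chinese tokenizer.
    tokens = text.lower().split()
    # Stop-word filtering.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Xiaoming met Xiaohong in the park! https://example.com"))
# → ['xiaoming', 'met', 'xiaohong', 'park']
```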
S120, extracting, for each target text, text features of the target text, wherein the text features include features of a plurality of perspective types related to semantic distance.
In the embodiments of the present disclosure, after obtaining a plurality of target texts, the summary generation device may extract text features of each target text from a plurality of perspective types related to semantic distance.
Where the semantic distance may be the distance in semantic space between every two target texts.
Further, a perspective type is a feature type that describes each target text from a different angle, or in a different manner, relative to semantic distance.
The larger the semantic distance between two target texts, the larger the difference between their text features, and the less likely the two texts describe the same topic. Conversely, the smaller the semantic distance, the smaller the difference between the text features, and the more likely the two texts describe the same topic. Therefore, after the summary generation device extracts text features from different perspective types related to semantic distance, the text features of every two target texts can be used to measure the semantic distance between them, and thus to determine whether the two texts describe the same topic.
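One common concrete choice for such a semantic distance, sketched under the assumption that each text has already been represented as a numeric feature vector, is cosine distance:

```python
import math

def cosine_distance(u, v):
    # Semantic distance between two text feature vectors: values near 0 mean
    # very similar features (likely the same topic); values near 1 mean
    # unrelated features (likely different topics).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```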
S130, performing topic clustering on the plurality of target texts based on the text features of each target text to obtain at least one first text set.
Clustering may be the process of dividing a collection of physical or abstract objects into classes composed of similar objects. The clusters generated by clustering are a set of objects, the objects in the same cluster are similar to each other, and the objects in different clusters are different from each other.
In this disclosure, the clustered objects may be the target texts. Because target texts containing the same or similar text features can be considered to describe the same topic, after extracting the text features of each target text from multiple perspective types, the summary generation device may compare the features of the target texts across those perspective types: target texts with the same or similar text features are clustered into the same cluster, i.e. the same first text set, and target texts with different or dissimilar text features are divided into different clusters, i.e. different first text sets.
Furthermore, after the summary generation device performs topic clustering on the plurality of target texts, at least one first text set is obtained; each first text set may correspond to one topic, and the target texts within a first text set have the same or similar text features.
It will be appreciated that, in some embodiments, the topic described by a certain target text may differ from the topics described by all other target texts, in which case the first text set containing that target text includes only that one text.
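The patent does not fix a particular clustering algorithm; as one minimal illustration of the behavior described above, a greedy single-pass scheme that groups feature vectors by semantic distance (threshold and distance function are assumptions):

```python
import math

def semantic_distance(u, v):
    # Cosine distance between two text feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def cluster(vectors, threshold=0.5):
    # Greedy single-pass grouping: each text joins the first cluster whose
    # centroid is within `threshold`, otherwise it starts a new cluster,
    # so an outlier text ends up alone in its own first text set.
    centroids, clusters = [], []
    for idx, vec in enumerate(vectors):
        for c, centroid in enumerate(centroids):
            if semantic_distance(vec, centroid) <= threshold:
                clusters[c].append(idx)
                members = [vectors[i] for i in clusters[c]]
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
                break
        else:
            clusters.append([idx])
            centroids.append(list(vec))
    return clusters

print(cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.3))  # → [[0, 1], [2]]
```

The first two vectors are near-parallel and land in one first text set; the third is nearly orthogonal and forms a singleton set, matching the single-text case noted above.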
S140, extracting the topic summary of each first text set to obtain the topic summary corresponding to each first text set.
In the disclosed embodiments, a topic summary is an outline of the content discussed under a topic. Because the original target texts rarely convey the subject matter of a topic intuitively, extracting a topic summary distills the key information in the text data and enhances the readability of the topic. Meanwhile, it is difficult for a user to make policy, investigate public appeals, and so on from a single target text; extracting a topic summary over multiple target texts with the same topic integrates their content outlines and provides a basis for such tasks. Therefore, after the summary generation device obtains the first text sets, it may further extract the topic summary of each first text set.
Specifically, the summary generation device may perform topic summary extraction using algorithms such as TextRank (a graph-based text ranking algorithm) or a centroid-based algorithm (Centroid). TextRank is described below as an example.
In some embodiments, the summary generation device may extract the keywords contained in each first text set using TextRank; then evaluate the keywords, for example in terms of keyword length, keyword position, and whether the keyword appears in a title; assign each keyword a weight based on its evaluation to obtain its importance; sort the keywords by importance in descending order and mark the most important ones; and, if marked keywords form adjacent phrases, merge them into multi-word keywords and use them as the topic summary.
In other embodiments, the summary generation device may instead extract the key phrases contained in each first text set using TextRank; then evaluate the key phrases, for example in terms of phrase length, phrase position, and whether the phrase appears in a title; assign each key phrase a weight based on its evaluation to obtain its importance; sort the key phrases by importance in descending order and mark the most important ones; and, if marked key phrases are adjacent, merge them into multi-word key phrases and use them as the topic summary.
In still other embodiments, the summary generation device may extract both keywords and key phrases from each first text set using TextRank; then evaluate the keywords and key phrases, assign each a weight to obtain its importance, sort by importance in descending order, and mark the most important ones; and, if marked keywords and key phrases form adjacent phrases, merge them into multi-word key phrases and use them as the topic summary.
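A compact sketch of the TextRank idea behind these embodiments: build a word co-occurrence graph and score words with PageRank-style propagation. The window size, damping factor, and iteration count are conventional defaults, and the weighting and phrase-merging refinements described above are omitted.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=50, top_k=3):
    # Build an undirected co-occurrence graph: words within `window`
    # positions of each other are linked.
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    # PageRank-style iteration: a word is important if important words
    # co-occur with it.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]

# The word with the most co-occurrence links ranks first.
print(textrank_keywords(
    ["apple", "banana", "apple", "cherry", "apple", "durian"],
    window=1, top_k=1))  # → ['apple']
```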
According to the embodiments of the present disclosure, after the target texts are obtained, text features of each target text can be extracted from a plurality of perspectives related to semantic distance. Because the text features used for topic clustering cover multiple semantic-distance perspective types, the information referred to during clustering is rich: the target texts can be clustered comprehensively from multiple perspectives, the accuracy of topic clustering is improved, and the extracted topic summaries are more accurate and effective.
Fig. 2 is a schematic flowchart of another summary generation method according to an embodiment of the present disclosure, and fig. 3 is a schematic diagram of a semantic distance calculation process according to an embodiment of the present disclosure. The summary generation method shown in fig. 2 is described below with reference to fig. 3.
As shown in fig. 2, the summary generation method may specifically include the following steps:
s210, acquiring a plurality of target texts.
The step is the same as S110, and is not described herein again.
S220, extracting, for each target text, text features of the target text, wherein the text features include features of a plurality of perspective types related to semantic distance.
In the embodiments of the disclosure, after acquiring a plurality of target texts, the summary generation device extracts text features of each target text from a plurality of perspective types.
Optionally, the summary generation device may extract the text features of each target text from at least two of the following perspective types: an entity perspective type, an entity semantic perspective type, and a textual word perspective type. The text features may therefore include at least two of entity features, entity semantic features, and textual word features, which are described below in turn:
A. Entity features.
The entity characteristics may include the respective entities to which the target text relates.
Specifically, entity features are structured attributes within unstructured text data, and entities are the contents with specific meanings among those attributes. A content with a specific meaning may be a single word or several words.
The entities can be classified into different types such as names of people, names of places, names of organizations, time and the like.
For example, in the target text "Xiaoming met Xiaohong in the park yesterday", "Xiaoming", "yesterday", "park", and "Xiaohong" are all entities. The entity type of "Xiaoming" and "Xiaohong" is person name; the entity type of "park" is place name; the entity type of "yesterday" is time.
Optionally, extracting the entity features of each target text may be: extracting the entities contained in each target text through named entity recognition, and taking the extracted entities as the entity features of that target text.
Named entity recognition is the process of recognizing entities with specific meanings in the target text; in terms of part of speech, most entities are nouns. The basic principle of named entity recognition is therefore to perform Chinese word segmentation and part-of-speech tagging on the text contained in the target text, and then obtain entity labels from the part-of-speech-tagged text, so that the entities contained in the target text can be extracted.
Optionally, Chinese word segmentation and part-of-speech tagging can be performed with the popular open-source library jieba; entities in the target text can then be recognized with a BiLSTM-CRF model, a combination of a Bi-directional Long Short-Term Memory (BiLSTM) network and a Conditional Random Field (CRF) model.
Specifically, the jieba open-source library can segment the target text into Chinese words and tag the part of speech of each word. Continuing with the target text "Xiaoming met Xiaohong in the park yesterday" as an example, jieba segments it into the words "Xiaoming", "yesterday", "park", "met", "Xiaohong", and then tags each word's part of speech: for example, "Xiaoming" is tagged as a noun, "yesterday" as a noun, "met" as a verb, and "Xiaohong" as a noun.
The BiLSTM-CRF model extracts the relevant entities from the text. Its basic structure consists of a representation layer, a BiLSTM layer, and a CRF layer. The representation layer represents each part-of-speech-tagged word in the target text as a word vector; the BiLSTM layer takes the word vectors as input and outputs, for each word, a score for every entity label; and the CRF layer takes the matrix of per-word entity-label scores output by the BiLSTM layer together with a transition probability matrix as the parameters of a CRF model, and outputs the entity labeling result.
It will be appreciated that, because named entity recognition can obtain entities including person names, place names, organization names, time, and so on, extracting the entities in each target text realizes feature extraction at the word level.
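Training a BiLSTM-CRF model is beyond a short example; the toy tagger below only mirrors the input/output shape of named entity recognition using a hand-made gazetteer. All entries and type names are illustrative assumptions, not part of the patent's method.

```python
# Hand-made gazetteer standing in for a trained BiLSTM-CRF tagger.
GAZETTEER = {
    "Xiaoming": "PERSON",
    "Xiaohong": "PERSON",
    "park": "LOCATION",
    "yesterday": "TIME",
}

def extract_entities(tokens):
    # Return (entity, type) pairs in token order, mirroring the output
    # format of named entity recognition.
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

print(extract_entities(["Xiaoming", "met", "Xiaohong", "in", "the", "park", "yesterday"]))
```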
B. Entity semantic features.
The entity semantic features may include normalized frequency vectors corresponding to semantic roles of respective entities to which the target text relates.
In the disclosed embodiments, a semantic role is the role an argument plays in the event referred to by a predicate verb; the semantic role of an entity is thus the role that entity plays in that event. Because different entities play different roles in an event, the semantic roles of the entities reflect the logical association, within the event, among the entities contained in the target text.
The semantic roles of entities can be divided into types such as agent, party, experiencer, co-agent, patient, result, source, comparison, possession, and the like.
Continuing with the target text "Xiaoming met Xiaohong in the park yesterday" as an example: the whole sentence is an event; "Xiaoming", "Xiaohong", "yesterday", and "park" are entities; and "met" is the predicate verb. The semantic role of "Xiaoming" in the event is the agent, the semantic role of "Xiaohong" is the patient, "yesterday" is the time at which the event occurred, and "park" is the place where it occurred. The entity semantic roles thus reflect that the logical relationship between the two entities "Xiaoming" and "Xiaohong" in the target text is "met in the park yesterday".
Optionally, extracting the entity semantic features of each target text may be: performing semantic role labeling on each entity involved in the target text to obtain the semantic role of each entity; counting how often each entity's semantic role appears in the target text to obtain the frequency corresponding to each semantic role; generating, from those frequencies, a normalized frequency vector corresponding to the semantic roles of the entities; and taking that normalized frequency vector as the entity semantic feature of the target text.
Specifically, in the embodiments of the present disclosure, after the summary generation device labels the semantic role of each entity in the target text, it can count the frequency with which each entity's semantic role appears in the target text, and generate, from the frequency corresponding to each semantic role, the normalized frequency vector corresponding to the semantic roles of the entities.
Labeling the semantic roles of the entities involved in the target text means labeling the semantic role corresponding to each entity contained in the text. In some embodiments, entities may be labeled with semantic roles using the pyltp tool. pyltp is a Python library for neural-network-based natural language processing tasks and supports part-of-speech tagging, semantic role labeling, and the like.
After obtaining the semantic roles of the entities, the summary generation device can count how often each semantic role appears in the target text.
In some embodiments, a user can customize which semantic roles are counted. For example, only 5 semantic roles (such as agent, party, patient, experiencer, and theme) and the frequency with which each appears in the target text may be counted; or 20 semantic roles (agent, party, patient, experiencer, source, comparison, possession, and so on) and their frequencies may be counted. The embodiments of the disclosure are not limited in this respect.
After counting each entity's semantic role and the frequency with which it appears in the target text, the summary generation device constructs, for each target text, a semantic role vector whose dimensionality equals the total number of counted semantic roles, and computes the normalized frequency of each semantic role as the ratio of that role's frequency to the total frequency of all semantic roles in the target text.
That is, the normalized frequency vector is obtained by computing, for each semantic role, the ratio of its frequency of occurrence in the target text to the frequency of occurrence of all semantic roles, and then vectorizing these normalized frequencies. Each semantic role is one dimension of the normalized frequency vector, and its normalized frequency is the corresponding component.
Illustratively, suppose the user counts only 5 semantic roles: agent, party, experiencer, patient, and theme. Taking the target text "Xiaoming met Xiaohong in the park yesterday" as an example, the text involves 2 of these roles: the semantic role of "Xiaoming" is the agent, and the semantic role of "Xiaohong" is the theme. The target text can therefore be represented as the 5-dimensional vector [0.5, 0, 0, 0, 0.5]. The first component 0.5 is the ratio of the frequency of the "agent" role (1 occurrence) to the frequency of all semantic roles in the target text (2 occurrences); the second component 0 is the ratio of the frequency of the "party" role (0 occurrences) to the same total. The other components are computed in the same way.
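The normalized frequency vector of this worked example can be reproduced in a few lines; the five English role names are assumed stand-ins for the user-selected roles.

```python
# Assumed English names for the five user-selected semantic roles.
ROLES = ["agent", "party", "experiencer", "patient", "theme"]

def role_vector(labeled_roles):
    # labeled_roles: the semantic role labeled for each entity in one text.
    # Each role is one vector dimension; its component is that role's
    # frequency divided by the frequency of all roles in the text.
    counts = [labeled_roles.count(r) for r in ROLES]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

# "Xiaoming" labeled agent, "Xiaohong" labeled theme:
print(role_vector(["agent", "theme"]))  # → [0.5, 0.0, 0.0, 0.0, 0.5]
```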
It can be understood that, because the semantic roles of entities describe events in the target text and the logical relations between its sentences, extracting the semantic roles of the entities in each target text realizes topic extraction at the sentence level.
C. Wording features.
The wording features may include a text vector of the target text.
In some embodiments, the wording of a text consists of all the words contained in the target text that have a specific meaning. Continuing with the target text "Xiaoming met Xiaohong in the park yesterday" as an example, after function words without specific meaning are removed, the words with specific meaning contained in the target text are "Xiaoming", "yesterday", "park", "met" and "Xiaohong".
Optionally, extracting the wording features may be converting all words with specific meaning in the target text into feature vectors and extracting the feature vectors corresponding to these words. The feature vectors corresponding to all the words with specific meaning are taken as the wording features of the target text.
Converting all words with specific meaning in the target text into feature vectors may be done by counting the frequency with which each word appears in the target text and vectorizing the counted frequencies.
In some embodiments, all words with specific meaning in the target text may be converted into feature vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, whose implementation principle is as follows:
Firstly, the TF in the target text is calculated, where TF represents the frequency with which a certain word appears in a document. To make different texts comparable, the word frequency can be normalized; the normalized TF is calculated as:
$$\mathrm{TF}(t) = \frac{n_t}{\sum_{k} n_k}$$
where $n_t$ is the number of occurrences of word $t$ in the target text and $\sum_k n_k$ is the total number of words in the target text.
Secondly, the IDF in the target text is calculated, where IDF denotes the inverse document frequency, a measure of the general importance of a word. Calculating the IDF requires an inverse-document corpus representing the environment in which the words are used; the IDF is calculated as:
$$\mathrm{IDF}(t) = \log\frac{N}{1 + |\{d : t \in d\}|}$$
where $N$ is the number of documents in the corpus and $|\{d : t \in d\}|$ is the number of documents containing word $t$ (the 1 in the denominator avoids division by zero).
Finally, the TF-IDF is calculated according to the formula TF-IDF = TF × IDF, and the calculated TF-IDF values are vectorized to obtain the wording features of the target text.
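The TF-IDF computation described above can be sketched as follows; the +1 smoothing in the IDF denominator and the function signature are implementation assumptions, not details fixed by the patent:

```python
# Sketch of TF-IDF vectorization over a fixed vocabulary.
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, vocab):
    """doc_tokens: the tokenized target text; corpus: list of tokenized
    documents forming the inverse-document corpus; vocab: ordered word list.
    Returns the TF-IDF vector of the document over vocab."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    vector = []
    for word in vocab:
        tf = counts.get(word, 0) / total              # normalized term frequency
        df = sum(1 for doc in corpus if word in doc)  # document frequency
        idf = math.log(n_docs / (1 + df))             # inverse document frequency
        vector.append(tf * idf)                       # TF-IDF = TF x IDF
    return vector

vec = tf_idf_vector(["a", "b"], [["a", "b"], ["a", "c"]], ["a", "b", "c"])
```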
It can be understood that, since the wording features are extracted from all the words with specific meaning contained in the target text, and these words represent the textual content of the target text, extracting the wording features of each target text realizes topic extraction at the text level.
And S230, for each perspective type, calculating the feature similarity between every two target texts based on the text features.
In the embodiment of the disclosure, after extracting text features of the target texts from at least two of the entity perspective type, the entity semantic perspective type and the wording perspective type, the abstract generation device calculates the feature similarity between every two target texts according to the at least two extracted text features.
Calculating the feature similarity between every two target texts from the entity perspective type may be calculating the similarity of the entity features contained in the two target texts. Generally, the higher the similarity of the entities such as times, places and people contained in two target texts, the higher the probability that the two target texts describe the same topic.
Specifically, the entity feature similarity between every two target texts can be calculated by a similarity algorithm.
Optionally, in the case that the text features include entity features, the entity feature similarity of every two target texts may be calculated using the Jaccard similarity algorithm. The basic principle of the Jaccard similarity algorithm is that, given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of their union:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
A larger Jaccard value indicates a higher degree of similarity. In the embodiment of the disclosure, the entity feature similarity calculated for every two target texts by the Jaccard similarity algorithm takes a value between 0 and 1; the larger the value, the higher the similarity of the entities contained in the two target texts, and the more likely they describe the same topic.
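A minimal sketch of the Jaccard calculation over two entity sets described above; the entity names in the example are illustrative:

```python
# Entity-feature similarity: |A ∩ B| / |A ∪ B| for two entity sets.
def jaccard_similarity(entities_a, entities_b):
    a, b = set(entities_a), set(entities_b)
    if not a and not b:
        return 0.0  # two empty entity sets: treat as no evidence of similarity
    return len(a & b) / len(a | b)

# Two shared entities out of four distinct ones -> 2/4 = 0.5.
sim = jaccard_similarity({"Xiaoming", "Xiaohong", "park"},
                         {"Xiaoming", "park", "school"})
```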
Calculating the feature similarity between every two target texts from the entity semantic perspective type may be calculating the semantic role similarity of the entities in the entity intersection of the two target texts. Generally, the higher the semantic role similarity of the entities contained in two target texts, the higher the probability that the two target texts describe the same topic.
Specifically, the semantic role similarity of the entities between every two target texts can be determined from the similarity between the semantic role vectors of the entities in the entity intersection of the two target texts; a weighted average of these per-entity similarity values, weighted by the frequency with which the entities appear, then gives the semantic role similarity value.
Optionally, in the case that the text features include entity semantic features, the Jensen-Shannon divergence (JS), a distance measure between probability distributions, may be used to calculate the similarity between the semantic role vectors of the entities in the entity intersection of every two target texts; the per-entity similarities are then weighted and averaged according to the frequency with which the entities appear, yielding the semantic role similarity of the entities in the entity intersection.
Specifically, the formula for calculating the divergence between the semantic role distributions of an entity in two target texts with the JS algorithm is:
$$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2}\mathrm{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2}\mathrm{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right)$$
where KL denotes the Kullback-Leibler divergence.
wherein, P represents the semantic role distribution of the entity in the target text i, and Q represents the semantic role distribution of the entity in the target text j.
The per-entity semantic role similarity values in the intersection are weighted and averaged according to the frequency with which the entities appear; the semantic role similarity of the entities in the entity intersection is calculated as:
$$\mathrm{Sim}_{sem}(i, j) = \frac{\sum_{e \in E_i \cap E_j} v_e \left(1 - \mathrm{JS}(P_e \,\|\, Q_e)\right)}{\sum_{e \in E_i \cap E_j} v_e}$$
wherein $v_e$ represents the normalized frequency weight corresponding to entity $e$, $P_e$ represents the semantic role distribution of entity $e$ in target text i, and $Q_e$ represents the semantic role distribution of entity $e$ in target text j.
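The JS-divergence-based entity semantic similarity described above might be sketched as below; converting divergence to similarity via 1 − JS, the base-2 logarithm (which bounds JS in [0, 1]), and the exact frequency-weighted averaging scheme are assumptions about details the text leaves open:

```python
# Sketch of Jensen-Shannon divergence and the weighted entity semantic
# similarity between two target texts.
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M) with M = (P+Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entity_semantic_similarity(shared_entities, dist_i, dist_j, weights):
    """Frequency-weighted average of per-entity similarities 1 - JS(P_e||Q_e).
    dist_i/dist_j map each shared entity to its semantic-role distribution in
    target text i / j; weights map each entity to its frequency weight."""
    num = sum(weights[e] * (1 - js_divergence(dist_i[e], dist_j[e]))
              for e in shared_entities)
    den = sum(weights[e] for e in shared_entities)
    return num / den if den else 0.0
```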
Calculating the feature similarity between every two target texts from the wording perspective type may be calculating the wording feature similarity of the two target texts. Generally, the higher the similarity of the wording of two target texts, the higher the probability that the two target texts describe the same topic.
Optionally, in the case that the text features include wording features, the wording feature similarity may be calculated by the cosine similarity algorithm. Cosine similarity judges the similarity of two vectors by calculating the angle between them: the smaller the angle, the more similar the two vectors. The cosine similarity is calculated as:
$$\mathrm{Sim}_{word}(i, j) = \frac{\vec{t}_i \cdot \vec{t}_j}{\|\vec{t}_i\| \, \|\vec{t}_j\|}$$
wherein $\vec{t}_i$ represents the feature vector of target text i and $\vec{t}_j$ represents the feature vector of target text j.
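A small sketch of the cosine similarity over two wording feature vectors as described above (the zero-vector guard is an implementation assumption):

```python
# Wording-feature similarity between the feature vectors of two texts.
import math

def cosine_similarity(vec_i, vec_j):
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a zero vector shares no wording with anything
    return dot / (norm_i * norm_j)
```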
S240, calculating the semantic distance between every two target texts based on the feature similarity for each perspective type.
In the embodiment of the present disclosure, after extracting the text features of the target text, the digest generation device may calculate the semantic distance between every two target texts based on the feature similarity determined according to the text features.
Specifically, the abstract generating device may start with any target text, traverse all the target texts, and calculate a semantic distance between any two target texts in the traversal process.
In some embodiments, calculating the semantic distance between two target texts based on the feature similarity for each perspective type comprises: adding the feature similarities for the perspective types to obtain a semantic metric value, and taking the reciprocal of the semantic metric value as the semantic distance between the two target texts, i.e.:
$$d(i, j) = \frac{1}{\mathrm{Sim}_{ent}(i, j) + \mathrm{Sim}_{sem}(i, j) + \mathrm{Sim}_{word}(i, j)}$$
wherein $d(i, j)$ represents the semantic distance between target text i and target text j, $\mathrm{Sim}_{ent}(i, j)$ represents the entity feature similarity between the two texts, $\mathrm{Sim}_{sem}(i, j)$ represents the semantic role similarity of the entities between the two texts, and $\mathrm{Sim}_{word}(i, j)$ represents the wording feature similarity between the two texts.
In other embodiments, calculating the semantic distance between two target texts based on the feature similarity for each perspective type may also be: adding the feature similarities for the perspective types, averaging the sum, and taking the average as the semantic distance between the two target texts.
In still other embodiments, calculating the semantic distance between two target texts based on the feature similarity for each perspective type may also be: setting a different weight for the feature similarity of each perspective type, summing the weighted feature similarities and then averaging, and taking this average as the semantic distance between the two target texts.
It can be understood that multiple calculation methods may be adopted to obtain the semantic distance between two target texts from the feature similarities for the perspective types, which is not limited in the embodiment of the present disclosure.
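The three combination strategies above might be gathered in one helper as sketched below; the mode names and the signature are illustrative assumptions:

```python
# Sketch of the three ways of combining per-perspective similarities into a
# semantic distance described above.
def semantic_distance(sims, weights=None, mode="reciprocal"):
    """sims: feature similarities for the perspective types
    (e.g. entity, entity-semantic, wording)."""
    if mode == "reciprocal":
        # reciprocal of the semantic metric value (the sum of similarities)
        total = sum(sims)
        return float("inf") if total == 0 else 1 / total
    if mode == "mean":
        # the average of the similarities, taken as the distance
        return sum(sims) / len(sims)
    if mode == "weighted_mean":
        # weight each perspective, sum, then average
        return sum(w * s for w, s in zip(weights, sims)) / len(sims)
    raise ValueError("unknown mode: " + mode)
```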
In the embodiment of the present disclosure, the text features may optionally include entity features, entity semantic features and wording features; the calculation process of the semantic distance is described below with reference to fig. 3 as an example.
As shown in fig. 3, in the embodiment of the present disclosure, after the abstract generation device obtains a plurality of target texts, it may calculate the similarity between every two target texts along the three dimensions of entity features, entity semantic features and wording features, and then obtain the semantic distance between every two target texts from these similarities.
Specifically, the abstract generation device extracts the entity features of the target texts through named entity recognition and calculates the entity feature similarity between every two target texts with the Jaccard similarity algorithm; meanwhile, it extracts the entity semantic features of the target texts by labeling the semantic roles of the entities and calculates the semantic role similarity of the entities between every two target texts with the JS divergence; further, it extracts the wording features of the target texts through the TF-IDF algorithm and calculates the wording feature similarity between every two target texts with the cosine similarity algorithm. The similarity between two target texts is thus calculated along the three dimensions, from which the semantic distance between every two target texts is obtained.
And S250, performing topic clustering on the target texts according to the semantic distance to obtain at least one first text set.
In the embodiment of the disclosure, after the abstract generating device calculates the semantic distance between every two target texts, the topic clustering is performed on the plurality of target texts according to the calculated semantic distance between every two target texts, so as to obtain at least one first text set.
In some embodiments, after calculating the semantic distance between any two target texts, the abstract generation device compares it with a preset threshold. If the semantic distance between the two target texts is smaller than or equal to the preset threshold, the two target texts are considered to describe the same topic and are therefore classified into the same topic set. If the semantic distance between the two target texts is greater than the preset threshold, the two target texts are considered to describe different topics and are therefore divided into different topic sets.
Further, after the abstract generation device traverses all the target texts, clustering of all the target texts is completed, yielding at least one first text set.
In other embodiments, the abstract generation device may also take the first traversed target text as the topic center of an initial topic. In the process of traversing all target texts, after the semantic distance between any one target text and the topic center of the initial topic is calculated, if the semantic distance is less than or equal to the preset threshold, the target text is considered to describe the same topic as that topic center and is therefore classified into the topic set where the topic center is located; if the semantic distance is greater than the preset threshold, the target text is considered to describe a different topic and is therefore taken as the topic center of a new topic.
Further, in the process of traversing all target texts, the abstract generation device calculates the set of semantic distances between each next target text and every existing topic center. If the minimum value in the semantic distance set is smaller than or equal to the preset threshold, the target text is considered to describe the same topic as the topic center corresponding to that minimum value and is classified into the topic set where that topic center is located; if the minimum value is larger than the preset threshold, the target text is considered to describe a topic different from all existing topic centers and is taken as the topic center of a new topic. These steps are repeated until all target texts have been traversed, yielding at least one first text set.
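The single-pass, center-based traversal described above could be sketched as follows; using the seed text itself as a fixed topic center and a pluggable distance function are assumptions for illustration:

```python
# Sketch of single-pass topic clustering: the first text seeds the initial
# topic; each later text joins the nearest topic center if within the
# threshold, otherwise it becomes the center of a new topic.
def cluster_by_topic(texts, distance_fn, threshold):
    centers, clusters = [], []
    for text in texts:
        if not centers:
            centers.append(text)          # first text seeds the initial topic
            clusters.append([text])
            continue
        distances = [distance_fn(text, c) for c in centers]
        best = min(range(len(centers)), key=lambda k: distances[k])
        if distances[best] <= threshold:
            clusters[best].append(text)   # same topic as the nearest center
        else:
            centers.append(text)          # center of a new topic
            clusters.append([text])
    return clusters

# Toy example with numbers standing in for texts and |a - b| as "distance".
clusters = cluster_by_topic([0, 1, 10, 11], lambda a, b: abs(a - b), 2)
```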
And S260, extracting the topic abstract of each first text set to obtain the topic abstract corresponding to each first text set.
In the disclosed embodiment, the step is the same as S140, and is not described herein again.
According to the abstract generation method, apparatus, device and medium, after the target texts are obtained, the text features of each target text are extracted from at least two of the entity perspective type, the entity semantic perspective type and the wording perspective type. On the one hand, this realizes topic extraction along the topic-text-sentence-word hierarchy and improves the comprehensiveness and effectiveness of topic extraction; on the other hand, it is compatible with text corpora of different lengths such as texts, sentences and words, and therefore has good extensibility. After the text features are extracted, the feature similarity between every two target texts is calculated from the different perspective types to obtain the semantic distance between every two target texts, so that the semantic distance is computed from multiple perspectives and is more accurate and effective; a plurality of topics are then obtained according to the semantic distances, and the topic abstracts extracted from them are accordingly more accurate and effective.
Fig. 4 is a schematic diagram of an implementation of another digest generation method provided in the embodiment of the present disclosure, and fig. 5 is a schematic diagram of an implementation of the another digest generation method provided in the embodiment of the present disclosure, and the digest generation method shown in fig. 4 is described below with reference to fig. 5.
As shown in fig. 4, the digest generation method may include the following steps:
S410, performing theme clustering on the plurality of preset texts to obtain at least one second text set.
In the embodiment of the disclosure, the abstract generation device extracts the theme of each preset text and clusters the preset texts by theme to obtain at least one second text set.
The preset text may be a sentence, a paragraph, or an article, etc. for selecting the target text. The source of the text may be a text in internet information, such as a text included in news, microblog, report, paper, and the like, and the source of the text may also be a text stored locally in the digest generation device, which is not limited in this disclosure.
Before the abstract generating device carries out theme clustering on a plurality of preset texts, a plurality of preset texts need to be obtained. In some embodiments, after obtaining the text for selecting the target text, the abstract generating device may directly use each obtained text as a preset text.
In other embodiments, after the summary generation device acquires the text for selecting the target text, the acquired text may be preprocessed, and the preprocessed text is used as the preset text.
Specifically, the preprocessing may include data cleaning, word segmentation, stop-word filtering and other processing. Data cleaning may include filtering emoticons, deduplicating data, filtering web page links, filtering special symbols in hypertext markup language, converting between simplified and traditional Chinese, and the like. Word segmentation may be performed with a Chinese lexical analysis tool. Stop-word filtering may filter stop words using a stop-word table.
For example, the abstract generation device may first perform data cleaning on the obtained texts to obtain cleaned texts, then segment the cleaned texts into words to obtain segmentation results, and finally perform stop-word filtering on the segmentation results to obtain the preprocessed texts.
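The cleaning → segmentation → stop-word-filtering pipeline might look like the sketch below; the regular expressions, the whitespace tokenizer (standing in for a Chinese segmentation tool) and the tiny stop-word table are all illustrative:

```python
# Sketch of the preprocessing pipeline: data cleaning, tokenization and
# stop-word filtering.
import re

STOPWORDS = {"the", "a", "of", "in"}   # illustrative stop-word table

def preprocess(raw_text):
    # data cleaning: drop web links, HTML tags and non-word symbols
    text = re.sub(r"https?://\S+", " ", raw_text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # tokenization (whitespace split stands in for Chinese word segmentation)
    tokens = text.lower().split()
    # stop-word filtering
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The <b>park</b> of Beijing! see https://example.com")
```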
Optionally, S410 may include: extracting latent theme information of each preset text to obtain a theme distribution vector corresponding to each preset text; and performing theme clustering on the plurality of preset texts based on the theme distribution vectors to obtain at least one second text set.
Specifically, the abstract generation device first obtains the preset texts and extracts the theme of each preset text with a text theme extraction model to obtain the theme distribution vectors; it then clusters the obtained theme distribution vectors with a clustering algorithm, thereby clustering the preset texts by theme into a plurality of second text sets.
In the disclosed embodiment, the theme extraction model is a model of the latent themes in text, where each theme is in fact a probability distribution over the words of a vocabulary. The theme extraction model therefore assigns each word in a preset text to the themes of that text with a certain probability.
In some embodiments, the theme extraction model may be a Latent Dirichlet Allocation (LDA) model. According to the textual content of each preset text, the LDA model extracts the words of the text to obtain a word probability distribution vector, and then calculates the latent theme distribution vector of each preset text from that word probability distribution vector.
Illustratively, if a preset text contains the words "apple", "banana" and "pear", the probability of each word is about 33%, and vectorizing these probabilities yields the word probability distribution vector; according to this distribution, 100% of the theme of the preset text is related to fruit, which gives the theme distribution vector of the preset text.
It can be understood that each preset text contains a plurality of words, and therefore when the theme extraction model is used for extracting the theme in each preset text, a plurality of themes may be extracted from each preset text.
In the embodiment of the present disclosure, after obtaining the theme distribution vector of each preset text, the abstract generation device uses a clustering algorithm to cluster preset texts describing the same or similar themes into the same cluster, i.e. the same second text set, and to divide preset texts describing different or dissimilar themes into different clusters, i.e. different second text sets.
Furthermore, after the abstract generation device performs theme clustering on the plurality of preset texts, each resulting second text set corresponds to one theme, and the preset texts in each second text set describe the same or similar themes.
In some embodiments, the clustering algorithm may be the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, which can cluster preset texts describing the same theme into one set and put preset texts describing different themes into different sets.
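A compact sketch of the DBSCAN procedure over theme distribution vectors; the parameterization (eps, min_pts, pluggable distance) follows the standard algorithm rather than anything specific to the patent:

```python
# Minimal DBSCAN sketch: points whose eps-neighborhood holds at least
# min_pts points become cores; clusters grow from cores, and points
# reachable from no core are labeled noise (-1).
def dbscan(points, eps, min_pts, dist):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        labels[i] = cluster                # i is a core point
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: claimed, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core: keep expanding
        cluster += 1
    return labels

# Toy 1-D example: two dense groups and one isolated (noise) point.
labels = dbscan([0.0, 0.1, 0.2, 5.0, 5.1, 9.0], 0.3, 2, lambda a, b: abs(a - b))
```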
S420, acquiring a target text set in each second text set, wherein the target text set comprises target texts.
In the embodiment of the present disclosure, after obtaining at least one second text set, the abstract generating device selects one second text set, uses a plurality of preset texts included in the second text set as a target text set, and uses each preset text included in the second text set as a target text.
It can be understood that, since each second text set corresponds to a theme, after the abstract generation device selects one second text set, the subsequent steps are performed under the theme corresponding to that second text set.
Optionally, the selection of one second text set from the at least one second text set may be random selection, or may be selection according to a user requirement or a set rule, which is not limited in this disclosure.
And S430, extracting text features of the target texts aiming at each target text, wherein the text features comprise features of a plurality of view angle types related to semantic distances.
In the embodiment of the present disclosure, the step is the same as the step S120, and is not described herein again.
S440, performing topic clustering on the plurality of target texts based on the text features of the target texts to obtain at least one first text set.
In the embodiment of the present disclosure, the step is the same as the step S130, and is not described herein again.
S450, extracting the topic abstract of each first text set to obtain the topic abstract corresponding to each first text set.
In the embodiment of the present disclosure, the step is the same as the step S140, and is not described herein again.
It can be understood that, in the embodiment of the present disclosure, when theme clustering of the plurality of preset texts yields a plurality of second text sets, the abstract generation device first selects one second text set and performs steps S430 to S450 to extract the topic abstracts within it; after the topic abstract extraction for that second text set is completed, another second text set can be selected and steps S430 to S450 performed again; traversing all the second text sets and performing steps S430 to S450 in sequence extracts all the topic abstracts.
According to the abstract generation method, apparatus, device and medium, after the preset texts are obtained, theme clustering is performed on them to obtain a plurality of second text sets; topic clustering is then performed within each second text set to obtain a plurality of first text sets, and a topic abstract is extracted from each first text set. This realizes top-down multi-level extraction along the theme-topic-topic abstract hierarchy. Because topic extraction is carried out under a single theme, the efficiency of topic extraction is improved and the time complexity is reduced.
The following introduces an implementation principle of the abstract generation method provided by the embodiment of the present disclosure.
As shown in fig. 5, the abstract generation method provided in the embodiment of the present disclosure realizes a top-down multi-level extraction method based on themes, topics and topic abstracts. Specifically, the abstract generation device performs theme clustering on a plurality of preset texts to obtain at least one second text set, each corresponding to one theme. It then selects a second text set as the target text set, extracts the text features of the target texts in the target text set, and performs topic clustering based on these text features to obtain at least one first text set, each corresponding to one topic. Finally, it selects a first text set and extracts the topic abstract of that first text set. In this way, traversing all the second text sets completes the topic clustering under every theme, and traversing all the first text sets completes the topic abstract extraction under every topic.
Fig. 6 is a schematic structural diagram of a summary generation apparatus according to an embodiment of the present disclosure. The summary generation means may be the summary generation device as described in the above embodiments, or the summary generation means may be a component or assembly in the summary generation device. The digest generation apparatus provided in the embodiment of the present disclosure may execute the processing procedure provided in the digest generation method embodiment, as shown in fig. 6, the digest generation apparatus 60 includes:
an obtaining module 61, configured to obtain multiple target texts;
an extraction module 62, configured to, for each target text, extract text features of the target text, where the text features include features of multiple perspective types related to semantic distance;
the clustering module 63 is configured to perform topic clustering on the multiple target texts based on text features of the target texts to obtain at least one first text set;
and an extracting module 64, configured to perform topic abstraction extraction on each first text set to obtain a topic abstraction corresponding to each first text set.
Optionally, the text features include at least two of the following: entity features, wherein the entity features comprise the entities involved in the target text; entity semantic features, wherein the entity semantic features comprise normalized frequency vectors corresponding to the semantic roles of the entities involved in the target text; and wording features, wherein the wording features comprise a text vector of the target text.
Optionally, the extraction module 62, when extracting for each target text the text features of the target text, where the text features include features of multiple perspective types related to semantic distance, is specifically configured to: perform semantic role labeling on each entity involved in the target text to obtain the semantic role of each entity; count the frequency with which the semantic roles of each entity appear in the target text to obtain the frequency corresponding to each semantic role; and generate the normalized frequency vector corresponding to the semantic roles of each entity according to these frequencies.
Optionally, the clustering module 63, when performing topic clustering on the plurality of target texts based on the text features of the target texts to obtain at least one first text set, is specifically configured to: calculate the feature similarity for each perspective type between every two target texts based on the text features; calculate, for every two target texts, the semantic distance between the two target texts based on the feature similarity for each perspective type; and perform topic clustering on the plurality of target texts according to the semantic distances to obtain at least one first text set.
Optionally, the clustering module 63, when calculating the semantic distance between two target texts based on the feature similarity for each perspective type, is specifically configured to: add the feature similarities for the perspective types to obtain a semantic metric value; and take the reciprocal of the semantic metric value as the semantic distance between the two target texts.
Optionally, the obtaining module 61 includes a clustering unit and an obtaining unit; the clustering unit is used for performing theme clustering on the plurality of preset texts to obtain at least one second text set; the acquisition unit is used for acquiring a target text set in at least one second text set, wherein the target text set comprises a plurality of target texts.
Optionally, the clustering unit, when performing theme clustering on the plurality of preset texts to obtain at least one second text set, is specifically configured to: extract the latent theme information of each preset text to obtain a theme distribution vector corresponding to each preset text; and perform theme clustering on the plurality of preset texts based on the theme distribution vectors to obtain at least one second text set.
The digest generation apparatus in the embodiment shown in fig. 6 may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a schematic structural diagram of a digest generation device according to an embodiment of the present disclosure. The digest generation apparatus provided in the embodiment of the present disclosure may execute the processing procedure provided in the digest generation method embodiment, as shown in fig. 7, the digest generation apparatus 70 includes: memory 71, processor 72, computer programs and communication interface 73; wherein the computer program is stored in the memory 71 and is configured to execute the digest generation method described above by the processor 72.
In addition, an embodiment of the present disclosure further provides a medium on which a computer program is stored, and the computer program, when executed by a processor, implements the summary generation method of the above embodiments.
Furthermore, an embodiment of the present disclosure further provides a computer program product, which includes a computer program or instructions; when the computer program or instructions are executed by a processor, the summary generation method of the above embodiments is implemented.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing description covers only specific embodiments of the present disclosure, so as to enable those skilled in the art to understand or implement it. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating a summary, the method comprising:
acquiring a plurality of target texts;
extracting, for each of the target texts, text features of the target text, the text features including features of a plurality of perspective types related to semantic distance;
performing topic clustering on the target texts based on the text features of the target texts to obtain at least one first text set;
and performing topic summary extraction on each first text set to obtain a topic summary corresponding to each first text set.
2. The method of claim 1, wherein the text features include at least two of:
entity features, wherein the entity features comprise various entities related to the target text;
entity semantic features, wherein the entity semantic features comprise normalized frequency vectors corresponding to semantic roles of various entities related to the target text;
Chinese word features, wherein the Chinese word features comprise a text vector of the target text.
3. The method of claim 1, wherein the textual features comprise entity semantic features;
wherein the extracting text features of the target text comprises:
performing semantic role labeling on each entity related to the target text to obtain the semantic role of each entity;
counting the frequency with which the semantic role of each entity appears in the target text to obtain a frequency corresponding to each semantic role;
and generating the normalized frequency vector corresponding to the semantic role of each entity according to the frequency corresponding to each semantic role.
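The counting and normalization steps of claim 3 can be sketched as follows (Python; the entity-to-role pairs are a hypothetical output of an upstream semantic role labeler, which is outside the scope of this sketch):

```python
from collections import Counter

# Hypothetical output of an upstream semantic role labeler:
# one (entity, semantic role) pair per entity mention in the target text.
labeled = [
    ("Acme", "agent"), ("Acme", "agent"), ("contract", "patient"),
    ("Monday", "temporal"), ("Acme", "patient"),
]

def role_frequency_vector(labeled_entities, roles):
    """Count how often each semantic role appears in the target text, then
    normalize the counts into a frequency vector over a fixed role set."""
    counts = Counter(role for _, role in labeled_entities)
    total = sum(counts.values()) or 1
    return [counts[r] / total for r in roles]

roles = ["agent", "patient", "temporal"]
vector = role_frequency_vector(labeled, roles)  # [0.4, 0.4, 0.2]
```

The resulting normalized vector is what claim 2 calls the entity semantic feature, and it feeds into the per-perspective similarity computation of claim 4.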
4. The method of claim 1, wherein the performing topic clustering on the plurality of target texts based on the text features of each target text to obtain at least one first text set comprises:
calculating feature similarity for each perspective type between every two target texts based on the text features;
for every two target texts, calculating a semantic distance between the two target texts based on the feature similarity for each perspective type;
and performing topic clustering on the plurality of target texts according to the semantic distance to obtain the at least one first text set.
5. The method of claim 4, wherein calculating a semantic distance between two of the target texts based on the feature similarity for each of the perspective types comprises:
adding the feature similarities for the perspective types to obtain a semantic metric value;
and taking the reciprocal of the semantic metric value as the semantic distance between the two target texts.
6. The method of claim 1, wherein obtaining the plurality of target texts comprises:
performing topic clustering on a plurality of preset texts to obtain at least one second text set;
and acquiring a target text set from each second text set, wherein the target text set comprises the plurality of target texts.
7. The method of claim 6, wherein topic clustering the plurality of preset texts to obtain at least one second text set comprises:
extracting latent topic information from each preset text to obtain a topic distribution vector corresponding to each preset text;
and performing topic clustering on the plurality of preset texts based on the topic distribution vector to obtain at least one second text set.
8. An apparatus for summary generation, the apparatus comprising:
an obtaining module, configured to obtain a plurality of target texts;
an extraction module, configured to extract, for each of the target texts, text features of the target text, wherein the text features include features of a plurality of perspective types related to semantic distance;
a clustering module, configured to perform topic clustering on the plurality of target texts based on the text features of each target text to obtain at least one first text set;
and a summary extraction module, configured to perform topic summary extraction on each first text set to obtain a topic summary corresponding to each first text set.
9. A summary generation device, comprising:
A memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the summary generation method of any one of claims 1-7.
10. A medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the summary generation method according to any one of claims 1 to 7.
CN202210516005.4A 2022-05-12 2022-05-12 Abstract generation method, apparatus, device and medium Active CN114722836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210516005.4A CN114722836B (en) 2022-05-12 2022-05-12 Abstract generation method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN114722836A true CN114722836A (en) 2022-07-08
CN114722836B CN114722836B (en) 2022-09-02

Family

ID=82230983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210516005.4A Active CN114722836B (en) 2022-05-12 2022-05-12 Abstract generation method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN114722836B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
US20180365323A1 (en) * 2017-06-16 2018-12-20 Elsevier, Inc. Systems and methods for automatically generating content summaries for topics
CN109635103A (en) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Abstraction generating method and device
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN113535940A (en) * 2020-04-17 2021-10-22 阿里巴巴集团控股有限公司 Event abstract generation method and device and electronic equipment
CN114428859A (en) * 2022-01-26 2022-05-03 京东科技信息技术有限公司 Text abstract generating method and device

Also Published As

Publication number Publication date
CN114722836B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
Tahmasebi et al. Survey of computational approaches to lexical semantic change detection
CN109189942B (en) Construction method and device of patent data knowledge graph
Varathan et al. Comparative opinion mining: a review
Zhang et al. Aspect and entity extraction for opinion mining
US8676730B2 (en) Sentiment classifiers based on feature extraction
Read et al. Weakly supervised techniques for domain-independent sentiment classification
US8751218B2 (en) Indexing content at semantic level
US8983828B2 (en) System and method for extracting and reusing metadata to analyze message content
Pavlopoulos Aspect based sentiment analysis
Rajagopal et al. Commonsense-based topic modeling
Rafeeque et al. A survey on short text analysis in web
Raghuvanshi et al. A brief review on sentiment analysis
Roy et al. Discovering and understanding word level user intent in web search queries
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
Adam et al. Sentiment analysis on movie review using Naïve Bayes
Tariku et al. Sentiment Mining and Aspect Based Summarization of Opinionated Afaan Oromoo News Text
Vaishnavi et al. Paraphrase identification in short texts using grammar patterns
Al-Sultany et al. Enriching tweets for topic modeling via linking to the wikipedia
CN114722836B (en) Abstract generation method, apparatus, device and medium
Litvak et al. Multilingual Text Analysis: Challenges, Models, and Approaches
CN114255067A (en) Data pricing method and device, electronic equipment and storage medium
Babenko Determining sentiment and important properties of Ukrainian-language user reviews
Vanetik et al. Multilingual text analysis: History, tasks, and challenges
Kubek et al. Automatic taxonomy extraction through mining social networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant