CN116108165A - Text abstract generation method and device, storage medium and electronic equipment - Google Patents

Text abstract generation method and device, storage medium and electronic equipment

Info

Publication number
CN116108165A
CN116108165A (application CN202310347275.1A)
Authority
CN
China
Prior art keywords
target
target sentence
round
sentence
sentences
Prior art date
Legal status
Granted
Application number
CN202310347275.1A
Other languages
Chinese (zh)
Other versions
CN116108165B (en)
Inventor
韩国权
蔡惠民
高山
董厚泽
支婷
洒科进
曹扬
Current Assignee
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd
Priority to CN202310347275.1A
Publication of CN116108165A
Application granted
Publication of CN116108165B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text abstract generation method and device, a storage medium, and electronic equipment. The method comprises the following steps: extracting keywords from a target text; expanding the number of occurrences of each keyword in the original word sequences that contain it, based on the keyword's importance degree, to obtain an effective word sequence for each target sentence; determining the relatedness between each target sentence and the other target sentences according to the effective word sequences; determining the influence weight of each target sentence according to the relatedness; and forming a text abstract of the target text based on the several target sentences with the highest influence weights. In the technical scheme provided by the embodiments of the invention, the keywords are extracted first and the effective word sequences with expanded keyword counts are determined; based on the effective word sequences, the relatedness between target sentences needed for extracting the text abstract can be represented more accurately, so the influence weights of the target sentences can be determined more accurately and the text abstract can be extracted more accurately.

Description

Text abstract generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of text summarization, and in particular to a text summary generation method and device, a storage medium, and electronic equipment.
Background
Text summary generation is one of the main directions of text generation tasks; it is an information compression technique that uses various techniques to automatically convert a text into a short summary. Currently, there are two main ways of generating text summaries: extractive and generative. The extractive way selects one or several sentences from the text to form the summary. The generative way automatically generates a summary on the basis of the text in an end-to-end process.
In carrying out the invention, the inventors found that:
the existing extractive approaches can extract summaries based on heuristics, graphs, and the like, but the resulting summaries are often of poor quality; the generative approach requires training an encoder-decoder, for example a Seq2Seq model, which needs a large amount of training data for learning, and is therefore mainly used in fields with rich datasets, such as news, while its effect in other fields is mediocre.
Disclosure of Invention
In order to solve the problem that existing schemes find it difficult to extract a text abstract quickly and accurately, embodiments of the present invention aim to provide a text abstract generation method and device, a storage medium, and electronic equipment.
In a first aspect, an embodiment of the present invention provides a text summary generating method, including:
acquiring a target text to be processed;
extracting keywords in the target text;
determining an original word sequence of each target sentence in the target text, and, for original word sequences that contain keywords, expanding the number of those keywords based on their importance degree to obtain an effective word sequence of the target sentence; the expansion count of a keyword is positively correlated with its importance degree;
determining the relatedness between each target sentence and the other target sentences according to the similarity between the effective word sequence of the target sentence and the effective word sequences of the other target sentences;
determining the influence weight of each target sentence according to the relatedness between the target sentence and the other target sentences, wherein the influence weight of a target sentence is used to represent the influence of the target sentence within the target text;
and forming a text abstract of the target text based on a plurality of target sentences with highest influence weights.
Optionally, the determining the influence weight of the target sentence according to the relatedness between the target sentence and other target sentences includes:
iteratively executing multiple rounds of an influence weight update operation until an iteration end condition is met, and taking the influence weight determined when the iteration ends as the influence weight of the corresponding target sentence;
wherein the influence weight update operation of the k-th round includes:
updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence; the i-th target sentence and the j-th target sentence are any two target sentences in the target text; the k-th round relatedness between the i-th target sentence and the j-th target sentence is positively correlated with the (k-1)-th round influence weight of the i-th target sentence and with the (k-1)-th round influence weight of the j-th target sentence;
generating a k-th round adjacency matrix M_k, wherein an element of the adjacency matrix M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence;
updating the (k-1)-th round influence weight of each target sentence according to the k-th round adjacency matrix M_k, and determining the k-th round influence weight of each target sentence, wherein the k-th round influence weight of each target sentence satisfies:

TR_k = \frac{1-d}{n}\, r + d\, M_k^{\top}\, \mathrm{diag}(m_{k1}, \ldots, m_{kn})^{-1}\, TR_{k-1}

wherein n represents the total number of target sentences; TR_k(V_i) represents the k-th round influence weight of the i-th target sentence V_i, TR_{k-1}(V_i) represents its (k-1)-th round influence weight, and TR_k = (TR_k(V_1), TR_k(V_2), …, TR_k(V_n))^T; m_{ki} represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, …, n; d represents a preset damping coefficient, 0 < d < 1; and r represents the n-dimensional column vector with all elements equal to 1.
Optionally, the updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence, includes:
determining a relatedness correction term Δw_k between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence:

\Delta w_k = \frac{a}{1 + e^{T/2 - k}}\; f\!\left(TR_{k-1}(V_i)\right) f\!\left(TR_{k-1}(V_j)\right)

wherein a is a preset coefficient, 0 < a < 0.5; k represents the current round, and T is the preset total number of iteration rounds; f(·) is a preset function, f(TR_{k-1}(V_i)) is positively correlated with the (k-1)-th round influence weight TR_{k-1}(V_i) of the i-th target sentence V_i, and f(TR_{k-1}(V_i)) < 1;
adding the correction term Δw_k to the (k-1)-th round relatedness between the i-th target sentence and the j-th target sentence to generate the k-th round relatedness between the i-th target sentence and the j-th target sentence.
Optionally, the preset function satisfies:

f\!\left(TR_{k-1}(V_i)\right) = TR_{k-1}(V_i);

or,

f\!\left(TR_{k-1}(V_i)\right) = \sum_{u=1}^{L} \lambda_u\, TR_{k-L+u-1}(V_i)

wherein the weights satisfy \lambda_1 < \lambda_2 < \cdots < \lambda_L and \sum_{u=1}^{L} \lambda_u = 1; u = 1, 2, …, L, L being a preset positive integer.
In a second aspect, an embodiment of the present invention further provides a text summary generating device, including:
the acquisition module is used for acquiring a target text to be processed;
the keyword extraction module is used for extracting keywords in the target text;
the word sequence updating module is used for determining an original word sequence of each target sentence in the target text and, for original word sequences that contain keywords, expanding the number of those keywords based on their importance degree to obtain an effective word sequence of the target sentence; the expansion count of a keyword is positively correlated with its importance degree;
the relevance determining module is used for determining the relevance between the target sentence and other target sentences according to the similarity between the effective word sequence of the target sentence and the effective word sequences of other target sentences;
the influence weight determining module is used for determining the influence weight of each target sentence according to the relatedness between the target sentence and other target sentences, wherein the influence weight of a target sentence is used to represent the influence of the target sentence in the target text;
and the abstract module is used for forming a text abstract of the target text based on a plurality of target sentences with highest influence weights.
Optionally, the influence weight determining module determines the influence weight of the target sentence according to the relatedness between the target sentence and other target sentences, including:
iteratively executing multiple rounds of an influence weight update operation until an iteration end condition is met, and taking the influence weight determined when the iteration ends as the influence weight of the corresponding target sentence;
wherein the influence weight update operation of the k-th round includes:
updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence; the i-th target sentence and the j-th target sentence are any two target sentences in the target text; the k-th round relatedness between the i-th target sentence and the j-th target sentence is positively correlated with the (k-1)-th round influence weight of the i-th target sentence and with the (k-1)-th round influence weight of the j-th target sentence;
generating a k-th round adjacency matrix M_k, wherein an element of the adjacency matrix M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence;
updating the (k-1)-th round influence weight of each target sentence according to the k-th round adjacency matrix M_k, and determining the k-th round influence weight of each target sentence, wherein the k-th round influence weight of each target sentence satisfies:

TR_k = \frac{1-d}{n}\, r + d\, M_k^{\top}\, \mathrm{diag}(m_{k1}, \ldots, m_{kn})^{-1}\, TR_{k-1}

wherein n represents the total number of target sentences; TR_k(V_i) represents the k-th round influence weight of the i-th target sentence V_i, TR_{k-1}(V_i) represents its (k-1)-th round influence weight, and TR_k = (TR_k(V_1), TR_k(V_2), …, TR_k(V_n))^T; m_{ki} represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, …, n; d represents a preset damping coefficient, 0 < d < 1; and r represents the n-dimensional column vector with all elements equal to 1.
In a third aspect, an embodiment of the present invention further provides a computer storage medium storing computer-executable instructions for performing the text summary generation method described in any one of the foregoing aspects.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the text excerpt generation methods described above.
In the scheme provided by the first aspect of the embodiments of the invention, the keywords in the target text are extracted first, and the number of keywords in each target sentence is then expanded to form an effective word sequence in which the keywords carry greater influence; the relatedness between target sentences is determined based on the effective word sequences, and the influence weights of the target sentences are determined in turn, so that the target sentences suitable for the text abstract can be selected. Because the keywords are extracted first and the keyword-expanded effective word sequences are then determined, the relatedness between target sentences needed for extracting the text abstract can be represented more accurately based on the effective word sequences, so the influence weights of the target sentences can be determined more accurately and the text abstract can be extracted more accurately. In addition, when the influence weights are determined based on the relatedness, no model needs to be trained, the result is not affected by training data, and the processing efficiency is higher.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 shows a flowchart of a text summary generation method provided by an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a text abstract generating device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device for executing a text abstract generating method according to an embodiment of the invention.
Detailed Description
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on those shown in the drawings, and are merely for convenience in describing the present invention and simplifying the description; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The method for generating the text abstract provided by the embodiment of the invention is shown in fig. 1, and comprises the following steps:
step 101: and obtaining the target text to be processed.
The text to be processed is the text from which a text abstract is to be extracted; for convenience of description, it is called the target text. For example, the target text may be an article from which a text abstract needs to be extracted.
Step 102: extract the keywords in the target text.
The target text contains words that can better represent the meaning of the text, namely keywords. For example, keywords may be extracted based on the TF-IDF (term frequency-inverse document frequency) index, or based on the TextRank algorithm, and so on, which is not limited in this embodiment.
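By way of illustration only (this sketch is not part of the patent disclosure), step 102 could be implemented in Python roughly as follows; the function name tfidf_keywords, the treatment of each sentence as one "document", and the particular IDF smoothing are assumptions of the sketch, since the embodiment leaves the extraction method open:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=10):
    """Rank words by TF-IDF over `docs`, a list of token lists (here, one
    list per sentence of the target text). Returns (word, score) pairs,
    highest score first; the score can serve as the importance degree."""
    n_docs = len(docs)
    df = Counter()                                  # document frequency per word
    for doc in docs:
        df.update(set(doc))
    tf = Counter(w for doc in docs for w in doc)    # term frequency over the text
    total = sum(tf.values())
    scores = {w: (tf[w] / total) * math.log(n_docs / (1 + df[w])) for w in tf}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

The returned scores double as the importance degrees used for the keyword expansion in step 103 below.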
Step 103: determine an original word sequence of each target sentence in the target text, and, for original word sequences containing keywords, expand the number of those keywords based on their importance degree to obtain an effective word sequence of the target sentence. The expansion count of a keyword is positively correlated with its importance degree.
First, the target text is split into sentences, i.e., the target sentences are extracted, and a word sequence, namely the original word sequence, is formed for each target sentence. For example, the target sentence may be segmented into words (this may reuse the word segmentation involved in step 102 above, so no repeated segmentation is required) and part-of-speech tagging may be performed; stop words are then filtered out so that only meaningful words, such as nouns, verbs, and adjectives, are retained, yielding the original word sequence of the target sentence, which contains at least one word of the target sentence in the original order. The target text contains a plurality of target sentences, and the original word sequence of each target sentence can be determined.
Some target sentences in the target text contain keywords, while the rest do not. For a target sentence containing a keyword, the embodiment of the invention expands the number of occurrences of that keyword in the original word sequence of the target sentence, i.e., expands the keyword from one occurrence to several, and the expanded original word sequence is called the effective word sequence. For example, suppose the original word sequence of a target sentence A is [a_1, a_2, a_3, a_4, a_5], where a_i represents a word in the target sentence; if a_3 is a keyword, the number of occurrences of a_3 can be expanded, and the expanded original word sequence (i.e., the effective word sequence) can be [a_1, a_2, a_3, a_3, a_3, a_4, a_5], i.e., the keyword a_3 is expanded from one occurrence to three. For a target sentence that does not contain any keyword, its original word sequence is directly used as its effective word sequence.
In addition, in the embodiment of the invention, when the number of keywords is expanded, the expansion count of a keyword is positively correlated with its importance degree, i.e., the higher the importance degree of the keyword, the more occurrences it is expanded to. The importance degree of a keyword is an index determined when extracting the keywords from the target text; the higher the importance degree of a word, the more likely the word is to be chosen as a keyword. For example, the importance degree may be the TF-IDF value.
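As an illustrative sketch of the expansion (not part of the disclosure), the effective word sequence could be built as follows; the mapping from importance degree to expansion count (the base, scale, and rounding) is an assumption, since the embodiment only requires that the count be positively correlated with the importance degree:

```python
def effective_word_sequence(original_seq, keyword_importance, base=1, scale=3):
    """Build the effective word sequence of one target sentence by repeating
    each keyword according to its importance degree. `keyword_importance`
    maps keyword -> importance normalized to [0, 1]; each keyword occurrence
    is kept `base + round(scale * importance)` times, so the expansion count
    grows with importance. Non-keywords are kept as-is."""
    effective = []
    for word in original_seq:
        if word in keyword_importance:
            copies = base + round(scale * keyword_importance[word])
            effective.extend([word] * copies)   # e.g. [a3] -> [a3, a3, a3]
        else:
            effective.append(word)
    return effective
```

For instance, with keyword_importance = {"a3": 0.66}, the sequence [a1, a2, a3, a4, a5] from the example above expands to [a1, a2, a3, a3, a3, a4, a5].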
Step 104: determine the relatedness between each target sentence and the other target sentences according to the similarity between the effective word sequence of the target sentence and the effective word sequences of the other target sentences.
In the embodiment of the invention, the effective word sequence of a target sentence is the word sequence with expanded keyword counts, which can be used to represent the importance of the target sentence. The embodiment of the invention determines the similarity between two target sentences based on their effective word sequences, and the similarity may be used directly as the relatedness between the two; alternatively, the similarity may first be normalized, and the normalized similarity used as the relatedness. The Euclidean distance between the effective word sequences of the two target sentences can be used to measure the similarity between them; alternatively, word vectors corresponding to the two effective word sequences can be determined based on a word embedding model (such as word2vec), with the cosine similarity between the two word vectors serving as the similarity.
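A minimal sketch of one of the listed options, cosine similarity over bag-of-words counts of the effective word sequences, is given below (the function name and the plain count vectors are assumptions); because keywords were duplicated in the effective word sequences, they automatically contribute more to the similarity:

```python
import math
from collections import Counter

def relatedness(seq_i, seq_j):
    """Cosine similarity between the bag-of-words vectors of two effective
    word sequences; duplicated keywords raise their counts and hence their
    contribution. Returns a value in [0, 1] for non-negative counts."""
    ci, cj = Counter(seq_i), Counter(seq_j)
    dot = sum(ci[w] * cj[w] for w in ci.keys() & cj.keys())
    norm = math.sqrt(sum(v * v for v in ci.values())) * \
           math.sqrt(sum(v * v for v in cj.values()))
    return dot / norm if norm else 0.0
```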
Step 105: determine the influence weight of each target sentence according to the relatedness between the target sentence and the other target sentences, where the influence weight of a target sentence is used to represent the influence of the target sentence in the target text.
Step 106: form a text abstract of the target text based on the several target sentences with the highest influence weights.
In the embodiment of the invention, the relatedness between any two target sentences in the target text can be determined, and the overall relatedness situation characterizes how related each target sentence is to the others. The more important a target sentence is, the larger its influence (i.e., its influence weight) and the larger its relatedness to the other target sentences, so the influence weight of a target sentence can be determined based on its relatedness to the other target sentences. For example, the influence weight of each target sentence may be determined based on the conventional TextRank algorithm. After the influence weight of each target sentence is determined, the several target sentences with the highest influence weights can be used as the text abstract of the target text. For example, all target sentences are sorted by influence weight, and the top-ranked ones (for example, the top 1%) are used as the text abstract; alternatively, the target sentences whose influence weight is larger than a preset threshold are used as the text abstract.
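The selection in step 106 could be sketched as follows (illustrative only); both selection rules mentioned above, a top fraction and a preset threshold, are supported, and returning the picked sentences in document order is an assumption of the sketch:

```python
def select_summary(sentences, weights, fraction=0.01, threshold=None):
    """Form the text abstract from the highest-weight sentences. If
    `threshold` is given, keep sentences whose influence weight exceeds it;
    otherwise keep the top `fraction` (at least one sentence). The picked
    sentences are returned in their original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: -weights[i])
    if threshold is not None:
        picked = [i for i in ranked if weights[i] > threshold]
    else:
        picked = ranked[:max(1, int(len(sentences) * fraction))]
    return [sentences[i] for i in sorted(picked)]
```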
According to the text abstract generation method provided by the embodiment of the invention, the keywords in the target text are extracted first, and the number of keywords in each target sentence is then expanded to form an effective word sequence in which the keywords carry greater influence; the relatedness between target sentences is determined based on the effective word sequences, and the influence weights of the target sentences are determined in turn, so that the target sentences suitable for the text abstract can be selected. Because the keywords are extracted first and the keyword-expanded effective word sequences are then determined, the relatedness between target sentences needed for extracting the text abstract can be represented more accurately based on the effective word sequences, so the influence weights of the target sentences can be determined more accurately and the text abstract can be extracted more accurately. In addition, when the influence weights are determined based on the relatedness, no model needs to be trained, the result is not affected by training data, and the processing efficiency is higher.
Optionally, the embodiment of the invention improves the TextRank algorithm, and determines the influence weight based on the improved TextRank algorithm. Specifically, the step 105 "determining the influence weight of the target sentence according to the degree of correlation between the target sentence and the other target sentences" described above includes:
step B1: and iteratively executing multiple rounds of influence weight updating operation until the iteration ending condition is met, and taking the influence weight determined at the time of iteration ending as the influence weight of the corresponding target sentence.
Wherein the influence weight updating operation of each round is used for iteratively updating the influence weight of each target sentence. When the iteration turns reach a preset value (such as 200, etc.), the iteration is ended; or if the influence weight of each target sentence is converged, the iteration ending condition is also met, and the iteration is ended.
Specifically, in step B1, the influence weight updating operation of the kth round includes:
step B11: according to the kth-1 round of influence weight of the ith target sentence and the kth-1 round of influence weight of the jth target sentence, updating the correlation degree between the ith target sentence and the jth target sentence, and determining the kth round of correlation degree between the ith target sentence and the jth target sentence; the ith target sentence and the jth target sentence are any two target sentences in the target text; the k-th round of relevance between the ith target sentence and the jth target sentence is positive correlation with the k-1 th round of influence weight of the ith target sentence and the k-1 th round of influence weight of the jth target sentence.
Step B12: generate the k-th round adjacency matrix M_k, where an element of M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence.
Step B13: update the (k-1)-th round influence weight of each target sentence according to the k-th round adjacency matrix M_k, and determine the k-th round influence weight of each target sentence, where the k-th round influence weight of each target sentence satisfies:

TR_k = \frac{1-d}{n}\, r + d\, M_k^{\top}\, \mathrm{diag}(m_{k1}, \ldots, m_{kn})^{-1}\, TR_{k-1}

wherein n represents the total number of target sentences; TR_k(V_i) represents the k-th round influence weight of the i-th target sentence V_i, TR_{k-1}(V_i) represents its (k-1)-th round influence weight, and TR_k = (TR_k(V_1), TR_k(V_2), …, TR_k(V_n))^T; m_{ki} represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, …, n; d represents a preset damping coefficient, 0 < d < 1, e.g., d = 0.85; and r represents the n-dimensional column vector with all elements equal to 1, i.e., r = (1, 1, …, 1)^T.
In the embodiment of the invention, for any two target sentences in the target text, the relatedness of the current round (i.e., the k-th round) is updated based on the influence weights of the previous round (i.e., the (k-1)-th round), and the influence weights of the current round are then determined, realizing the update of the influence weights. Specifically, V_i denotes the i-th target sentence and V_j the j-th target sentence, where i and j are each positive integers from 1 to n and n denotes the total number of target sentences; TR denotes the influence weight, the (k-1)-th round influence weight of the i-th target sentence is denoted TR_{k-1}(V_i), and correspondingly the (k-1)-th round influence weight of the j-th target sentence is TR_{k-1}(V_j). The larger the (k-1)-th round influence weight TR_{k-1}(V_i) of the i-th target sentence, the more important it is, and the larger a relatedness can be set for it; similarly for the (k-1)-th round influence weight TR_{k-1}(V_j) of the j-th target sentence. That is, the k-th round relatedness between the i-th target sentence and the j-th target sentence is positively correlated with the (k-1)-th round influence weight of the i-th target sentence and with that of the j-th target sentence.
After the relatedness between any two target sentences has been updated, the adjacency matrix of the target text can be updated. In the embodiment of the invention, each element in the adjacency matrix represents the relatedness between two target sentences; specifically, for the k-th round, an element of the adjacency matrix M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence, and M_k is an n × n matrix. For example, with w^k_{ij} denoting the k-th round relatedness between the i-th and j-th target sentences, the adjacency matrix M_k can be expressed as:

M_k = \begin{pmatrix} w^k_{11} & w^k_{12} & \cdots & w^k_{1n} \\ w^k_{21} & w^k_{22} & \cdots & w^k_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w^k_{n1} & w^k_{n2} & \cdots & w^k_{nn} \end{pmatrix}

wherein the relatedness between any i-th target sentence and itself is 0, i.e., w^k_{ii} = 0. The embodiment of the invention also determines the sum m_{ki} of all elements of each row i of the adjacency matrix M_k, i.e., m_{ki} = \sum_{j=1}^{n} w^k_{ij}; for example, the sum of all elements of row 1 is m_{k1} = \sum_{j=1}^{n} w^k_{1j}. Then the k-th round influence weight of each target sentence can be updated and determined based on step B13, namely:

TR_k(V_i) = \frac{1-d}{n} + d \sum_{j=1}^{n} \frac{w^k_{ij}}{m_{kj}}\, TR_{k-1}(V_j)

wherein the k-th round influence weight of the i-th target sentence is denoted TR_k(V_i) and that of the j-th target sentence TR_k(V_j). In the embodiment of the invention, the (k-1)-th round influence weight of each target sentence is divided by the sum of all elements of the corresponding row, i.e., TR_{k-1}(V_j)/m_{kj}; therefore, even though the relatedness between the target sentences is updated, the convergence of the Markov process can always be guaranteed, and converged influence weights are finally obtained.
In addition, the initial relatedness between target sentences is the relatedness determined in step 104, and the initial influence weight of each target sentence is the average over all target sentences; for example, if the target text contains n target sentences, the initial influence weight of each target sentence is 1/n. That is, in the influence weight update operation of round 1, the (k-1)-th round influence weight of each target sentence is 1/n.
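Steps B11 to B13 can be sketched in matrix form as follows, under the update formula reconstructed above; the convergence tolerance and the default parameters are assumptions, and the `correction` helper (computing Δw_k) is sketched after the correction-term section below (passing correction=None yields the unmodified TextRank iteration):

```python
import numpy as np

def iterate_influence_weights(M0, correction=None, T=200, d=0.85, a=0.1, tol=1e-6):
    """Iterate steps B11-B13. `M0` is the initial n x n relatedness matrix
    from step 104 with a zero diagonal; influence weights start uniform at
    1/n. Each round optionally adds the correction Δw_k to the matrix, then
    applies TR_k(V_i) = (1-d)/n + d * Σ_j (w_ij / m_kj) * TR_{k-1}(V_j)."""
    n = M0.shape[0]
    M = M0.astype(float).copy()
    tr = np.full(n, 1.0 / n)                     # initial influence weights: 1/n
    for k in range(1, T + 1):
        if correction is not None:
            M = M + correction(tr, k, T, a)      # k-th round adjacency matrix M_k
            np.fill_diagonal(M, 0.0)             # a sentence is unrelated to itself
        m = M.sum(axis=1)                        # m_ki: sum of row i of M_k
        m[m == 0] = 1.0                          # guard against all-zero rows
        new_tr = (1 - d) / n + d * (M / m[:, None]).T @ tr
        if np.abs(new_tr - tr).max() < tol:      # convergence also ends the iteration
            return new_tr
        tr = new_tr
    return tr
```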
Optionally, the step B11 of "updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence" includes:
step B111: determining a relativity correction term delta w between the ith target sentence and the jth target sentence according to the kth-1 round influence weight of the ith target sentence and the kth-1 round influence weight of the jth target sentence k
Figure SMS_18
Wherein a is a preset coefficient, a is more than 0 and less than 0.5, k represents the current round, and T is a preset iteration total round; f () is a preset function, and
Figure SMS_19
representing and i-th target sentence V i The k-1 th round of impact weight TR k-1 (V i ) Is a function of positive correlation and +.>
Figure SMS_20
<1。
Step B112: adding a correction term Deltaw to the k-1 th round of relevance between the ith target sentence and the jth target sentence k And generating the kth round of relevance between the ith target sentence and the jth target sentence. For example, the number of the cells to be processed,
Figure SMS_21
In the embodiment of the invention, a correction term for the current round's relatedness is determined based on the previous round's influence weights of the target sentences, and the correction term is added to the previous round's relatedness, thereby updating the relatedness of the current round.
In the embodiment of the invention, f(TR_{k-1}(V_i)) f(TR_{k-1}(V_j)) represents the correction applied when updating the relatedness of two target sentences. Specifically, if the (k-1)-th round influence weights TR_{k-1}(V_i) and TR_{k-1}(V_j) of the i-th and j-th target sentences are both large, the relatedness between the two is considered higher and can be increased appropriately, yielding their k-th round relatedness; if the (k-1)-th round influence weights TR_{k-1}(V_i) and TR_{k-1}(V_j) are both small, the actual relatedness between the two may be larger, but it contributes less when the influence weights are determined; and if one of the (k-1)-th round influence weights TR_{k-1}(V_i), TR_{k-1}(V_j) is large while the other is small, the relatedness between the two is also small.
Furthermore, the embodiment of the invention sets a round-dependent adjustment coefficient for the correction term based on the sigmoid function, namely a/(1 + e^{T/2 - k}), where a is generally set to a small value, for example a = 0.1. In the embodiment of the invention, in the initial iteration stage (i.e., when k is small), the influence weights are not yet accurate, so the effect of the influence weights on the relatedness is kept small to avoid obtaining unsuitable or even wrong relatedness values; in the middle iteration stage, the effect of the influence weights on the relatedness is gradually increased, so that the convergence process can be accelerated while the relatedness is updated; and in the later iteration stage (i.e., when k is large, for example when k is close to the total number of rounds T), the effect of the influence weights on the relatedness is large and remains basically unchanged, which avoids the updates making the relatedness hard to converge at the end.
Since the influence weight itself is smaller than 1, the above preset function f(·) can satisfy:

f\!\left(TR_{k-1}(V_i)\right) = TR_{k-1}(V_i)

i.e., the correction term is determined directly based on the influence weight of the previous round.
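A sketch of the correction term under the reconstruction above, usable as the `correction` argument of iterate_influence_weights sketched earlier; the exact placement of the sigmoid argument (k - T/2) is an assumption consistent with the described early/middle/late behavior, and f is taken as the identity, i.e., the first preset option just described:

```python
import math
import numpy as np

def correction(tr_prev, k, T, a=0.1):
    """Relatedness correction Δw_k for round k: a sigmoid-in-k coefficient
    (near 0 early, growing in the middle of the run, near a late) multiplied
    by f(TR_{k-1}(V_i)) * f(TR_{k-1}(V_j)) with f the identity. Returns the
    full n x n matrix of corrections."""
    coeff = a / (1.0 + math.exp(T / 2.0 - k))    # a * sigmoid(k - T/2)
    return coeff * np.outer(tr_prev, tr_prev)    # (Δw_k)[i, j]
```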
Alternatively, the correction term may be determined based on the influence weights of the previous several rounds, i.e., the preset function f(·) satisfies:

f\!\left(TR_{k-1}(V_i)\right) = \sum_{u=1}^{L} \lambda_u\, TR_{k-L+u-1}(V_i)

In the embodiment of the invention, \lambda_u represents the weight given to the influence weight of the corresponding round; the influence weights of the previous several rounds are weighted and summed, and the weighted sum is taken as f(TR_{k-1}(V_i)), from which the correction term is then determined.
Specifically, the number of rounds L to look back is preset, i.e., the influence weights of the previous L rounds are used. At the k-th round, the influence weights at the (k-1)-th round (u = L), the (k-2)-th round (u = L-1), …, and the (k-L)-th round (u = 1) need to be determined; and the weights \lambda_u satisfy

\lambda_1 < \lambda_2 < \cdots < \lambda_L, \qquad \sum_{u=1}^{L} \lambda_u = 1

with u = 1, 2, …, L, \lambda_u increasing monotonically with u (e.g., as a power function of u). As described above, the influence weights become more accurate as the iterative process proceeds; the embodiment of the invention therefore sets a lower weight \lambda_u for a lower round (corresponding to a smaller u) and a larger weight \lambda_u for a higher round (corresponding to a larger u). Moreover, all the weights \lambda_u sum to 1.
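The multi-round variant of the preset function could be sketched as follows; the normalized powers of 2 used for λ_u are an assumption, one choice satisfying the stated constraints (monotonically increasing, summing to 1):

```python
import numpy as np

def f_weighted(tr_history, L=3):
    """Second preset option for f: a weighted sum of the previous L rounds'
    influence weight vectors (tr_history[-1] is round k-1, i.e. u = L).
    The weights lambda_u increase monotonically with u and sum to 1;
    normalized powers of 2 are used here as one admissible choice."""
    hist = tr_history[-L:]                       # rounds k-L .. k-1 (u = 1 .. L)
    lam = np.array([2.0 ** u for u in range(1, len(hist) + 1)])
    lam = lam / lam.sum()                        # enforce sum(lambda_u) == 1
    return sum(l * tr for l, tr in zip(lam, hist))
```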
The flow of the text abstract generation method is described in detail above. The method can also be implemented by a corresponding device, whose structure and function are described in detail below.
Based on the same inventive concept, the embodiment of the present invention further provides a text abstract generating device, as shown in fig. 2, including:
an acquisition module 21, configured to acquire a target text to be processed;
a keyword extraction module 22, configured to extract keywords in the target text;
the word sequence updating module 23 is configured to determine an original word sequence of a target sentence in the target text, and perform quantity expansion on corresponding keywords based on importance degrees of the keywords in the original word sequence with the keywords, so as to obtain an effective word sequence of the target sentence; the expansion quantity of the keywords and the importance degree of the keywords are in positive correlation;
a relevance determining module 24, configured to determine a relevance between the target sentence and other target sentences according to a similarity between the valid word sequence of the target sentence and valid word sequences of other target sentences;
an impact weight determining module 25, configured to determine an impact weight of the target sentence according to a correlation between the target sentence and other target sentences, where the impact weight of the target sentence is used to represent an impact of the target sentence in the target text;
the summarization module 26 is configured to form a text summary of the target text based on the target sentences with the highest impact weights.
Optionally, the influence weight determining module 25 determines the influence weight of the target sentence according to the relatedness between the target sentence and other target sentences, including:
iteratively executing multiple rounds of an influence weight update operation until an iteration end condition is met, and taking the influence weight determined when the iteration ends as the influence weight of the corresponding target sentence;
wherein the influence weight update operation of the k-th round includes:
updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence; the i-th target sentence and the j-th target sentence are any two target sentences in the target text; the k-th round relatedness between the i-th target sentence and the j-th target sentence is positively correlated with the (k-1)-th round influence weight of the i-th target sentence and with the (k-1)-th round influence weight of the j-th target sentence;
generating a k-th round adjacency matrix M_k, wherein an element of the adjacency matrix M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence;
updating the (k-1)-th round influence weight of each target sentence according to the k-th round adjacency matrix M_k, and determining the k-th round influence weight of each target sentence, wherein the k-th round influence weight of each target sentence satisfies:

TR_k = \frac{1-d}{n}\, r + d\, M_k^{\top}\, \mathrm{diag}(m_{k1}, \ldots, m_{kn})^{-1}\, TR_{k-1}

wherein n represents the total number of target sentences; TR_k(V_i) represents the k-th round influence weight of the i-th target sentence V_i, TR_{k-1}(V_i) represents its (k-1)-th round influence weight, and TR_k = (TR_k(V_1), TR_k(V_2), …, TR_k(V_n))^T; m_{ki} represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, …, n; d represents a preset damping coefficient, 0 < d < 1; and r represents the n-dimensional column vector with all elements equal to 1.
Optionally, the influence weight determining module 25 updates the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determines the k-th round relatedness between the i-th target sentence and the j-th target sentence, including:
determining a relatedness correction term Δw_k between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence:

\Delta w_k = \frac{a}{1 + e^{T/2 - k}}\; f\!\left(TR_{k-1}(V_i)\right) f\!\left(TR_{k-1}(V_j)\right)

wherein a is a preset coefficient, 0 < a < 0.5; k represents the current round, and T is the preset total number of iteration rounds; f(·) is a preset function, f(TR_{k-1}(V_i)) is positively correlated with the (k-1)-th round influence weight TR_{k-1}(V_i) of the i-th target sentence V_i, and f(TR_{k-1}(V_i)) < 1;
adding the correction term Δw_k to the (k-1)-th round relatedness between the i-th target sentence and the j-th target sentence to generate the k-th round relatedness between the i-th target sentence and the j-th target sentence.
Optionally, the preset function satisfies:

f\!\left(TR_{k-1}(V_i)\right) = TR_{k-1}(V_i);

or,

f\!\left(TR_{k-1}(V_i)\right) = \sum_{u=1}^{L} \lambda_u\, TR_{k-L+u-1}(V_i)

wherein the weights satisfy \lambda_1 < \lambda_2 < \cdots < \lambda_L and \sum_{u=1}^{L} \lambda_u = 1; u = 1, 2, …, L, L being a preset positive integer.
The embodiment of the present invention also provides a computer storage medium storing computer-executable instructions containing a program for executing the above-described text digest generation method, the computer-executable instructions being capable of executing the method of any of the above-described method embodiments.
The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MO), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), Solid State Disk (SSD), etc.).
Fig. 3 shows a block diagram of an electronic device according to another embodiment of the invention. The electronic device 1100 may be a host server with computing capabilities, a personal computer PC, or a portable computer or terminal that is portable, etc. The specific embodiments of the present invention are not limited to specific implementations of electronic devices.
The electronic device 1100 includes at least one processor 1110, a communication interface (Communications Interface) 1120, a memory 1130, and a bus 1140. Wherein processor 1110, communication interface 1120, and memory 1130 communicate with each other through bus 1140.
The communication interface 1120 is used to communicate with network elements including, for example, virtual machine management centers, shared storage, and the like.
The processor 1110 is used to execute programs. The processor 1110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
The memory 1130 is used to store executable instructions. Memory 1130 may include high-speed RAM memory or non-volatile memory, such as at least one magnetic disk memory. Memory 1130 may also be a memory array. Memory 1130 may also be partitioned into blocks, which may be combined into virtual volumes according to certain rules. The instructions stored in memory 1130 may be executed by processor 1110 to enable processor 1110 to perform the text abstract generation method in any of the method embodiments described above.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A text summary generation method, comprising:
acquiring a target text to be processed;
extracting keywords in the target text;
determining an original word sequence of each target sentence in the target text, and, for original word sequences that contain keywords, expanding the number of those keywords based on their importance degree to obtain an effective word sequence of the target sentence; the expansion count of a keyword is positively correlated with its importance degree;
determining the relatedness between each target sentence and the other target sentences according to the similarity between the effective word sequence of the target sentence and the effective word sequences of the other target sentences;
determining the influence weight of each target sentence according to the relatedness between the target sentence and the other target sentences, wherein the influence weight of a target sentence is used to represent the influence of the target sentence within the target text;
and forming a text abstract of the target text based on a plurality of target sentences with highest influence weights.
2. The method of claim 1, wherein the determining the influence weight of the target sentence according to the relatedness between the target sentence and other target sentences comprises:
iteratively executing multiple rounds of an influence weight update operation until an iteration end condition is met, and taking the influence weight determined when the iteration ends as the influence weight of the corresponding target sentence;
wherein the influence weight update operation of the k-th round includes:
updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence; the i-th target sentence and the j-th target sentence are any two target sentences in the target text; the k-th round relatedness between the i-th target sentence and the j-th target sentence is positively correlated with the (k-1)-th round influence weight of the i-th target sentence and with the (k-1)-th round influence weight of the j-th target sentence;
generating a k-th round adjacency matrix M_k, wherein an element of the adjacency matrix M_k represents the k-th round relatedness between the i-th target sentence and the j-th target sentence;
updating the (k-1)-th round influence weight of each target sentence according to the k-th round adjacency matrix M_k, and determining the k-th round influence weight of each target sentence, wherein the k-th round influence weight of each target sentence satisfies:

TR_k = \frac{1-d}{n}\, r + d\, M_k^{\top}\, \mathrm{diag}(m_{k1}, \ldots, m_{kn})^{-1}\, TR_{k-1}

wherein n represents the total number of target sentences; TR_k(V_i) represents the k-th round influence weight of the i-th target sentence V_i, TR_{k-1}(V_i) represents its (k-1)-th round influence weight, and TR_k = (TR_k(V_1), TR_k(V_2), …, TR_k(V_n))^T; m_{ki} represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, …, n; d represents a preset damping coefficient, 0 < d < 1; and r represents the n-dimensional column vector with all elements equal to 1.
3. The method according to claim 2, wherein the updating the relatedness between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence, and determining the k-th round relatedness between the i-th target sentence and the j-th target sentence, comprises:
determining a relatedness correction term Δw_k between the i-th target sentence and the j-th target sentence according to the (k-1)-th round influence weight of the i-th target sentence and the (k-1)-th round influence weight of the j-th target sentence:

\Delta w_k = \frac{a}{1 + e^{T/2 - k}}\; f\!\left(TR_{k-1}(V_i)\right) f\!\left(TR_{k-1}(V_j)\right)

wherein a is a preset coefficient, 0 < a < 0.5; k represents the current round, and T is the preset total number of iteration rounds; f(·) is a preset function, f(TR_{k-1}(V_i)) is positively correlated with the (k-1)-th round influence weight TR_{k-1}(V_i) of the i-th target sentence V_i, and f(TR_{k-1}(V_i)) < 1;
adding the correction term Δw_k to the (k-1)-th round relatedness between the i-th target sentence and the j-th target sentence to generate the k-th round relatedness between the i-th target sentence and the j-th target sentence.
4. A method according to claim 3, wherein the preset function satisfies:

f\!\left(TR_{k-1}(V_i)\right) = TR_{k-1}(V_i);

or,

f\!\left(TR_{k-1}(V_i)\right) = \sum_{u=1}^{L} \lambda_u\, TR_{k-L+u-1}(V_i)

wherein the weights satisfy \lambda_1 < \lambda_2 < \cdots < \lambda_L and \sum_{u=1}^{L} \lambda_u = 1; u = 1, 2, …, L, L being a preset positive integer.
5. A text digest generating apparatus, comprising:
the acquisition module is used for acquiring a target text to be processed;
the keyword extraction module is used for extracting keywords in the target text;
the word sequence updating module is used for determining an original word sequence of each target sentence in the target text and, for original word sequences that contain keywords, expanding the number of those keywords based on their importance degree to obtain an effective word sequence of the target sentence; the expansion count of a keyword is positively correlated with its importance degree;
the relevance determining module is used for determining the relatedness between each target sentence and other target sentences according to the similarity between the effective word sequence of the target sentence and the effective word sequences of other target sentences;
the influence weight determining module is used for determining the influence weight of each target sentence according to the relatedness between the target sentence and other target sentences, wherein the influence weight of a target sentence is used to represent the influence of the target sentence in the target text;
and the abstract module is used for forming a text abstract of the target text based on a plurality of target sentences with highest influence weights.
6. The apparatus of claim 5, wherein the impact weight determination module determines the impact weight of the target sentence based on a degree of correlation between the target sentence and other target sentences, comprising:
performing iterative execution of multiple rounds of influence weight updating operation until an iteration ending condition is met, and taking the influence weight determined at the end of the iteration as the influence weight of a corresponding target sentence;
wherein the influence weight update operation of the kth round includes:
according to the kth-1 round of influence weight of the ith target sentence and the kth-1 round of influence weight of the jth target sentence, updating the correlation degree between the ith target sentence and the jth target sentence, and determining the kth round of correlation degree between the ith target sentence and the jth target sentence; the ith target sentence and the jth target sentence are any two target sentences in the target text; the k-th round of correlation between the ith target sentence and the jth target sentence is in positive correlation with the k-1-th round of influence weight of the ith target sentence and the k-1-th round of influence weight of the jth target sentence;
generating a k-th round of adjacency matrix M k The adjacency matrix M k The element in (a) represents the kth round of relatedness between the ith target sentence and the jth target sentence;
according to the k-th round adjacency matrix M k Updating the k-1 th round of influence weight of each target sentence, and determining the k-1 th round of influence weight of each target sentence, wherein the k-1 th round of influence weight of each target sentence satisfies the following conditions:
Figure QLYQS_8
wherein n represents the total number of target sentences; TR_k = (TR_k(V_1), ..., TR_k(V_n))^T, in which TR_k(V_i) represents the kth-round influence weight of the ith target sentence V_i and TR_{k-1}(V_i) represents its (k-1)th-round influence weight; m_ki represents the sum of all elements of row i of the adjacency matrix M_k, i = 1, 2, ..., n; d represents a preset damping coefficient, with 0 < d < 1; and r represents an n-dimensional column vector whose elements are all 1.
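For illustration only, the following Python sketch applies the round-by-round update of claim 6. Since the claim only requires the kth-round correlation to be positively correlated with both sentences' (k-1)th-round influence weights, the multiplicative rescaling below is merely one assumed choice, and the convergence test standing in for the iteration ending condition is likewise hypothetical.

def update_influence_weights(base_corr, d=0.85, tol=1e-6, max_rounds=100):
    # base_corr: symmetric n x n matrix of sentence-to-sentence correlations
    # (for example, the similarities of the effective word sequences).
    n = len(base_corr)
    tr = [1.0 / n] * n                      # round-0 influence weights
    for _ in range(max_rounds):
        # kth-round adjacency matrix M_k: one assumed way to make the
        # correlation grow with both previous-round influence weights
        # is to rescale the base correlation by their product.
        M = [[base_corr[i][j] * tr[i] * tr[j] for j in range(n)]
             for i in range(n)]
        m = [sum(row) or 1.0 for row in M]  # m_ki: sum of row i of M_k
        new_tr = [(1 - d) + d * sum(M[j][i] * tr[j] / m[j] for j in range(n))
                  for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_tr, tr)) < tol:
            return new_tr                   # iteration ending condition met
        tr = new_tr
    return tr

Note that if the rescaling is dropped, so that M_k equals the base correlation matrix in every round, the loop reduces to a standard damped TextRank iteration.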
7. A computer storage medium, storing computer-executable instructions for performing the text abstract generation method of any one of claims 1-4.
8. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text abstract generation method of any one of claims 1-4.
CN202310347275.1A 2023-04-04 2023-04-04 Text abstract generation method and device, storage medium and electronic equipment Active CN116108165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347275.1A CN116108165B (en) 2023-04-04 2023-04-04 Text abstract generation method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116108165A true CN116108165A (en) 2023-05-12
CN116108165B CN116108165B (en) 2023-06-13

Family

ID=86254655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347275.1A Active CN116108165B (en) 2023-04-04 2023-04-04 Text abstract generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116108165B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100287162A1 (en) * 2008-03-28 2010-11-11 Sanika Shirwadkar method and system for text summarization and summary based query answering
CN102169501A (en) * 2011-04-26 2011-08-31 北京百度网讯科技有限公司 Method and device for generating abstract based on type information of document corresponding with searching result
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN110765771A (en) * 2019-09-17 2020-02-07 阿里巴巴集团控股有限公司 Method and device for determining advertisement statement
CN110837557A (en) * 2019-11-05 2020-02-25 北京声智科技有限公司 Abstract generation method, device, equipment and medium
CN112347241A (en) * 2020-11-10 2021-02-09 华夏幸福产业投资有限公司 Abstract extraction method, device, equipment and storage medium
CN114781355A (en) * 2022-03-14 2022-07-22 华南理工大学 News text abstract extraction method, system and medium
CN115186654A (en) * 2022-09-07 2022-10-14 太极计算机股份有限公司 Method for generating document abstract

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HYEONJIN LEE et al.: "A Brief Survey of Text Driven Image Generation and Manipulation", 2021 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), pages 1 - 2 *
ZHANG Feng: "Automatic Text Summarization System Based on Natural Language Processing", China Excellent Doctoral and Master's Theses Full-text Database (Master), Information Science and Technology, no. 12, pages 138 - 1020 *

Also Published As

Publication number Publication date
CN116108165B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
CN107644010B (en) Text similarity calculation method and device
US7743062B2 (en) Apparatus for selecting documents in response to a plurality of inquiries by a plurality of clients by estimating the relevance of documents
Zhao et al. Incorporating linguistic constraints into keyphrase generation
CN106407280B (en) Query target matching method and device
JP2019528502A (en) Method and apparatus for optimizing a model applicable to pattern recognition and terminal device
WO2021189951A1 (en) Text search method and apparatus, and computer device and storage medium
WO2021072850A1 (en) Feature word extraction method and apparatus, text similarity calculation method and apparatus, and device
Li et al. A generalized hidden markov model with discriminative training for query spelling correction
CN109582970B (en) Semantic measurement method, semantic measurement device, semantic measurement equipment and readable storage medium
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN112328735A (en) Hot topic determination method and device and terminal equipment
Iscen et al. Improving image recognition by retrieving from web-scale image-text data
CN116108165B (en) Text abstract generation method and device, storage medium and electronic equipment
CN113505196B (en) Text retrieval method and device based on parts of speech, electronic equipment and storage medium
CN115860009A (en) Sentence embedding method and system for introducing auxiliary samples for comparison learning
CN116303968A (en) Semantic search method, device, equipment and medium based on technical keyword extraction
WO2012134396A1 (en) A method, an apparatus and a computer-readable medium for indexing a document for document retrieval
JP2009116593A (en) Word vector generation device, word vector generation method, program, and recording medium with program recorded therein
CN112183117B (en) Translation evaluation method and device, storage medium and electronic equipment
Yu et al. Domain adaptation problem in sketch based image retrieval
Yin et al. Query-focused multi-document summarization based on query-sensitive feature space
Staš et al. Semantic indexing and document retrieval for personalized language modeling
CN105022836B (en) Compact depth CNN aspect indexing methods based on SIFT insertions
CN115409130B (en) Optimization method and system for updating classification labels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant