CN110162778B - Text abstract generation method and device - Google Patents

Text abstract generation method and device

Info

Publication number
CN110162778B
CN110162778B CN201910263357.1A
Authority
CN
China
Prior art keywords
sentences
compactness
sentence
specified
text
Prior art date
Legal status
Active
Application number
CN201910263357.1A
Other languages
Chinese (zh)
Other versions
CN110162778A (en)
Inventor
赵智源
周书恒
郭亚
黄同同
祝慧佳
Current Assignee
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201910263357.1A priority Critical patent/CN110162778B/en
Publication of CN110162778A publication Critical patent/CN110162778A/en
Application granted granted Critical
Publication of CN110162778B publication Critical patent/CN110162778B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

One or more embodiments of the present disclosure disclose a method and an apparatus for generating a text summary, so as to simply and quickly extract a text summary related to a personalized topic. The method comprises the following steps: acquiring a plurality of sentences in a target text; predicting a first closeness between each sentence and a specified topic according to a preset topic classification mode; determining the similarity between the sentences; correcting the first closeness at least once with a specified algorithm according to the similarity between the sentences to obtain a second closeness between each sentence and the specified topic; and processing each sentence according to the second closeness to obtain a text summary of the target text that is related to the specified topic.

Description

Text abstract generation method and device
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a method and an apparatus for generating a text abstract.
Background
In the field of summary generation for long text, there are two mainstream approaches to automatic text summarization: one is extractive, based on TextRank, which looks for one or several sentences in the original text that are closest to its central idea; the other is abstractive (generative), based on a deep neural network, in which the computer "reads" the original text and generates a summary. The latter approach often requires a large amount of corpus labeling and is therefore costly.
In addition to conventional general-purpose summaries (typically generated with TextRank), there are scenarios in which a summary related to a given topic is desired, such as personalized summary generation for users or risk summary generation for content review. In such cases, TextRank cannot extract a summary related to the target topic, because it mainly extracts the central sentences of the entire content. Although the abstractive approach based on deep neural networks could address this through personalized labeling, the labeling cost for summaries is high, so it cannot be applied conveniently. Current summary generation methods therefore cannot be used well for generating topic-related summaries.
Disclosure of Invention
The purpose of one or more embodiments of the present disclosure is to provide a method and an apparatus for generating a text abstract, so as to simply and quickly extract a text abstract related to a personalized theme.
To solve the above technical problems, one or more embodiments of the present specification are implemented as follows:
in one aspect, one or more embodiments of the present disclosure provide a method for generating a text abstract, including:
Acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second closeness to obtain a text abstract of the target text, which is related to the specified subject.
In one embodiment, predicting the first closeness between each sentence and the specified topic according to the preset topic classification mode includes:
determining a subject word corresponding to the specified subject;
analyzing the sentence to determine the number of the subject words and/or related words of the subject words contained in the sentence;
predicting a first compactness between the sentence and the specified topic based on the number; wherein the number is positively correlated with the first closeness.
In one embodiment, the specified algorithm is a Pagerank algorithm;
correspondingly, the correcting the first compactness at least once by using a specified algorithm according to the similarity between sentences comprises the following steps:
Respectively taking each sentence as a node, and creating a relation network diagram between the nodes according to the similarity between the sentences;
determining each first compactness as initial weight corresponding to each sentence;
and according to the relation network diagram, performing at least one iteration on the initial weight by using the Pagerank algorithm to obtain the final weight corresponding to each sentence.
In one embodiment, the creating a relationship network graph between the nodes according to the similarity between the sentences includes:
when the similarity between two sentences reaches a preset threshold value, determining that an edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph between the nodes according to the edges between the nodes.
In one embodiment, the obtaining a plurality of sentences in the target text includes:
and splitting the target text according to the appointed punctuation marks to obtain the sentences.
In one embodiment, the processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject includes:
sorting and splicing the sentences in descending order of the second closeness to obtain an ordered text;
and determining the ordered text as a text abstract of the target text, which is related to the specified theme.
In one embodiment, the sorting and splicing of the sentences in descending order of the second closeness includes:
screening out first sentences whose corresponding second closeness reaches a second preset threshold;
and sorting and splicing the first sentences in descending order of their corresponding second closeness.
In another aspect, one or more embodiments of the present disclosure provide a text abstract generating apparatus, including:
the acquisition module is used for acquiring a plurality of sentences in the target text;
the prediction and determination module is used for predicting the first compactness between each sentence and the appointed theme according to a preset theme classification mode; and determining the similarity between the sentences;
the correction module is used for correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between the sentences and the specified subject;
And the processing module is used for processing each sentence according to the second compactness to obtain a text abstract of the target text, which is related to the specified theme.
In one embodiment, the prediction and determination module comprises:
a first determining unit, configured to determine a subject word corresponding to the specified subject;
the analysis unit is used for analyzing the sentences to determine the number of the subject words and/or the related words of the subject words contained in the sentences;
a prediction unit configured to predict a first compactness between the sentence and the specified topic according to the number; wherein the number is positively correlated with the first closeness.
In one embodiment, the specified algorithm is a Pagerank algorithm;
correspondingly, the correction module comprises:
the creating unit is used for respectively taking each sentence as a node and creating a relation network diagram among the nodes according to the similarity among the sentences;
a second determining unit, configured to determine each of the first affinities as an initial weight corresponding to each of the sentences;
and the iteration unit is used for carrying out at least one iteration on the initial weight by utilizing the Pagerank algorithm according to the relation network diagram to obtain the final weight corresponding to each sentence.
In an embodiment, the creation unit is further configured to:
when the similarity between two sentences reaches a preset threshold value, determining that an edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph between the nodes according to the edges between the nodes.
In one embodiment, the acquisition module includes:
and the splitting unit is used for splitting the target text according to the appointed punctuation marks to obtain the sentences.
In one embodiment, the processing module comprises:
the sorting and splicing unit is used for sorting and splicing the sentences in descending order of the second closeness to obtain an ordered text;
and a third determining unit, configured to determine that the ordered text is a text abstract of the target text related to the specified subject.
In one embodiment, the sorting and splicing unit is further configured to:
screen out first sentences whose corresponding second closeness reaches a second preset threshold;
and sort and splice the first sentences in descending order of their corresponding second closeness.
In still another aspect, one or more embodiments of the present specification provide a text excerpt generating apparatus, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second closeness to obtain a text abstract of the target text, which is related to the specified subject.
In yet another aspect, embodiments of the present application provide a storage medium storing computer-executable instructions that, when executed, implement the following:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between the sentences;
Correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second closeness to obtain a text abstract of the target text, which is related to the specified subject.
With the technical solution of one or more embodiments of the present specification, a plurality of sentences in a target text are acquired, and a first closeness between each sentence and a specified topic is predicted according to a preset topic classification mode, giving an initial (rough) closeness between each sentence and the specified topic. The first closeness is then corrected with a specified algorithm according to the similarity between sentences, yielding a second, more accurate closeness between each sentence and the specified topic. Each sentence is then processed according to the second closeness to obtain a text summary related to the specified topic. Because the summary is obtained based on the closeness between each sentence of the target text and the specified topic, the resulting summary necessarily conforms closely to the specified topic, achieving personalized summary generation: summaries with different topic preferences can be extracted from the same text. In addition, this solution requires no additional labeling of the text and is an unsupervised algorithm, saving a great deal of labeling cost.
Drawings
To more clearly illustrate one or more embodiments of the present specification or the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some of the embodiments of the present specification; a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method of generating a text excerpt according to an embodiment of the present specification;
FIG. 2 is a diagram of a relationship network between sentence nodes according to one embodiment of the present disclosure;
FIG. 3 is a schematic block diagram of a text excerpt generation device according to an embodiment of the present specification;
fig. 4 is a schematic block diagram of a text excerpt generating device according to an embodiment of the present specification.
Detailed Description
One or more embodiments of the present disclosure provide a method and an apparatus for generating a text summary, so as to simply and quickly extract a text summary related to a personalized theme.
To enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, those solutions will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of the present specification without creative effort shall fall within the protection scope of this document.
Fig. 1 is a schematic flowchart of a method for generating a text excerpt according to an embodiment of the present specification, as shown in fig. 1, the method includes:
s102, acquiring a plurality of sentences in the target text.
S104, predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between sentences.
S106, correcting the first compactness at least once by using a specified algorithm according to the similarity between each sentence, and obtaining a second compactness between each sentence and a specified subject.
S108, processing each sentence according to the second compactness to obtain a text abstract of the target text, which is related to the specified subject.
With the technical scheme of one or more embodiments of the present specification, a plurality of sentences in a target text are acquired, and a first closeness between each sentence and a specified topic is predicted according to a preset topic classification mode, giving an initial (rough) closeness between each sentence and the specified topic. The first closeness is then corrected with a specified algorithm according to the similarity between sentences, yielding a second, more accurate closeness between each sentence and the specified topic. Each sentence is then processed according to the second closeness to obtain a text summary related to the specified topic. Because the summary is obtained based on the closeness between each sentence of the target text and the specified topic, the resulting summary necessarily conforms closely to the specified topic, achieving personalized summary generation: summaries with different topic preferences can be extracted from the same text. In addition, this scheme requires no additional labeling of the text and is an unsupervised algorithm, saving a great deal of labeling cost.
The method for generating the text abstract provided by the above embodiment is described in detail below.
First, a plurality of sentences in a target text is acquired. In this step, the target text may be split according to specified punctuation marks, thereby obtaining a plurality of sentences. For example, if the specified punctuation marks include the period "。" and the semicolon "；", the target text is split at each occurrence of "。" and "；" to obtain a plurality of sentences.
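The splitting step above can be sketched in a few lines of Python; the delimiter set and function name are illustrative assumptions, not part of the patent:

```python
import re

def split_sentences(text, delimiters="。；"):
    """Split text into sentences on the specified punctuation marks.

    The default delimiter set (Chinese full stop and semicolon) follows
    the example in the text; it is configurable.
    """
    parts = re.split("[" + re.escape(delimiters) + "]", text)
    # Drop empty fragments produced by trailing punctuation.
    return [p.strip() for p in parts if p.strip()]
```

For Western-style text, the same function can be called with `delimiters=".;"`.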
After a plurality of sentences in the target text are acquired, a first closeness between each sentence and the specified subject can be predicted according to a preset subject classification mode. The topic classification mode may include a topic model or topic classification rules. The specified subject can be a subject related to the text content of any target text, for example, various subjects such as sports, food and the like.
The topic model is a statistical model for clustering the implicit semantic structures of the corpus in an unsupervised learning manner, and can perform semantic analysis and text mining on each sentence in the text, such as collecting, classifying, dimension reducing and the like on the text according to topics.
In one embodiment, a first closeness between sentences and a specified topic may be predicted using any existing topic model. For example, if the specified topic is sports, then when the topic model analyzes the target text, it can detect any sports-related information in each sentence, including basketball, swimming, running, fitness, and so on, and then predict the first closeness between each sentence and the specified topic from that information. The higher the frequency of sports-related words in a sentence, the higher the first closeness between the sentence and the specified topic.
In another embodiment, a first closeness between a sentence and a specified topic may be predicted using a preset topic classification rule. The topic classification rules may include the following:
first, a subject word corresponding to a specified subject is determined.
The subject words may include keywords related to the specified topic. For example, if the specified topic is sports, the subject words may include keywords such as "sports" and "athletics". The subject words can be set according to the user's actual requirements: to generate a text summary related to the general category of sports, the subject words may be set to "sports" and "athletics"; to generate a text summary covering all topics related to sports, more refined keywords may be set, such as "sports", "basketball", "badminton", and "swimming".
Secondly, the sentence is analyzed to determine the number of subject words and/or related words of the subject words contained in the sentence.
The related words of a subject word may include its synonyms and near-synonyms. For example, when the subject word is "food", words with similar meanings such as "cuisine" and "dishes" contained in a sentence can be mined when analyzing the sentence.
Thirdly, a first closeness between the sentence and the specified topic is predicted from the number of subject words and/or related words of the subject words contained in the sentence, where that number is positively correlated with the first closeness. That is, the more subject words and/or related words of the subject words a sentence contains, the higher the first closeness between the sentence and the specified topic.
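The three rule steps above can be sketched as follows. Squashing the raw count into a (0, 1) score is an assumption, since the patent only requires that the count be positively correlated with the first closeness:

```python
def first_closeness(sentence, subject_words, related_words=()):
    """Predict a first closeness from keyword counts.

    Counts occurrences of subject words and their related words in the
    sentence; the count is positively correlated with the returned score.
    The count/(count+1) mapping to (0, 1) is an illustrative assumption.
    """
    vocab = list(subject_words) + list(related_words)
    count = sum(sentence.count(w) for w in vocab)
    return count / (count + 1)
```

The returned value can then be used directly as the initial node weight in the iteration described below.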
Then the similarity between sentences is determined. In one embodiment, any existing similarity algorithm may be employed to calculate the similarity between sentences. For example, the Levenshtein distance (i.e., the string edit distance) between two sentences can serve as the basis for their similarity.
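A minimal sketch of this similarity computation, assuming the classic dynamic-programming edit distance; the length normalization that turns the distance into a similarity in [0, 1] is an assumption, as the patent only names the Levenshtein distance:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # Deletion, insertion, or substitution (free if chars match).
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Length-normalized similarity derived from the edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```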
After determining the similarity between sentences, the predicted first closeness may be modified using a specified algorithm.
In one embodiment, the specified algorithm is the Pagerank algorithm. Based on this, the first compactness may be modified at least once in the following way:
first, each sentence is used as a node, and a relationship network diagram between the nodes is created according to the similarity between the sentences.
In this step, when the relational network graph is created, it may be determined whether an edge between each sentence node needs to be created according to the similarity between each sentence. Specifically, when the similarity between two sentences reaches a preset threshold, determining that an edge exists between two nodes corresponding to the two sentences, and then creating a relationship network diagram between the sentence nodes according to the edge between the sentence nodes.
FIG. 2 illustrates a network of relationships between sentence nodes in a specific embodiment. As shown in fig. 2, the relationship network diagram includes A, B, C sentence nodes, where, since the similarity between the sentence a and the sentence B and between the sentence a and the sentence C all reach the preset threshold, there is an edge between the corresponding sentence node a and the sentence node B, and there is an edge between the sentence node a and the sentence node C; since the similarity between the sentences B and C does not reach the preset threshold, there is no edge between the corresponding sentence node B and sentence node C.
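The edge-creation rule illustrated by Fig. 2 can be sketched as an adjacency structure; the threshold value and the pluggable similarity function are illustrative assumptions:

```python
def build_graph(sentences, sim, threshold=0.3):
    """Create the relationship network graph between sentence nodes.

    An edge exists between two nodes when the similarity of their
    sentences reaches the preset threshold.
    """
    n = len(sentences)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim(sentences[i], sentences[j]) >= threshold:
                edges[i].add(j)
                edges[j].add(i)
    return edges
```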
And secondly, determining each first compactness as an initial weight corresponding to each sentence.
And thirdly, according to the relation network diagram, carrying out at least one iteration on the initial weight by using the Pagerank algorithm to obtain the final weight corresponding to each sentence.
In this step, the iteration can be performed by multiplying the current weight vector by an iteration matrix, as shown in formula (1), where A is the iteration matrix and P_n is the weight vector obtained by the n-th iteration. Iteration stops when the difference between successive weights for each sentence is smaller than a preset threshold; the weights obtained at that point are the final weights corresponding to each sentence.
P_{n+1} = A · P_n    (1)
And the final weight corresponding to each sentence is the second compactness between each sentence and the appointed theme.
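A minimal sketch of this weight iteration, seeded with the first-closeness values. The patent only states the update P_{n+1} = A·P_n; the damping factor that re-injects the initial weights (a personalized-PageRank-style choice) and the convergence tolerance are assumptions:

```python
def iterate_weights(edges, init, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate node weights over the sentence graph until convergence.

    `edges` maps each node to its neighbor set; `init` holds the first
    closeness of each sentence, used both as the starting weights and as
    the personalization term.
    """
    n = len(init)
    w = list(init)
    for _ in range(max_iter):
        new = []
        for i in range(n):
            # Each neighbor spreads its weight evenly over its own edges.
            incoming = sum(w[j] / max(len(edges[j]), 1) for j in edges[i])
            new.append((1 - damping) * init[i] + damping * incoming)
        done = max(abs(a - b) for a, b in zip(new, w)) < tol
        w = new
        if done:
            break
    return w  # final weights = second closeness of each sentence
```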
After obtaining the second closeness between each sentence and the specified topic, each sentence may be processed according to the second closeness to obtain a text summary of the target text that is related to the specified topic.
In one embodiment, the sentences can be sorted and spliced in descending order of the second closeness to obtain an ordered text; the ordered text is then determined to be a text summary of the target text related to the specified topic.
In one embodiment, first sentences whose second closeness reaches a second preset threshold can be screened out, and these first sentences are then sorted and spliced in descending order of their corresponding second closeness to obtain a text summary of the target text related to the specified topic.
In this embodiment, only sentences whose second closeness reaches a certain threshold are kept and spliced, so the resulting text summary matches the specified topic even more closely.
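The screening, sorting, and splicing steps above can be sketched as follows; the threshold value and the joining delimiter are illustrative assumptions:

```python
def summarize(sentences, closeness, threshold=0.2):
    """Build the topic-related summary from second-closeness scores.

    Keeps sentences whose second closeness reaches the threshold, then
    concatenates them in descending order of closeness.
    """
    picked = [(c, s) for s, c in zip(sentences, closeness) if c >= threshold]
    picked.sort(key=lambda t: t[0], reverse=True)
    return "。".join(s for _, s in picked)
```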
In summary, particular embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.
Based on the same idea as the method for generating a text abstract described above, one or more embodiments of the present specification further provide a text abstract generating apparatus.
Fig. 3 is a schematic block diagram of a text abstract generating apparatus according to an embodiment of the present specification. As shown in fig. 3, the text abstract generating apparatus 300 includes:
an obtaining module 310, configured to obtain a plurality of sentences in the target text;
a prediction and determination module 320, configured to predict a first compactness between each sentence and a specified topic according to a preset topic classification manner; and, determining the similarity between sentences;
the correction module 330 is configured to correct the first compactness at least once by using a specified algorithm according to the similarity between each sentence, so as to obtain a second compactness between each sentence and a specified topic;
and the processing module 340 is configured to process each sentence according to the second closeness to obtain a text abstract of the target text related to the specified subject.
In one embodiment, the prediction and determination module 320 includes:
a first determining unit, configured to determine a subject word corresponding to a specified subject;
the analysis unit is used for analyzing the sentences to determine the number of related words containing the subject words and/or the subject words in the sentences;
A prediction unit for predicting a first compactness between sentences and specified topics according to the number; wherein the number is positively correlated with the first closeness.
In one embodiment, the designated algorithm is the Pagerank algorithm;
accordingly, the correction module 330 includes:
the creating unit is used for respectively taking each sentence as a node and creating a relation network diagram among the nodes according to the similarity among the sentences;
the second determining unit is used for determining that each first compactness is an initial weight corresponding to each sentence respectively;
and the iteration unit is used for carrying out at least one iteration on the initial weight by using the Pagerank algorithm according to the relation network diagram to obtain the final weight corresponding to each sentence.
In an embodiment, the creation unit is further for:
when the similarity between two sentences reaches a preset threshold value, determining that one edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph among the nodes according to the edges among the nodes.
In one embodiment, the acquisition module 310 includes:
and the splitting unit is used for splitting the target text according to the appointed punctuation marks to obtain a plurality of sentences.
In one embodiment, the processing module 340 includes:
the sorting and splicing unit is used for sorting and splicing the sentences in descending order of the second closeness to obtain an ordered text;
and a third determining unit for determining that the ordered text is the text abstract of the target text and related to the specified subject.
In one embodiment, the sorting and stitching unit is further configured to:
screening out a first sentence corresponding to the second compactness reaching a second preset threshold value;
and sorting and splicing the first sentences in descending order of their corresponding second closeness.
With the apparatus of one or more embodiments of the present disclosure, a plurality of sentences in a target text are acquired, and a first closeness between each sentence and a specified topic is predicted according to a preset topic classification mode, giving an initial (rough) closeness between each sentence and the specified topic. The first closeness is then corrected with a specified algorithm according to the similarity between sentences, yielding a second, more accurate closeness between each sentence and the specified topic. Each sentence is then processed according to the second closeness to obtain a text summary related to the specified topic. Because the summary is obtained based on the closeness between each sentence of the target text and the specified topic, the resulting summary necessarily conforms closely to the specified topic, achieving personalized summary generation: summaries with different topic preferences can be extracted from the same text. In addition, this scheme requires no additional labeling of the text and is an unsupervised algorithm, saving a great deal of labeling cost.
It should be understood by those skilled in the art that the above text abstract generating apparatus can be used to implement the text abstract generating method described above; its detailed description is similar to that of the method section and is not repeated here to avoid redundancy.
Based on the same idea, one or more embodiments of the present disclosure further provide a text abstract generating device, as shown in fig. 4. The text abstract generating device may differ considerably in configuration or performance, and may include one or more processors 401 and a memory 402, where the memory 402 may store one or more application programs or data. The memory 402 may be transient storage or persistent storage. An application program stored in the memory 402 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the text abstract generating device. Still further, the processor 401 may be arranged to communicate with the memory 402 and execute the series of computer-executable instructions in the memory 402 on the text abstract generating device. The text abstract generating device may also include one or more power supplies 403, one or more wired or wireless network interfaces 404, one or more input/output interfaces 405, and one or more keyboards 406.
In particular, in this embodiment, the text abstract generating device includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instructions for the text abstract generating device, and the one or more processors are configured to execute the one or more programs, including computer-executable instructions for:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject.
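The similarity-determination step above is not pinned to a particular measure by the embodiments. As one hedged sketch, a TextRank-style word-overlap similarity could be used; the whitespace tokenization and log-length normalization below are illustrative assumptions, not part of this disclosure:

```python
import math

def sentence_similarity(s1: str, s2: str) -> float:
    """Word-overlap similarity between two sentences (TextRank-style).

    Assumption: whitespace tokenization and log-length normalization;
    the disclosure itself does not fix a similarity formula.
    """
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0  # too short to normalize by log length
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

print(sentence_similarity("the cat sat on the mat", "the cat ate fish"))
```

Any other pairwise measure (for example, cosine similarity over sentence vectors) could be substituted without changing the surrounding steps.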
Optionally, the computer executable instructions, when executed, may further cause the processor to:
determining a subject word corresponding to the specified subject;
analyzing the sentence to determine the number of the subject words and/or related words of the subject words contained in the sentence;
predicting a first compactness between the sentence and the specified topic based on the number; wherein the number is positively correlated with the first closeness.
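The counting step above can be sketched as follows. The division by sentence length, which keeps scores in [0, 1], is an added assumption; the embodiments only require the count to be positively correlated with the first compactness:

```python
def first_compactness(sentence: str, subject_words, related_words=()) -> float:
    """Predict a rough first compactness between a sentence and a specified
    subject by counting the subject words and/or their related words it
    contains; more hits mean a higher first compactness.

    Assumption: whitespace tokenization and length normalization.
    """
    tokens = sentence.lower().split()
    vocab = {w.lower() for w in subject_words} | {w.lower() for w in related_words}
    hits = sum(1 for t in tokens if t in vocab)
    return hits / max(len(tokens), 1)
```

For example, with subject word "power" and related words "solar" and "wind", the sentence "Solar power and wind power" scores 4 hits out of 5 tokens.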
Optionally, the specified algorithm is a Pagerank algorithm;
accordingly, the computer-executable instructions, when executed, may further cause the processor to:
respectively taking each sentence as a node, and creating a relation network diagram between the nodes according to the similarity between the sentences;
determining each first compactness as initial weight corresponding to each sentence;
and according to the relation network diagram, performing at least one iteration on the initial weight by using the Pagerank algorithm to obtain the final weight corresponding to each sentence.
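The correction step above amounts to a plain PageRank iteration over a sentence-similarity graph. The edge rule (connect two sentences when their similarity reaches a threshold) follows the next embodiment; using the normalized first compactness values both as initial weights and as the teleport distribution is an assumption, since the embodiments only state that the initial weights are iterated:

```python
def second_compactness(sentences, first_scores, similarity,
                       threshold=0.1, d=0.85, iterations=50):
    """Correct first compactness values with a PageRank-style iteration
    over a sentence-similarity graph, yielding the second compactness.

    Each sentence is a node; an edge joins two nodes whose similarity
    reaches `threshold`. Assumption: normalized first compactness values
    serve both as initial weights and as the teleport term.
    """
    n = len(sentences)
    # Relation network graph: neighbor lists from the similarity threshold.
    adj = [[j for j in range(n)
            if j != i and similarity(sentences[i], sentences[j]) >= threshold]
           for i in range(n)]
    total = sum(first_scores) or 1.0
    base = [s / total for s in first_scores]  # normalized initial weights
    weights = base[:]
    for _ in range(iterations):
        weights = [(1 - d) * base[i]
                   + d * sum(weights[j] / len(adj[j]) for j in adj[i])
                   for i in range(n)]
    return weights
```

A sentence with no sufficiently similar neighbors keeps only its teleport share, so off-topic outliers are damped while mutually supporting on-topic sentences reinforce each other.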
Optionally, the computer executable instructions, when executed, may further cause the processor to:
when the similarity between two sentences reaches a preset threshold value, determining that an edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph between the nodes according to the edges between the nodes.
Optionally, the computer executable instructions, when executed, may further cause the processor to:
and splitting the target text according to the specified punctuation marks to obtain the sentences.
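The splitting step can be sketched with a regular expression. The particular punctuation set below (Chinese and ASCII sentence enders) is only an illustrative assumption; the embodiments leave the marks configurable:

```python
import re

def split_sentences(text: str, marks: str = r"[。！？!?.;；\n]") -> list:
    """Split the target text into sentences at the specified punctuation
    marks, discarding empty fragments."""
    return [part.strip() for part in re.split(marks, text) if part.strip()]
```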
Optionally, the computer executable instructions, when executed, may further cause the processor to:
sorting and splicing all sentences in descending order of the second compactness to obtain an ordered text;
and determining the ordered text as a text abstract of the target text related to the specified theme.
Optionally, the computer executable instructions, when executed, may further cause the processor to:
screening out first sentences whose corresponding second compactness reaches a second preset threshold;
and sorting and splicing the first sentences in descending order of their corresponding second compactness.
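The screening, sorting and splicing steps above amount to a threshold filter followed by a descending sort and concatenation. The separator string and the example threshold value are assumptions:

```python
def build_abstract(sentences, second_scores, min_score=0.2, sep=" "):
    """Assemble the subject-specific text abstract: keep sentences whose
    second compactness reaches the preset threshold, order them from the
    largest to the smallest compactness, and splice them together.
    """
    kept = sorted(
        ((score, s) for s, score in zip(sentences, second_scores)
         if score >= min_score),
        key=lambda pair: pair[0], reverse=True)
    return sep.join(s for _, s in kept)
```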
One or more embodiments of the present specification also provide a computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by an electronic device comprising a plurality of application programs, enable the electronic device to perform the text abstract generating method described above, and in particular to perform:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing one or more embodiments of the present description.
One skilled in the art will appreciate that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
One or more embodiments of the present specification are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant points, reference may be made to the partial description of the method embodiments.
The foregoing description is merely one or more embodiments of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of one or more embodiments of the present disclosure, are intended to be included within the scope of the claims of one or more embodiments of the present disclosure.

Claims (16)

1. A text abstract generation method, comprising:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode, wherein the first compactness is determined by the number of topic words and/or related words of the topic words corresponding to the topic classification mode, and the number is positively correlated with the first compactness; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject.
2. The method of claim 1, wherein predicting the first compactness between each sentence and the specified topic according to the preset topic classification mode comprises:
determining a subject word corresponding to the specified subject;
analyzing the sentence to determine the number of the subject words and/or related words of the subject words contained in the sentence;
and predicting a first compactness between the sentence and the specified topic according to the number.
3. The method of claim 1, the specified algorithm being a Pagerank algorithm;
correspondingly, the correcting the first compactness at least once by using a specified algorithm according to the similarity between sentences comprises the following steps:
respectively taking each sentence as a node, and creating a relation network diagram between the nodes according to the similarity between the sentences;
determining each first compactness as initial weight corresponding to each sentence;
and according to the relation network diagram, performing at least one iteration on the initial weight by using the Pagerank algorithm to obtain the final weight corresponding to each sentence.
4. The method of claim 3, wherein the creating a relation network graph between the nodes according to the similarity between the sentences comprises:
when the similarity between two sentences reaches a preset threshold value, determining that an edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph between the nodes according to the edges between the nodes.
5. The method of claim 1, wherein the obtaining a plurality of sentences in the target text comprises:
and splitting the target text according to the specified punctuation marks to obtain the sentences.
6. The method of claim 1, wherein the processing each sentence according to the second compactness to obtain a text summary of the target text related to the specified topic comprises:
sorting and splicing all sentences in descending order of the second compactness to obtain an ordered text;
and determining the ordered text as a text abstract of the target text related to the specified theme.
7. The method of claim 6, wherein the sorting and splicing of the sentences in descending order of the second compactness comprises:
screening out first sentences whose corresponding second compactness reaches a second preset threshold;
and sorting and splicing the first sentences in descending order of their corresponding second compactness.
8. A text abstract generating device, comprising:
The acquisition module is used for acquiring a plurality of sentences in the target text;
the prediction and determination module is used for predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode, wherein the first compactness is determined by the number of topic words and/or related words of the topic words corresponding to the topic classification mode, and the number is positively correlated with the first compactness; and determining the similarity between the sentences;
the correction module is used for correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between the sentences and the specified subject;
and the processing module is used for processing each sentence according to the second compactness to obtain a text abstract of the target text, which is related to the specified theme.
9. The apparatus of claim 8, wherein the prediction and determination module comprises:
a first determining unit, configured to determine a subject word corresponding to the specified subject;
the analysis unit is used for analyzing the sentences to determine the number of the subject words and/or the related words of the subject words contained in the sentences;
A prediction unit for predicting a first compactness between the sentence and the specified topic based on the number.
10. The apparatus of claim 8, the specified algorithm is a Pagerank algorithm;
correspondingly, the correction module comprises:
the creating unit is used for respectively taking each sentence as a node and creating a relation network diagram among the nodes according to the similarity among the sentences;
a second determining unit, configured to determine each of the first compactnesses as an initial weight corresponding to each sentence;
and the iteration unit is used for carrying out at least one iteration on the initial weight by utilizing the Pagerank algorithm according to the relation network diagram to obtain the final weight corresponding to each sentence.
11. The apparatus of claim 10, wherein the creating unit is further configured to:
when the similarity between two sentences reaches a preset threshold value, determining that an edge exists between two nodes corresponding to the two sentences;
and creating a relation network graph between the nodes according to the edges between the nodes.
12. The apparatus of claim 8, the acquisition module comprising:
and the splitting unit is used for splitting the target text according to the specified punctuation marks to obtain the sentences.
13. The apparatus of claim 8, the processing module comprising:
the sorting and splicing unit is used for sorting and splicing the sentences in descending order of the second compactness to obtain an ordered text;
and a third determining unit, configured to determine the ordered text as a text abstract of the target text related to the specified subject.
14. The apparatus of claim 13, wherein the sorting and splicing unit is further configured to:
screen out first sentences whose corresponding second compactness reaches a second preset threshold;
and sort and splice the first sentences in descending order of their corresponding second compactness.
15. A text excerpt generation device, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode, wherein the first compactness is determined by the number of topic words and/or related words of the topic words corresponding to the topic classification mode, and the number is positively correlated with the first compactness; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject.
16. A storage medium storing computer-executable instructions that when executed implement the following:
acquiring a plurality of sentences in a target text;
predicting a first compactness between each sentence and a specified topic according to a preset topic classification mode, wherein the first compactness is determined by the number of topic words and/or related words of the topic words corresponding to the topic classification mode, and the number is positively correlated with the first compactness; and determining the similarity between the sentences;
correcting the first compactness at least once by using a specified algorithm according to the similarity between the sentences to obtain a second compactness between each sentence and the specified subject;
and processing each sentence according to the second compactness to obtain a text abstract of the target text related to the specified subject.
CN201910263357.1A 2019-04-02 2019-04-02 Text abstract generation method and device Active CN110162778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910263357.1A CN110162778B (en) 2019-04-02 2019-04-02 Text abstract generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263357.1A CN110162778B (en) 2019-04-02 2019-04-02 Text abstract generation method and device

Publications (2)

Publication Number Publication Date
CN110162778A CN110162778A (en) 2019-08-23
CN110162778B true CN110162778B (en) 2023-05-26

Family

ID=67638967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263357.1A Active CN110162778B (en) 2019-04-02 2019-04-02 Text abstract generation method and device

Country Status (1)

Country Link
CN (1) CN110162778B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704607A (en) * 2019-08-26 2020-01-17 北京三快在线科技有限公司 Abstract generation method and device, electronic equipment and computer readable storage medium
CN111046672B (en) * 2019-12-11 2020-07-14 山东众阳健康科技集团有限公司 Multi-scene text abstract generation method
CN111723196B (en) * 2020-05-21 2023-03-24 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
CN113836296A (en) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating Buddhist question-answer abstract
CN114627581B (en) * 2022-05-16 2022-08-05 深圳零匙科技有限公司 Coerced fingerprint linkage alarm method and system for intelligent door lock

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998041930A1 (en) * 1997-03-18 1998-09-24 Siemens Aktiengesellschaft Method for automatically generating a summarized text by a computer
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN109101489A (en) * 2018-07-18 2018-12-28 武汉数博科技有限责任公司 A kind of text automatic abstracting method, device and a kind of electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100785927B1 (en) * 2006-06-02 2007-12-17 삼성전자주식회사 Method and apparatus for providing data summarization
CN104156452A (en) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 Method and device for generating webpage text summarization
CN105488024B (en) * 2015-11-20 2017-10-13 广州神马移动信息科技有限公司 The abstracting method and device of Web page subject sentence
CN106294863A (en) * 2016-08-23 2017-01-04 电子科技大学 A kind of abstract method for mass text fast understanding
US20180349360A1 (en) * 2017-01-05 2018-12-06 Social Networking Technology, Inc. Systems and methods for automatically generating news article
CN108062351A (en) * 2017-11-14 2018-05-22 厦门市美亚柏科信息股份有限公司 Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN108090049B (en) * 2018-01-17 2021-02-05 山东工商学院 Multi-document abstract automatic extraction method and system based on sentence vectors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998041930A1 (en) * 1997-03-18 1998-09-24 Siemens Aktiengesellschaft Method for automatically generating a summarized text by a computer
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN109101489A (en) * 2018-07-18 2018-12-28 武汉数博科技有限责任公司 A kind of text automatic abstracting method, device and a kind of electronic equipment

Also Published As

Publication number Publication date
CN110162778A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110162778B (en) Text abstract generation method and device
WO2018049960A1 (en) Method and apparatus for matching resource for text information
CN107818085B (en) Answer selection method and system for reading understanding of reading robot
US9104979B2 (en) Entity recognition using probabilities for out-of-collection data
CN109388801B (en) Method and device for determining similar word set and electronic equipment
CN110322281B (en) Similar user mining method and device
CN110019669B (en) Text retrieval method and device
CN111078858A (en) Article searching method and device and electronic equipment
CN110196910B (en) Corpus classification method and apparatus
CN109492401B (en) Content carrier risk detection method, device, equipment and medium
CN112818126B (en) Training method, application method and device for network security corpus construction model
CN110309355B (en) Content tag generation method, device, equipment and storage medium
CN108875743A (en) A kind of text recognition method and device
CN108229564B (en) Data processing method, device and equipment
CN107908649B (en) Text classification control method
KR20180113444A (en) Method, apparauts and system for named entity linking and computer program thereof
CN116028626A (en) Text matching method and device, storage medium and electronic equipment
CN110795562A (en) Map optimization method, device, terminal and storage medium
CN110955845A (en) User interest identification method and device, and search result processing method and device
CN110851600A (en) Text data processing method and device based on deep learning
CN111144098B (en) Recall method and device for extended question
CN110968691B (en) Judicial hotspot determination method and device
Lai et al. An unsupervised approach to discover media frames
Tang et al. Labeled phrase latent Dirichlet allocation
US20220147574A9 (en) Expert stance classification using computerized text analytics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

GR01 Patent grant