CN114661892A - Manuscript abstract generation method and device, equipment and storage medium - Google Patents

Manuscript abstract generation method and device, equipment and storage medium

Info

Publication number
CN114661892A
Authority
CN
China
Prior art keywords
paragraph
sentences
manuscript
hit
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210298879.7A
Other languages
Chinese (zh)
Inventor
史峰霖
赵永飞
左贵森
郑妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Economic Information Service Co ltd
Original Assignee
China Economic Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Economic Information Service Co ltd filed Critical China Economic Information Service Co ltd
Priority to CN202210298879.7A priority Critical patent/CN114661892A/en
Publication of CN114661892A publication Critical patent/CN114661892A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method for generating a manuscript abstract, which comprises: performing word segmentation on the currently obtained query to obtain the keywords in the query; based on the keywords, extracting from a database the manuscripts that contain the entity words among the keywords as hit manuscripts; for each hit manuscript, extracting the sentences containing the keywords according to paragraph distribution as sentences to be merged, and determining the paragraph relation of each sentence to be merged; merging and reconstructing the sentences to be merged that have paragraph relations to obtain paragraph sentences, and taking each sentence to be merged that has no paragraph relation as a paragraph sentence on its own; and combining the paragraph sentences in order to construct a manuscript abstract. The method is suitable for searching news manuscripts in a database according to a query provided by a user and forming an abstract of each retrieved manuscript according to the keywords and paragraph relations. The resulting abstract fits the central idea of the article more closely, is more readable, and has a controllable word count.

Description

Manuscript abstract generating method and device, equipment and storage medium
Technical Field
The present disclosure relates to the technical field of abstract generation methods, and in particular, to a method, an apparatus, a device, and a storage medium for generating an abstract of a manuscript.
Background
Existing abstract generation methods fall mainly into two kinds: static abstract generation and dynamic abstract generation. A static abstract is generated in advance for an article and is built around the central idea of the article; its disadvantage is that the manuscript information a user is looking for cannot be surfaced according to what the user wants to obtain. A dynamically generated abstract is built entirely around the query provided by the user, so it cannot fully express the central idea of the article. In the related art, both the static and the dynamic abstract generation methods can be implemented in an extractive manner. However, extractive abstract generation mainly considers word frequency and carries little semantic information, so the complete semantic information within a paragraph cannot be established.
Therefore, how to accurately generate a manuscript abstract that both meets the user's requirement and stays tied to the central idea of the article has become an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present disclosure provides a method, an apparatus, a device, and a storage medium for generating a summary of a manuscript, which are used for generating a summary of a hit manuscript in a database based on a query provided by a user.
According to an aspect of the present disclosure, there is provided a manuscript summary generation method, including:
performing word segmentation processing on the query obtained currently to obtain keywords in the query;
based on the keywords, extracting a manuscript containing the entity words in the keywords from a database as a hit manuscript;
for each hit manuscript, extracting sentences containing the keywords according to paragraph distribution to serve as sentences to be merged, and determining paragraph relations of the sentences to be merged;
reconstructing and combining the sentences to be combined with paragraph relations to obtain paragraph sentences, and taking the sentences to be combined without the paragraph relations as a paragraph sentence;
and combining the paragraph sentences in sequence to construct a manuscript abstract.
In a possible implementation manner, when performing word segmentation processing on the query currently acquired, the processing is performed based on a word list constructed in advance.
In a possible implementation manner, when performing word segmentation processing on a query currently acquired based on a pre-constructed word list, the method includes:
determining whether a word recorded in the word list exists in the query;
and not performing word segmentation processing on the words recorded in the word list in the query.
In a possible implementation manner, when a manuscript containing an entity word in the keyword is extracted from a database as a hit manuscript based on the keyword, the method includes:
performing part-of-speech prediction on each keyword to obtain the part-of-speech of each keyword;
determining entity words in the keywords according to the parts of speech of the keywords;
and extracting all manuscripts containing the entity words from a database as hit manuscripts based on the determined entity words.
In a possible implementation manner, after extracting a manuscript containing the entity words in the keywords from the database as a hit manuscript, the method further includes: sorting the extracted hit manuscripts.
In a possible implementation manner, the extracted hit manuscripts are sorted according to a preset sorting rule;
wherein the sorting rule is, in order: manuscripts in which all function (virtual) words and entity words are hit, manuscripts in which all entity words are hit, manuscripts in which some of the function words and entity words are hit, and manuscripts in which some of the entity words are hit.
In a possible implementation manner, merging and reconstructing the sentences to be merged that have a paragraph relation according to the corresponding paragraph relation to obtain the paragraph sentence includes:
compressing the sentences to be merged, and fitting the compressed sentences to be merged that have a paragraph relation into the corresponding paragraph-relation sentence pattern for reconstruction and merging.
In a possible implementation manner, after combining the paragraph sentences in order to construct a manuscript abstract, the method further includes:
arranging the generated manuscript abstracts according to the order of the hit manuscripts.
According to another aspect of the present application, there is provided a manuscript digest generation apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 8 when executing the executable instructions.
According to another aspect of the application, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 8.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
The method is suitable for searching news manuscripts in a database according to a query provided by a user and forming an abstract of each retrieved manuscript according to the keywords and paragraph relations. Word segmentation is performed on the currently obtained query to obtain the keywords, and all manuscripts in the database that contain the entity words among the keywords are taken as hit manuscripts, so that the retrieved manuscripts better meet the user's requirements. Sentences containing the keywords are extracted from each hit manuscript as sentences to be merged; the sentences to be merged that have a paragraph relation are merged and reconstructed according to that relation to obtain paragraph sentences, and each sentence to be merged that has no paragraph relation is used as a paragraph sentence on its own. The paragraph sentences are then combined in order to construct the manuscript abstract, so that the resulting abstract fits the central idea of the article more closely, is more readable, and has a controllable word count. Finally, the titles of all hit manuscripts and the corresponding manuscript abstracts are displayed at the front end.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 shows a flowchart of a manuscript abstract generation method of an embodiment of the present disclosure;
Fig. 2 shows a main structure diagram of the manuscript abstract generation device of the embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a manuscript abstract generation method according to an embodiment of the present disclosure. Fig. 2 shows a main structure diagram of a manuscript abstract generation device according to an embodiment of the present disclosure. As shown in Fig. 1, the method for generating a manuscript abstract includes:
Step S100: performing word segmentation on the currently obtained query to obtain the keywords in the query.
Step S200: based on the keywords, extracting from the database the manuscripts that contain the entity words among the keywords as hit manuscripts.
Step S300: for each hit manuscript, extracting the sentences containing the keywords according to paragraph distribution as sentences to be merged, and determining the paragraph relation of each sentence to be merged.
Step S400: merging and reconstructing the sentences to be merged that have paragraph relations to obtain paragraph sentences, and using each sentence to be merged that has no paragraph relation as a paragraph sentence on its own.
Step S500: combining the paragraph sentences in order to construct a manuscript abstract.
The method is suitable for searching news manuscripts in a database according to a query provided by a user and forming an abstract of each retrieved manuscript according to the keywords and paragraph relations. Word segmentation is performed on the currently obtained query to obtain the keywords, and all manuscripts in the database that contain the entity words among the keywords are taken as hit manuscripts, so that the retrieved manuscripts better meet the user's requirements. Sentences containing the keywords are extracted from each hit manuscript as sentences to be merged; the sentences to be merged that have a paragraph relation are merged and reconstructed according to that relation to obtain paragraph sentences, and each sentence to be merged that has no paragraph relation is used as a paragraph sentence on its own. The paragraph sentences are then combined in order to construct the manuscript abstract, so that the resulting abstract fits the central idea of the article more closely, is more readable, and has a controllable word count. Finally, the titles of all hit manuscripts and the corresponding manuscript abstracts are displayed at the front end.
In a possible implementation manner, when the word segmentation processing is performed on the query obtained currently, the word segmentation processing can be performed based on a word list constructed in advance, and compared with a traditional word segmentation word list, the word list used in the disclosure contains news domain terms and other related terms, so that the word segmentation of the query can be matched with news manuscripts in a database better.
In a possible implementation manner, when performing word segmentation processing on a query currently acquired based on a pre-constructed word list, the method includes: it is determined whether there is a word recorded in the vocabulary in the query. And performing no word segmentation processing on the words recorded in the word list in the query. The vocabulary for performing word segmentation processing on the query may be constructed by using conventional technical means in the art, and is not described herein again.
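The vocabulary-protected segmentation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the longest-match scan, the function names `segment_query` and `char_fallback`, and the sample vocabulary are all assumptions made for the example. Any real word segmenter could be passed in place of the placeholder fallback.

```python
from typing import Callable, List, Set


def segment_query(query: str, vocab: Set[str],
                  fallback: Callable[[str], List[str]]) -> List[str]:
    """Segment `query`, keeping any word recorded in `vocab` intact."""
    max_len = max((len(w) for w in vocab), default=0)
    tokens: List[str] = []
    buffer = ""  # text between protected words, handed to the fallback segmenter
    i = 0
    while i < len(query):
        match = ""
        # Try the longest vocabulary entry starting at position i first.
        for length in range(min(max_len, len(query) - i), 0, -1):
            candidate = query[i:i + length]
            if candidate in vocab:
                match = candidate
                break
        if match:
            if buffer:
                tokens.extend(fallback(buffer))
                buffer = ""
            tokens.append(match)  # vocabulary word is not segmented further
            i += len(match)
        else:
            buffer += query[i]
            i += 1
    if buffer:
        tokens.extend(fallback(buffer))
    return tokens


def char_fallback(text: str) -> List[str]:
    # Placeholder segmenter (assumption): one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


# Hypothetical vocabulary of news-domain terms.
vocab = {"数字经济", "新华社"}
tokens = segment_query("新华社报道数字经济", vocab, char_fallback)
```

Here the two vocabulary terms survive as whole tokens, while the remaining text is left to the fallback segmenter.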
It should be noted that after the query is segmented, the resulting words are divided into stop words and non-stop words according to a stop word list. Each non-stop word is then masked in turn and predicted using the mask mechanism of a BERT model. If the edit distance between the predicted word and the non-stop word is less than or equal to a preset number of characters, or if the non-stop word has a similar word in the pre-constructed word list, the non-stop word is taken as a keyword.
For example, the query is segmented to obtain word A, word B, word C, word D, and word E. According to the stop word list, word A, word C, and word D are determined to be non-stop words, and word B and word E are determined to be stop words. The non-stop words (namely word A, word C, and word D) are then predicted in turn using the mask mechanism of the BERT model, yielding a predicted word A' for word A, a predicted word C' for word C, and a predicted word D' for word D. Next, the edit distance between word A and word A', between word C and word C', and between word D and word D' is calculated, and every non-stop word whose edit distance is less than or equal to the preset number of characters is determined to be a keyword. In this embodiment, if the edit distance between word A and word A' and the edit distance between word C and word C' are both less than or equal to the preset number of characters, word A and word C are both determined to be keywords. The mask mechanism of the BERT model is a conventional technical means in the field and is not described in detail herein.
Meanwhile, in the above possible implementation manners, whether each word obtained by segmentation is a keyword can also be judged according to whether a corresponding similar word exists in the pre-constructed word list. That is, after word A, word B, word C, word D, and word E have been divided into stop words and non-stop words, it is further determined whether each non-stop word (i.e., word A, word C, and word D) has a corresponding similar word in the pre-constructed word list. If a similar word exists, the non-stop word is determined to be a keyword; if not, it is determined to be a non-keyword.
For example, word A, word C, and word D are each compared for similarity against the words of the pre-constructed word list one by one, and a word whose similarity value is greater than or equal to a preset threshold is a similar word. In this embodiment, if the pre-constructed word list contains a word whose similarity to word A is greater than or equal to the preset threshold and a word whose similarity to word C is greater than or equal to the preset threshold, word A and word C are both determined to be keywords.
The stop word list can be flexibly selected according to actual use conditions or user preferences, and is not limited herein.
It should be noted that, when both conditions are used (the edit distance between the predicted word and the non-stop word, and whether the non-stop word has a similar word in the pre-constructed word list), it can first be determined whether the non-stop word has a similar word in the pre-constructed word list. If it does, the non-stop word is taken directly as a keyword and no edit distance is calculated. If no similar word exists in the pre-constructed word list, the edit distance between the predicted word and the non-stop word is calculated; when the edit distance is less than or equal to the preset number of characters, the non-stop word is determined to be a keyword, and when it is greater than the preset number of characters, the non-stop word is treated as a non-keyword.
It will be appreciated by those skilled in the art that the edit distance can be obtained by counting how many characters of the predicted word must be modified to turn it into the corresponding non-stop word. That is, the edit distance is the number of characters modified when a predicted word is converted into its corresponding non-stop word. For example, if the predicted word can be changed into the corresponding non-stop word by modifying one character, the edit distance between the two words is 1.
Here, it should also be noted that, when the edit distance between the predicted word and the non-stop word is 0, the non-stop word is determined to be a keyword. That is, in this embodiment of the present application, the preset number of characters may take a value in the range 0-2, and is preferably set to 1.
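The decision order described above (similar-word check first, edit distance only as a fallback) can be sketched as follows. This is a hedged illustration: `select_keywords`, the `predictions` mapping standing in for the masked-model output, and the crude `rough_distance` helper are all hypothetical names introduced for the example, and the sample data is invented.

```python
from typing import Callable, Dict, Iterable, List, Set


def select_keywords(tokens: Iterable[str],
                    stop_words: Set[str],
                    predictions: Dict[str, str],
                    has_similar_word: Callable[[str], bool],
                    edit_distance: Callable[[str, str], int],
                    max_edits: int = 1) -> List[str]:
    """Keep the non-stop words that qualify as keywords."""
    keywords: List[str] = []
    for token in tokens:
        if token in stop_words:
            continue  # stop words are never keywords
        if has_similar_word(token):
            keywords.append(token)  # similar word found: skip the edit distance
            continue
        predicted = predictions.get(token)
        if predicted is not None and edit_distance(predicted, token) <= max_edits:
            keywords.append(token)
    return keywords


def rough_distance(a: str, b: str) -> int:
    # Stand-in distance (assumption): mismatched positions plus length gap.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Invented example data; `predictions` stands in for masked-model output.
tokens = ["bank", "of", "rates", "the", "xqzt"]
keywords = select_keywords(
    tokens,
    stop_words={"of", "the"},
    predictions={"bank": "bank", "rates": "rate", "xqzt": "loan"},
    has_similar_word=lambda w: w in {"bank"},
    edit_distance=rough_distance,
)
```

Passing the similarity check and the distance function in as callables keeps the decision order separate from how either measure is actually computed.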
Further, when the edit distance between the predicted word and the corresponding non-stop word is calculated, the calculation can be implemented by adopting a common technical means in the field, and details are not described here.
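One common way to compute such a distance is the Levenshtein algorithm, sketched below. The text does not name a specific algorithm, so this choice is an assumption; note also that Levenshtein distance counts insertions and deletions as well as character modifications.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # keep or substitute a character
            ))
        previous = current
    return previous[-1]


PRESET_CHARACTERS = 1  # preferred value given in the text


def within_preset(predicted: str, word: str) -> bool:
    return edit_distance(predicted, word) <= PRESET_CHARACTERS
```

For two-character words such as 经济 and 经纪, one character differs, so the distance is 1 and the word passes with the preferred setting.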
It should be noted that when the divided non-stop words are judged whether there is a similar word in the pre-constructed word list, the following manner can be adopted. The similarity comparison is carried out on the non-stop words and the words in the word list constructed in advance, the similarity value of each non-stop word and the words in the word list is obtained, and then whether each non-stop word and the words in the word list are similar words or not is determined according to the obtained similarity value.
When each non-stop word is compared for similarity with the words in the pre-constructed word list, the comparison can be made one by one against every word, or words with the same number of characters as the non-stop word can first be extracted from the word list and the non-stop word then compared in turn only with those extracted words. During the comparison, as soon as a similarity value is greater than or equal to the preset threshold, it is directly determined that the non-stop word has a similar word in the pre-constructed word list.
In a possible implementation manner, the value of the preset threshold may be flexibly set according to the actual use condition or the preference of the user. Such as: the value range of the preset threshold may be set to 0.7 to 0.95, and preferably, the value of the preset threshold may be set to 0.87.
Meanwhile, it should be noted that, when comparing the similarity of each non-stop word with each word in the pre-constructed word list one by one, the similarity can be implemented by adopting the conventional similarity comparison technical means in the field, and the details are not repeated here.
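The similar-word lookup, including the optional same-length prefilter and early exit, might look like the sketch below. The similarity measure is not specified in the text, so `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in; the function name and sample vocabulary are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable

PRESET_THRESHOLD = 0.87  # preferred value given in the text


def has_similar_word(word: str, vocab: Iterable[str],
                     threshold: float = PRESET_THRESHOLD,
                     same_length_only: bool = True) -> bool:
    """Return True as soon as one vocabulary word is similar enough."""
    for candidate in vocab:
        if same_length_only and len(candidate) != len(word):
            continue  # prefilter: compare only words with equal character count
        # Stand-in similarity (assumption): ratio of matching characters.
        if SequenceMatcher(None, word, candidate).ratio() >= threshold:
            return True  # early exit on the first similar word
    return False


# Hypothetical vocabulary for illustration.
vocab = ["summarization", "paragraph"]
found = has_similar_word("summarisation", vocab)
missing = has_similar_word("manuscript", vocab)
```

The prefilter reduces the number of comparisons but can only find similar words of identical length; setting `same_length_only=False` falls back to the one-by-one comparison.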
After the keywords are extracted from the query input by the user, part-of-speech detection can be performed on them, and the keywords are divided into entity words and function words (also rendered as virtual or imaginary words). In a possible implementation manner, a CRF model trained on a corpus can be used to predict the part of speech of each keyword, and the keywords are divided into entity words and function words accordingly. Here, it is understood by those skilled in the art that an entity word is a word with an actual meaning, while a function word serves as a connective within the paragraph, making the paragraph sentence more coherent. The CRF model is a log-linear model; training it means learning the parameters that maximize the conditional probability on the training data. Then, given a sentence whose labels are unknown, the model infers the most probable part of speech for each word, i.e., it finds the optimal label sequence, and the parts of speech are thereby determined.
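For reference, the training and inference steps described above can be written out for a linear-chain CRF. The text only states that the model is log-linear; the linear-chain form, the feature functions f_k, and the weights λ_k below are the standard textbook formulation and are an assumption about the model actually used.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)

\hat{\lambda} = \arg\max_{\lambda} \sum_{i} \log p\big(y^{(i)} \mid x^{(i)}; \lambda\big),
\qquad
y^{*} = \arg\max_{y} p(y \mid x; \hat{\lambda})
```

Here x is the word sequence, y the part-of-speech label sequence, the first argmax is the training objective over labelled pairs, and the second is the inference step that returns the optimal label sequence.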
By any of the above methods, after the keywords are divided into entity words and function words, the manuscripts can be extracted. In a possible implementation manner, when manuscripts are extracted from the database based on the keywords, every manuscript containing any of the entity words obtained by the division may be extracted from the database as a hit manuscript.
That is, in the method according to the embodiment of the present application, when a document is extracted from the database as a hit document, part-of-speech prediction is performed on each keyword to obtain the part-of-speech of each keyword. And determining the entity words in the keywords according to the part of speech of each keyword. Based on the determined entity words, all manuscripts containing the entity words are extracted from the database to be used as hit manuscripts, so that the hit manuscripts in the database are more in line with the requirements of users.
Here, it should be explained that, when a document including an entity word extracted from the database is a hit document, the extracted document includes: a manuscript containing all the entity words, and a manuscript containing part of the entity words. That is, the manuscript stored in the database may be a hit manuscript as long as one of all the entity words is included.
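The hit rule above (a manuscript is a hit as soon as it contains at least one entity word) can be sketched as a simple filter. The in-memory dictionary standing in for the database, the substring test, and the function name are assumptions for illustration; a production system would more likely query an index.

```python
from typing import Dict, List


def find_hit_manuscripts(manuscripts: Dict[str, str],
                         entity_words: List[str]) -> List[str]:
    """Return the ids of manuscripts containing at least one entity word."""
    return [
        manuscript_id
        for manuscript_id, text in manuscripts.items()
        if any(word in text for word in entity_words)
    ]


# Hypothetical in-memory "database": id -> manuscript text.
database = {
    "m1": "The central bank raised interest rates.",
    "m2": "The harvest festival opened on Sunday.",
    "m3": "Mortgage rates fell for a third week.",
}
hits = find_hit_manuscripts(database, ["bank", "rates"])
```

Manuscript `m1` contains both entity words and `m3` only one, yet both count as hits.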
After all the hit manuscripts are extracted, they are sorted. In a possible implementation manner, after the manuscripts containing the entity words among the keywords are extracted from the database as hit manuscripts, the method further includes: sorting the extracted hit manuscripts, so that they are ordered by their degree of relevance to the input query. The sorted hit manuscripts make searching more convenient for the user.
In a possible implementation manner, the extracted hit manuscripts are sorted according to a preset sorting rule. The sorting rule is, in order: manuscripts in which all function (virtual) words and entity words are hit, manuscripts in which all entity words are hit, manuscripts in which some of the function words and entity words are hit, and manuscripts in which some of the entity words are hit. Sorting the hit manuscripts by degree of relevance in this way further facilitates the user's search.
Here, it should be noted that manuscripts with the same hit condition are sorted by release time. That is, the manuscripts in which all function words and entity words are hit are sorted among themselves by release time, the manuscripts in which all entity words are hit are likewise sorted by release time, and so on.
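The four-tier ordering with a release-time tie-break can be sketched with a composite sort key. The `Manuscript` record, the substring matching, and the newest-first direction are assumptions for illustration; the text does not say whether earlier or later manuscripts come first. "Function words" here denotes the non-entity (virtual) keywords.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Manuscript:
    title: str
    text: str
    published: date


def hit_tier(text: str, entity_words: List[str], function_words: List[str]) -> int:
    """0: all words hit, 1: all entity words, 2: some of both, 3: some entity words."""
    all_entities = all(w in text for w in entity_words)
    if all_entities and all(w in text for w in function_words):
        return 0
    if all_entities:
        return 1
    if any(w in text for w in function_words):
        return 2
    return 3


def sort_hits(hits: List[Manuscript], entity_words: List[str],
              function_words: List[str], newest_first: bool = True) -> List[Manuscript]:
    sign = -1 if newest_first else 1  # tie-break direction is an assumption
    return sorted(hits, key=lambda m: (
        hit_tier(m.text, entity_words, function_words),
        sign * m.published.toordinal(),
    ))


hits = [
    Manuscript("A", "bank news", date(2024, 1, 1)),
    Manuscript("B", "bank raised rates because inflation", date(2024, 1, 2)),
    Manuscript("C", "bank and rates", date(2024, 1, 3)),
    Manuscript("D", "rates fell because demand", date(2024, 1, 4)),
    Manuscript("E", "bank cut rates because growth", date(2024, 1, 5)),
]
ordered_titles = [m.title for m in sort_hits(hits, ["bank", "rates"], ["because"])]
```

Because Python's sort is stable and the key is a tuple, the tier dominates and release time only decides the order within a tier.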
After all the hit manuscripts are sequenced, sentences to be merged and paragraph sentences can be extracted from all the hit manuscripts respectively, then all the paragraph sentences are arranged in sequence to form a manuscript abstract, wherein when the sentences to be merged are extracted from all the hit manuscripts, the extracted sentences to be merged are sentences containing keywords, and the sentences to be merged are extracted from the hit manuscripts according to paragraph distribution. Meanwhile, when the sentences containing the keywords are extracted from the manuscript and used as the sentences to be merged, the sentences can be realized by adopting the technical means commonly used in the field, and the details are not repeated here.
And after the sentences containing the keywords are extracted from the manuscripts and used as the sentences to be merged, reconstructing and merging the sentences to be merged. In the method of the embodiment of the present application, reconstruction and merging may be performed according to paragraph relations between the statements to be merged.
Specifically, when the to-be-merged sentences with paragraph relations are merged and reconstructed according to the corresponding paragraph relations to obtain the paragraph sentences, the method includes: and compressing the sentences to be merged, and then reconstructing and merging the compressed sentences to be merged according to the corresponding paragraph relation sentence pattern to generate a sentence.
The language model can be adopted to realize the compression of the sentences to be merged. Namely, the language model is used for judging whether each statement to be merged needs to be compressed or not, and meanwhile, the statements to be merged are directly compressed when the fact that the statements need to be compressed is judged, so that the sentence length of the statements to be merged is shortened. Here, it should be noted that the determination of whether the sentence to be merged is to be compressed may be implemented by comparing the number of characters of the sentence to be merged with a preset number of characters. And when the number of the characters of the sentence to be merged is greater than or equal to the preset number of the characters, determining that the sentence to be merged needs to be compressed. And when the character number of the sentences to be merged is smaller than the preset character number by comparison, determining that the sentences to be merged do not need to be compressed. In a possible implementation manner, the value of the preset number of characters can be flexibly set according to the actual situation, such as: the setting may be made according to the total number of characters of the digest to be currently generated. Generally, the preset number of characters can be set as: 5-8.
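The length check that gates compression can be sketched as below. The compression itself is done by a language model in the text; here it is represented by an injected `compress` callable, and the truncating stub, the function name, and the default limit of 8 (the upper end of the stated 5-8 range) are assumptions for the example.

```python
from typing import Callable


def maybe_compress(sentence: str, compress: Callable[[str], str],
                   preset_characters: int = 8) -> str:
    """Compress only when the sentence reaches the preset character count."""
    if len(sentence) >= preset_characters:
        return compress(sentence)
    return sentence  # short sentences are left unchanged


def truncate_stub(sentence: str) -> str:
    # Placeholder for a model-based compressor (assumption): keep 5 characters.
    return sentence[:5]


long_result = maybe_compress("abcdefgh", truncate_stub)
short_result = maybe_compress("abc", truncate_stub)
```

Tying `preset_characters` to the target length of the abstract, as the text suggests, is one way to keep the overall word count controllable.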
When reconstructing and merging the compressed sentences to be merged according to the corresponding paragraph relation sentence pattern, the sentence to be merged needs to be judged in paragraph relation. Here, it should be explained that the judgment of the paragraph relationship of the sentences to be merged refers to the judgment of the paragraph relationship of adjacent sentences to be merged belonging to the same paragraph. Meanwhile, in a possible implementation manner, a classification model may be adopted when the paragraph relationship is determined for adjacent to-be-merged statements, and preferably, the classification model is constructed based on a bert model and an LSTM model, which are common technical means in the field and are not described herein again
The paragraph relation comprises at least one of a general branch relation, a parallel relation, a causal relation, a bearing relation, a turning relation and a non-relation. Meanwhile, paragraph relations can be classified by using a classification model, sentences to be merged which belong to the same paragraph and have paragraph relations are merged together to obtain corresponding paragraph sentences (for example, sentences to be merged which have causal relations are merged together in a format of. And then obtaining the abstract of the manuscript, wherein the obtained abstract of the manuscript can be well related to the idea of the article center.
Here, when the classification model is used to classify a relation, if the confidence of none of the relations exceeds its threshold, the sentences are considered to have no relation. Specifically, the classification model outputs a confidence for each paragraph relation, and the minimum confidence value for each paragraph relation is determined in advance on a test set. If the confidence of a paragraph relation exceeds this preset minimum value, the sentences are considered to have that paragraph relation; otherwise they are considered unrelated, and if no relation is hit, the sentences are considered unrelated.
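As a rough sketch of this thresholding, assuming the classifier's confidences are already available as a mapping from relation name to score: the function name pick_relation and the relation labels are hypothetical. Picking the highest-scoring relation when several exceed their thresholds is also an assumption, since the text does not say how such cases are resolved.

```python
from typing import Dict, Optional


def pick_relation(
    confidences: Dict[str, float],
    min_confidence: Dict[str, float],
) -> Optional[str]:
    """Return the paragraph relation that is hit, or None for "no relation".

    A relation is hit only when its confidence exceeds the minimum value
    determined for it in advance. Among the hit relations, the one with the
    highest confidence is returned (an assumption, not stated in the text).
    """
    best: Optional[str] = None
    best_score = float("-inf")
    for relation, score in confidences.items():
        threshold = min_confidence.get(relation)
        if threshold is None:
            continue  # no preset threshold for this label: treat as not hit
        if score > threshold and score > best_score:
            best, best_score = relation, score
    return best
```

A return value of None corresponds to the "unrelated" outcome, in which the sentence is kept as its own paragraph sentence.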
It should be noted that the language model and a syntactic analysis model are used to compress and reconstruct the sentences to be merged: the sentences to be merged are compressed into one sentence, and sentences to be merged that have a paragraph relation can be nested into the preset paragraph relation sentence pattern (for example, the sentence pattern set for a causal relation) to be reconstructed. If there is only one sentence to be merged, it is left unchanged.
It should be noted that there are two ways to reconstruct and merge sentences to be merged that have a paragraph relation: one is to reconstruct and merge them according to the sentence pattern corresponding to the paragraph relation, and the other is to merge them directly into one sentence. In a possible implementation manner, the PPL (perplexity) value of the result of each way may be calculated, and the way to use is determined according to the size of the PPL values. For example, the reconstruction and merging way with the smaller PPL value can be selected.
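The selection by PPL value can be sketched as below. The perplexity function uses the common definition (the exponential of the average negative log-probability per token), which is an assumption here because the disclosure does not give a formula. How the per-token log-probabilities are obtained from the model is likewise left open, so the scorer is passed in as a hypothetical callable.

```python
import math
from typing import Callable, Dict, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities (common definition)."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def choose_merge(candidates: Dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the merging way whose merged text has the smallest PPL.

    `candidates` maps a way name (for example "template" or "direct") to the
    merged text; `score` is a placeholder for the model-based PPL computation.
    """
    if not candidates:
        raise ValueError("no candidate merging ways were given")
    return min(candidates, key=lambda name: score(candidates[name]))
```

In use, the two candidates would be the text produced with the relation sentence pattern and the text produced by direct concatenation, and the lower-scoring one is kept.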
For example, in one of the extracted paragraphs of hit manuscript A, the sentences to be merged are extracted according to the keywords, yielding: sentence a to be merged, sentence b to be merged, sentence c to be merged, sentence d to be merged, sentence e to be merged and sentence f to be merged.
After the sentences to be merged are compressed, the paragraph relation between each sentence to be merged and its adjacent sentence is judged. For example, the paragraph relation is judged between sentence a and sentence b, between sentence b and sentence c, between sentence c and sentence d, and so on, so that the paragraph relations between adjacent sentences to be merged in the same paragraph are obtained.
For example, it is judged that sentence a and sentence b to be merged are in a causal relation, sentence b and sentence c are in a parallel relation, sentence c and sentence d are unrelated, sentence d and sentence e are in a turning relation, and sentence e and sentence f are unrelated.
Then sentence a, sentence b and sentence c to be merged are nested into the template sentence pattern '…… and …… because ……'. A BERT model is used to calculate the PPL value (the probability distribution of the whole sentence) of the result of directly merging sentences a, b and c without the template sentence pattern (a. b. c.) and the PPL value of the result of merging them with the template sentence pattern (…… b and c.), and the merged result with the smallest PPL value is taken as paragraph sentence A.
Sentence d and sentence e to be merged are nested into the template sentence pattern '…… but ……'. A BERT model is used to calculate the PPL value of the result of directly merging sentences d and e without the template sentence pattern (d. e.) and the PPL value of the result of merging them with the template sentence pattern (d but e.). The merged result with the smallest PPL value is taken as paragraph sentence D. Sentence f to be merged is retained as paragraph sentence F.
Paragraph sentence A, paragraph sentence D and paragraph sentence F are then arranged in order into one paragraph, giving the abstract of that paragraph of hit manuscript A. By analogy, the above operations are performed on the other paragraphs of hit manuscript A to obtain the abstract of each paragraph, and the paragraph abstracts are arranged and combined in paragraph order to obtain the final abstract of hit manuscript A.
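The grouping step in the worked example above, where adjacent sentences that have a paragraph relation are chained and an unrelated pair starts a new group, can be sketched as follows. The function name group_adjacent and the use of None to mark "unrelated" are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def group_adjacent(
    sentences: Sequence[str],
    relations: Sequence[Optional[str]],
) -> List[List[str]]:
    """Group adjacent sentences to be merged by their pairwise relations.

    relations[i] is the paragraph relation judged between sentences[i] and
    sentences[i + 1], or None when the pair is unrelated. Each resulting
    group is later merged into one paragraph sentence.
    """
    if len(relations) != max(len(sentences) - 1, 0):
        raise ValueError("expected one relation per adjacent sentence pair")
    groups: List[List[str]] = []
    for i, sentence in enumerate(sentences):
        if i > 0 and relations[i - 1] is not None:
            groups[-1].append(sentence)  # related to the previous sentence
        else:
            groups.append([sentence])  # first sentence, or unrelated pair
    return groups
```

With sentences a to f and the relations causal, parallel, unrelated, turning, unrelated, this yields the three groups (a, b, c), (d, e) and (f), matching paragraph sentences A, D and F in the example.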
After the manuscript abstracts are generated, each manuscript abstract is displayed at the front end in one-to-one correspondence with the title of its hit manuscript. In a possible implementation manner, after the paragraph sentences are combined in order to construct the manuscript abstract, the method further includes: arranging the generated manuscript abstracts according to the order of the hit manuscripts, so that the front end can display all the ordered manuscript titles and the related content abstracts.
Still further, according to another aspect of the present disclosure, there is also provided a manuscript abstract generation device 200. Referring to fig. 2, the manuscript abstract generation device 200 according to the embodiment of the present disclosure includes a processor 210 and a memory 220 for storing instructions executable by the processor 210. The processor 210 is configured to implement any of the above manuscript abstract generation methods when executing the executable instructions.
Here, it should be noted that the number of processors 210 may be one or more. The manuscript abstract generation device 200 according to the embodiment of the present disclosure may further include an input device 230 and an output device 240. The processor 210, the memory 220, the input device 230, and the output device 240 may be connected via a bus or in other ways, which is not specifically limited herein.
The memory 220, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as the programs or modules corresponding to the manuscript abstract generation method in the embodiment of the present disclosure. The processor 210 executes the various functional applications and data processing of the manuscript abstract generation device 200 by running the software programs or modules stored in the memory 220.
The input device 230 may be used to receive input numbers or signals. The signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device 240 may include a display device such as a display screen.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by the processor 210, implement any of the manuscript abstract generation methods described above.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for generating a manuscript abstract, characterized by comprising the following steps:
performing word segmentation processing on the query obtained currently to obtain keywords in the query;
based on the keywords, extracting a manuscript containing the entity words in the keywords from a database as a hit manuscript;
for each hit manuscript, extracting sentences containing the keywords according to paragraph distribution to serve as sentences to be merged, and determining paragraph relations of the sentences to be merged;
reconstructing and combining the sentences to be combined with paragraph relations to obtain paragraph sentences, and taking the sentences to be combined without the paragraph relations as a paragraph sentence;
and combining the paragraph sentences in sequence to construct a manuscript abstract.
2. The method of claim 1, wherein when performing word segmentation processing on the currently obtained query, the processing is performed based on a pre-constructed word list.
3. The method of claim 2, wherein when performing word segmentation processing on the currently obtained query based on a pre-constructed word list, the method comprises:
determining whether a word recorded in the word list exists in the query;
and not performing word segmentation processing on the words recorded in the word list in the query.
4. The method of claim 1, wherein extracting a manuscript containing the entity words in the keywords from the database as a hit manuscript based on the keywords comprises:
performing part-of-speech prediction on each keyword to obtain the part-of-speech of each keyword;
determining entity words in the keywords according to the parts of speech of the keywords;
and extracting all manuscripts containing the entity words from a database as hit manuscripts based on the determined entity words.
5. The method of claim 1, wherein after extracting a manuscript containing the entity words in the keywords from the database as a hit manuscript, the method further comprises: sorting the extracted hit manuscripts.
6. The method of claim 5, wherein the extracted hit manuscripts are sorted according to a preset sorting rule;
wherein the sorting rule orders the hit manuscripts as follows: manuscripts in which all virtual and real words are hit, manuscripts in which part of the virtual and real words are hit, and manuscripts in which part of the real words are hit.
7. The method according to claim 6, wherein merging and reconstructing the sentences to be merged having paragraph relations according to the corresponding paragraph relations to obtain the paragraph sentences comprises:
compressing the sentences to be merged, and nesting the compressed sentences to be merged having paragraph relations into the corresponding paragraph relation sentence patterns to reconstruct and merge them.
8. The method of claim 5, wherein after combining the paragraph sentences in order and constructing a manuscript abstract, the method further comprises:
arranging the generated manuscript abstracts according to the order of the hit manuscripts.
9. A manuscript abstract generation device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 8 when executing the executable instructions.
10. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 8.
CN202210298879.7A 2022-03-25 2022-03-25 Manuscript abstract generation method and device, equipment and storage medium Pending CN114661892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210298879.7A CN114661892A (en) 2022-03-25 2022-03-25 Manuscript abstract generation method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210298879.7A CN114661892A (en) 2022-03-25 2022-03-25 Manuscript abstract generation method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114661892A true CN114661892A (en) 2022-06-24

Family

ID=82031427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210298879.7A Pending CN114661892A (en) 2022-03-25 2022-03-25 Manuscript abstract generation method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114661892A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965013A (en) * 2023-03-16 2023-04-14 北京朗知网络传媒科技股份有限公司 Automobile media article generation method and device based on demand identification

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965013A (en) * 2023-03-16 2023-04-14 北京朗知网络传媒科技股份有限公司 Automobile media article generation method and device based on demand identification
CN115965013B (en) * 2023-03-16 2023-11-28 北京朗知网络传媒科技股份有限公司 Automobile media article generation method and device based on demand identification

Similar Documents

Publication Publication Date Title
CN106897428B (en) Text classification feature extraction method and text classification method and device
CN107832414B (en) Method and device for pushing information
CN106156204B (en) Text label extraction method and device
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
CN109325201A (en) Generation method, device, equipment and the storage medium of entity relationship data
CN111090771B (en) Song searching method, device and computer storage medium
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN112131876A (en) Method and system for determining standard problem based on similarity
CN109325122A (en) Vocabulary generation method, file classification method, device, equipment and storage medium
CN112988784B (en) Data query method, query statement generation method and device
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN114997288A (en) Design resource association method
CN112307336A (en) Hotspot information mining and previewing method and device, computer equipment and storage medium
CN114492420B (en) Text classification method, device and equipment and computer readable storage medium
CN114462392A (en) Short text feature expansion method based on topic relevance and keyword association
CN115526171A (en) Intention identification method, device, equipment and computer readable storage medium
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN114661892A (en) Manuscript abstract generation method and device, equipment and storage medium
JP2006227823A (en) Information processor and its control method
JP2019128925A (en) Event presentation system and event presentation device
CN113641788B (en) Unsupervised long and short film evaluation fine granularity viewpoint mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination