CN111581374A - Text abstract obtaining method and device and electronic equipment - Google Patents

Text abstract obtaining method and device and electronic equipment

Info

Publication number
CN111581374A
CN111581374A
Authority
CN
China
Prior art keywords
text
abstract
vector
statement
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010387665.8A
Other languages
Chinese (zh)
Inventor
史文丽
谭松波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202010387665.8A priority Critical patent/CN111581374A/en
Publication of CN111581374A publication Critical patent/CN111581374A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text abstract obtaining method and device and an electronic device, wherein the text abstract obtaining method comprises the following steps: obtaining a text to be processed, wherein the text comprises a plurality of text sentences; performing vector conversion on the characters in a text sentence to obtain a word vector and a position vector of the characters; encoding the word vector and the position vector by using an encoder to obtain an encoded vector; decoding the encoded vector by using a decoder corresponding to the encoder to obtain an initial abstract sentence corresponding to the text sentence; and extracting target abstract sentences meeting an abstract extraction condition from the initial abstract sentences. In this method, vector conversion is performed on the text sentences first, and the initial abstract sentences are then generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.

Description

Text abstract obtaining method and device and electronic equipment
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a method and an apparatus for obtaining an abstract of a text, and an electronic device.
Background
At present, abstracts of texts such as articles or novels can be obtained using extractive algorithms.
However, although an abstract obtained by an extractive algorithm can guarantee grammatical correctness, it cannot express the content of the text comprehensively and accurately, so the resulting abstract is one-sided.
Disclosure of Invention
In view of the above, the present application provides a text abstract obtaining method and device and an electronic device, as follows:
a text abstract acquisition method comprises the following steps:
obtaining a text to be processed, wherein the text comprises a plurality of text sentences, and the text sentences comprise at least one character;
performing vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters;
encoding the word vector and the position vector by using an encoder to obtain an encoded vector;
decoding the coding vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement;
and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
Preferably, in the above method, performing vector conversion on the characters in the text statement to obtain the word vector and the position vector of the characters includes:
processing the text statement to obtain a word list corresponding to the text statement, wherein the word list comprises characters in the text statement;
processing the characters in the character list by using a preset bert model to obtain a character vector of the characters;
and carrying out position coding on the characters in the character list by using a preset coding function so as to obtain the position vectors of the characters.
Preferably, in the above method, encoding the word vector and the position vector by using an encoder to obtain an encoded vector includes:
and inputting the word vector and the position vector into a plurality of encoders constructed at least based on a self-attention mechanism and a neural network so as to obtain an encoding vector corresponding to the text statement output by the encoders.
Preferably, in the above method, obtaining the text to be processed includes:
performing form conversion on data to be processed to obtain an initial text;
sentence dividing operation is carried out on the initial text by using the sentence dividing separators to obtain a plurality of text sentences;
and carrying out sentence screening on the plurality of text sentences to obtain the text to be processed.
Preferably, in the above method, performing sentence screening on the plurality of text sentences to obtain the text to be processed includes:
eliminating sentences of the plurality of text sentences with the length smaller than a first threshold value;
and/or deleting other sentences except the target sentence at the target position in the plurality of text sentences if the number of sentences in the plurality of text sentences is larger than a second threshold value.
Preferably, in the above method, extracting target abstract statements satisfying the abstract extraction condition from the plurality of initial abstract statements includes:
scoring at least one scoring mode of the initial abstract sentences in the plurality of initial abstract sentences to obtain at least one score of the initial abstract sentences;
obtaining a sentence score of the initial abstract sentence at least according to the at least one score;
and obtaining at least one target abstract statement of which the statement score meets abstract extraction conditions from the plurality of initial abstract statements.
Preferably, the obtaining the sentence score of the initial abstract sentence according to at least the at least one score includes:
and carrying out weighted summation on the at least one score according to a preset weight value of the corresponding scoring mode to obtain the sentence score of the initial abstract sentence.
Preferably, in the above method, the abstract extraction condition includes: the length of the target abstract statement is smaller than a third threshold, and/or the target abstract statement ranks in the top N when the initial abstract statements are sorted by statement score from high to low, wherein N is a positive integer greater than or equal to 1.
An apparatus for obtaining a summary of a text, comprising:
the text processing device comprises a text obtaining unit, a processing unit and a processing unit, wherein the text obtaining unit is used for obtaining a text to be processed, the text comprises a plurality of text sentences, and the text sentences comprise at least one character;
the vector conversion unit is used for carrying out vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters;
the vector coding unit is used for coding the word vector and the position vector by using a coder to obtain a coded vector;
a vector decoding unit, configured to decode the encoded vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement;
and the statement extraction unit is used for extracting the target abstract statement meeting the abstract extraction condition from the initial abstract statement.
An electronic device, comprising:
the memory is used for storing an application program and data generated by the running of the application program;
a processor for executing the application to implement: obtaining a text to be processed, wherein the text comprises a plurality of text sentences, and the text sentences comprise at least one character; performing vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters; encoding the word vector and the position vector by using an encoder to obtain an encoded vector; decoding the coding vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement; and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
According to the technical scheme, after the text to be processed containing a plurality of text sentences is obtained, the characters in each text sentence are subjected to vector conversion, the word vectors and position vectors obtained by the conversion are encoded by an encoder to obtain encoded vectors, the encoded vectors are then decoded by a decoder corresponding to the encoder to obtain initial abstract sentences, and finally the target abstract sentences meeting the abstract extraction condition are extracted from the initial abstract sentences. Therefore, before sentence extraction, vector conversion is performed on the text sentences and the initial abstract sentences are generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a method for obtaining a text abstract according to an embodiment of the present application;
FIGS. 2-4 are partial flow charts of a first embodiment of the present application;
fig. 5 is a schematic structural diagram of a text abstract acquiring apparatus according to a second embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the present application;
fig. 7 is a model architecture diagram of an embodiment of the present application in a specific implementation.
Detailed Description
Text summarization plays a fundamental and important role in Natural Language Processing (NLP). The inventors of the present application found through research that, owing to the complexity of articles, most abstracts generated by current algorithms are extractive. Although extractive abstracts can guarantee grammatical correctness, they perform poorly on summarizing sentences: sentences that genuinely summarize an article rarely appear verbatim in it, and such sentences are generally long, so the resulting abstract tends to be one-sided. Generative abstract generation algorithms, by contrast, summarize on the basis of understanding the original article and can reach a general conclusion, but they remain powerless when the article is long, which is mainly caused by their feature extractor.
In view of the above problems, the inventors of the present application further studied and proposed an abstract obtaining scheme that combines the advantages of both extractive and generative algorithms, so as to generate summarizing and comprehensive abstract sentences to the greatest extent. The scheme is as follows:
according to the technical scheme, after a text containing a plurality of text sentences to be processed is obtained, vector conversion is carried out on characters in the text sentences to obtain word vectors and position vectors of the characters, the word vectors and the position vectors are encoded by an encoder to obtain encoding vectors, then a decoder corresponding to the encoder is used for decoding the encoding vectors to obtain initial abstract sentences corresponding to the text sentences, and finally target abstract sentences meeting abstract extraction conditions are extracted from the initial abstract sentences.
Therefore, before sentence extraction, vector conversion is performed on the text sentences and the initial abstract sentences are generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a flowchart of an implementation of a text abstract obtaining method provided in an embodiment of the present application is shown; the method may be applied to an electronic device capable of data processing, particularly text processing, such as a computer or a server. The technical scheme in this embodiment is mainly used to ensure that, when the abstract of a text is obtained, the obtained abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness, avoiding the situation that the obtained abstract content is one-sided.
In a specific implementation, the method in this embodiment may include the following steps:
step 101: and obtaining the text to be processed.
The text includes a plurality of text sentences, and each text sentence is composed of at least one character. For example, the text sentence '×× Limited Liability Company holds 800 shares of the company' includes a plurality of characters, and a character may be one or more of a word, a number, a letter, and the like.
In a specific implementation, in this embodiment, text arrangement may be performed on the whole article to obtain a text to be processed, for example, contents such as pictures and emoticons in the article are removed to obtain the text to be processed; or after reading the whole article, the article is taken as the text to be processed.
Step 102: and carrying out vector conversion on the characters in the text sentence to obtain a word vector and a position vector of the characters.
In this embodiment, the characters in the text sentence may be converted by using a bert model, an encoding function, and the like, so as to obtain the word vector and the position vector of each character.
It should be noted that both the word vector and the position vector of a character can be represented by a matrix. For example, the word vector of a character in a text sentence may be a vector with a matrix dimension of 25 × 768 (25 being the number of characters in the text sentence in which the character exists), and the position vector of the character may be a vector with a matrix dimension of 512 × 768 (512 being the number of characters in the text in which the character exists).
Step 103: and encoding the word vector and the position vector by using an encoder to obtain an encoded vector.
There may be a plurality of encoders in this embodiment, and the structure and implementation algorithm of each encoder match one another; for example, each encoder may be constructed based on a self-attention mechanism and a neural network. Accordingly, in this embodiment, the word vector and the position vector may be input into the plurality of encoders constructed at least based on the self-attention mechanism and the neural network, so as to obtain the encoded vector, output by the encoders, that corresponds to the text sentence.
For example, in this embodiment, the word vector and the position vector may be added to obtain a vector with a matrix dimension of 1 × 512 × 768, and the result is then input into the encoders for encoding. There may be 6 encoders; in each encoder the input first passes through a self-attention layer, which helps the encoder look at the other characters of the input sentence while encoding a single character, and the output of the self-attention layer is then connected to a fully connected feed-forward neural network. On this basis, the encoded vector output after encoding by the encoders is a vector with a matrix dimension of 1 × 512 × 768.
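As an illustration only, a minimal sketch of such an encoder stack in PyTorch is given below; the layer count and vector width follow the dimensions in this example, while the head count and feed-forward width are assumptions not specified in the text.

import torch
import torch.nn as nn

d_model = 768  # vector width used in this example

# Each layer is self-attention followed by a fully connected feed-forward
# network, matching the structure described above.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,   # 12 heads is an assumption
                                   dim_feedforward=3072,        # FFN width is an assumption
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)            # 6 encoders, as described

word_vecs = torch.randn(1, 512, d_model)  # stand-in for the BERT word vectors
pos_vecs = torch.randn(1, 512, d_model)   # stand-in for the position vectors

encoded = encoder(word_vecs + pos_vecs)   # the two vectors are added, then encoded
print(encoded.shape)                      # torch.Size([1, 512, 768])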
Step 104: and decoding the coding vector by using a decoder corresponding to the coder at least to obtain an initial abstract statement corresponding to the text statement.
The decoders in this embodiment correspond to the encoders, and there may be a plurality of them; each decoder in the decoder group also has a hierarchical structure similar to that of the encoder, except that each decoder has one additional encoder-decoder attention layer, which helps the decoder focus on the relevant characters of the input sentence. For example, the output after decoding by the decoders is a vector with a matrix dimension of 512 × 768, which is the initial abstract sentence.
Step 105: and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
In this embodiment, partial sentences may be extracted from a plurality of initial abstract sentences generated by encoding and decoding, and then target abstract sentences satisfying the abstract extraction condition are extracted, so that the target abstract sentences form a text abstract.
It should be noted that the abstract extraction condition may be: among the initial abstract sentences, the length of the target abstract sentence is smaller than a specific threshold, and/or the target abstract sentence ranks in the top N when the initial abstract sentences are sorted according to a specific parameter, where N is a positive integer greater than or equal to 1, and the like.
According to the above scheme, in the text abstract obtaining method provided in this embodiment of the present application, after the text to be processed containing a plurality of text sentences is obtained, the characters in each text sentence are subjected to vector conversion, the word vectors and position vectors obtained by the conversion are encoded by an encoder to obtain encoded vectors, the encoded vectors are then decoded by a decoder corresponding to the encoder, and finally the target abstract sentences meeting the abstract extraction condition are extracted from the initial abstract sentences. It can be seen that, in this embodiment, before sentence extraction, vector conversion is performed on the text sentences and the initial abstract sentences are generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.
In one implementation, step 102 may be implemented by performing vector conversion on characters in a text sentence to obtain a word vector and a position vector of the characters, as shown in fig. 2:
step 201: and processing the text sentence to obtain a word list corresponding to the text sentence.
Wherein the word list contains the characters in the text sentence. For example, the sentence '有限责任公司持有公司股份800股' ('the limited liability company holds 800 shares of the company') is processed into a word list, i.e., a list built from word roots and word endings so that new words can be recognized to the greatest extent, such as: ['有', '限', '责', '任', '公', '司', '持', '有', '公', '司', '股', '份', '80', '##0', '股'].
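Such a list can be produced with a WordPiece-style tokenizer; the sketch below uses the Hugging Face transformers library with the bert-base-chinese checkpoint, both of which are assumptions not named in the text, and the exact split of the digits depends on the vocabulary.

from transformers import BertTokenizer

# "bert-base-chinese" is an assumed checkpoint; the patent does not name one.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

tokens = tokenizer.tokenize("有限责任公司持有公司股份800股")
print(tokens)
# Chinese text is split into single characters; the number is split into
# WordPiece pieces such as '80', '##0' (the exact split depends on the vocabulary).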
Step 202: and processing the characters in the character list by using a preset bert model to obtain a character vector of the characters.
For example, in this embodiment a bert model is used to perform vector conversion on each character in the word list of each text sentence, so as to obtain the word vector of each character. The word vector may be represented by a matrix; for example, the word vector of a character in a text sentence may be a vector with a matrix dimension of 25 × 768 (25 being the number of characters in the text sentence in which the character is located).
Step 203: and carrying out position coding on the characters in the character list by using a preset coding function to obtain the position vectors of the characters.
The encoding function may be a trigonometric function, i.e., the sine (sin) or cosine (cos) function. For example, in this embodiment the position of each character is encoded with trigonometric functions, where the position vector can be represented by a matrix; for example, the position vector of a character may be a vector with a matrix dimension of 512 × 768 (512 being the number of characters in the text in which the character is located).
In a specific implementation, steps 102-104 in this embodiment may be implemented with a training model constructed based on a self-attention mechanism and a neural network, in which a bert model is configured to perform vector conversion on the characters, and an encoder and a decoder constructed based on the self-attention mechanism and the neural network are configured; a generative abstract sentence acquisition scheme is thus implemented by combining bert with the Transformer mechanism, ensuring that the generated initial abstract sentences can comprehensively and accurately express the semantic content of the corresponding text sentences.
It should be noted that the execution order of step 202 and step 203 is not limited to the order shown in the drawings; in other implementations, step 203 may be executed before step 202, or the two steps may be executed simultaneously, and technical solutions that differ only in the execution order of the two steps all fall within the protection scope of the present application.
In one implementation, step 101, when obtaining the text to be processed, may be implemented by the following manner, as shown in fig. 3:
step 301: performing form conversion on data to be processed to obtain an initial text;
for example, in this embodiment, an illegal character such as a messy code in the article data to be processed is deleted, so as to obtain an initial text, where the initial text includes a plurality of characters.
Step 302: and carrying out sentence splitting operation on the initial text by using the sentence splitting separators to obtain a plurality of text sentences.
The text after sentence splitting may comprise a plurality of paragraphs, each of which contains a plurality of text sentences.
Step 303: and (4) carrying out statement screening on the plurality of text statements to obtain the text to be processed.
For example, in this embodiment, the implementation manner of performing statement screening on a plurality of text statements may include one or more of the following:
in one implementation, in this embodiment, a sentence with a length smaller than a first threshold value in the plurality of text sentences is removed, that is, a short sentence in the text sentence is deleted;
In another implementation, if the number of sentences in the plurality of text sentences is greater than a second threshold, for example exceeds 2000, the sentences other than the target sentences at the target positions are deleted, i.e., the target sentences at the specified target positions are retained and all other sentences are removed; if the number does not exceed the second threshold, no sentence is deleted and all sentences are used for the subsequent abstract obtaining. That is, if there are too many text sentences after sentence splitting, the target sentences at the corresponding target positions may be retained according to prior knowledge of how texts are written (for example, the first and last few sentences of a paragraph are usually the ones that summarize its content); for instance, the first three to five and last three to five text sentences of each paragraph are retained and all sentences in the middle are deleted, as sketched below.
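A minimal sketch of this screening step follows; the delimiter set, minimum length, and keep-count are illustrative assumptions, not values fixed by the patent.

import re

def screen_sentences(paragraphs, min_len=5, max_sentences=2000, keep=3):
    """Split each paragraph into sentences, drop short ones, and, if the text
    is too long, keep only the first/last few sentences of each paragraph."""
    per_para = [re.split(r"(?<=[。！？!?])", p) for p in paragraphs]
    per_para = [[s for s in sents if len(s.strip()) >= min_len]
                for sents in per_para]
    if sum(len(sents) for sents in per_para) > max_sentences:
        per_para = [sents[:keep] + sents[-keep:] if len(sents) > 2 * keep
                    else sents for sents in per_para]
    return [s for sents in per_para for s in sents]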
In one implementation, when the target abstract statement satisfying the abstract extraction condition is extracted from the plurality of initial abstract statements in step 105, the following steps may be implemented, as shown in fig. 4:
step 401: and scoring at least one scoring mode of the initial abstract sentences in the plurality of initial abstract sentences to obtain at least one score of the initial abstract sentences.
There may be one or more scoring manners in this embodiment, such as a similarity-based scoring manner, a maximal-marginal-relevance-based scoring manner, and the like.
For example, in this embodiment a bert model may be used to process each initial abstract sentence to directly obtain the sentence vector of each initial abstract sentence; a cosine-distance algorithm is then applied to the sentence vectors to calculate the similarity between the initial abstract sentences, yielding a sentence similarity set that contains the similarity value between any two initial abstract sentences; finally, the similarity set is ranked with TextRank to obtain a sentence score set of the ranked initial abstract sentences, which contains the ranking score of each initial abstract sentence, that is, the score of the initial abstract sentence under the similarity-based scoring manner;
For another example, in this embodiment a Maximal Marginal Relevance (MMR) algorithm may be used to obtain the sentence diversity scores of the top M sentences, that is, the scores of those initial abstract sentences under the MMR-based scoring manner, while the diversity score assigned to the other sentences is 0, where M is a positive integer greater than or equal to 1.
Step 402: and obtaining a sentence score of the initial abstract sentence according to at least one score.
For example, in this embodiment, at least one score may be weighted and summed according to a preset weight of a corresponding scoring mode, so as to obtain a sentence score of the initial abstract sentence.
The weight values can be preset according to requirements or prior knowledge, and the sentence score of the initial abstract sentence is obtained based on them. A sentence score represents the confidence that the corresponding initial abstract sentence can represent the content of the text.
Step 403: in a plurality of initial abstract sentences, at least one target abstract sentence with a sentence score meeting abstract extraction conditions is obtained.
Wherein the abstract extraction condition may include: the length of the target abstract sentence is smaller than a third threshold, and/or the target abstract sentence ranks in the top N when the initial abstract sentences are sorted by sentence score from high to low, where N is a positive integer greater than or equal to 1.
For example, owing to the length restriction on the target abstract sentences and rules such as the sentence ranking order in the abstract extraction condition, part of the initial abstract sentences are extracted to obtain target abstract sentences meeting the length restriction and ranking order, and these sentences form an abstract that summarizes the content of the text.
Referring to fig. 5, a schematic structural diagram of a text abstract obtaining apparatus according to a second embodiment of the present disclosure is provided; the apparatus may be configured in an electronic device capable of data processing, particularly text processing, such as a computer or a server. The technical scheme in this embodiment is mainly used to ensure that, when the abstract of a text is obtained, the obtained abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness, avoiding the situation that the obtained abstract content is one-sided.
In a specific implementation, the apparatus in this embodiment may include the following units:
a text obtaining unit 501, configured to obtain a text to be processed, where the text includes a plurality of text statements, and a text statement is composed of at least one character;
a vector conversion unit 502, configured to perform vector conversion on characters in a text statement to obtain a word vector and a position vector of the characters;
a vector encoding unit 503, configured to encode the word vector and the position vector by using an encoder to obtain an encoded vector;
a vector decoding unit 504, configured to decode the encoded vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement;
a statement extracting unit 505, configured to extract a target abstract statement satisfying an abstract extracting condition from the initial abstract statement.
With the above technical solution, in the text abstract obtaining apparatus provided in the second embodiment of the present application, after the text to be processed containing a plurality of text sentences is obtained, the characters in each text sentence are subjected to vector conversion, the converted word vectors and position vectors are encoded by an encoder to obtain encoded vectors, the encoded vectors are then decoded by a decoder corresponding to the encoder, and finally the target abstract sentences meeting the abstract extraction condition are extracted from the initial abstract sentences. It can be seen that, in this embodiment, before sentence extraction, vector conversion is performed on the text sentences and the initial abstract sentences are generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.
In one implementation, the vector conversion unit 502 is specifically configured to: processing the text statement to obtain a word list corresponding to the text statement, wherein the word list comprises characters in the text statement; processing the characters in the character list by using a preset bert model to obtain a character vector of the characters; and carrying out position coding on the characters in the character list by using a preset coding function so as to obtain the position vectors of the characters.
In one implementation, the vector encoding unit 503 is specifically configured to: and inputting the word vector and the position vector into a plurality of encoders constructed at least based on a self-attention mechanism and a neural network so as to obtain an encoding vector corresponding to the text statement output by the encoders.
In one implementation, the text obtaining unit 501 is specifically configured to: performing form conversion on data to be processed to obtain an initial text; sentence dividing operation is carried out on the initial text by using the sentence dividing separators to obtain a plurality of text sentences; and carrying out sentence screening on the plurality of text sentences to obtain the text to be processed. For example, eliminating sentences of the plurality of text sentences with lengths smaller than a first threshold value; and/or deleting other sentences except the target sentence at the target position in the plurality of text sentences if the number of sentences in the plurality of text sentences is larger than a second threshold value.
In one implementation, the statement extraction unit 505 is specifically configured to: scoring at least one scoring mode of the initial abstract sentences in the plurality of initial abstract sentences to obtain at least one score of the initial abstract sentences; obtaining a sentence score of the initial abstract sentence at least according to the at least one score, for example, performing weighted summation on the at least one score according to a preset weight value of the corresponding scoring mode to obtain the sentence score of the initial abstract sentence; and obtaining at least one target abstract statement of which the statement score meets abstract extraction conditions from the plurality of initial abstract statements.
Specifically, the abstract extraction condition includes: the length of the target abstract statement is smaller than a third threshold, and/or the target abstract statement ranks in the top N when the initial abstract statements are sorted by statement score from high to low, wherein N is a positive integer greater than or equal to 1.
It should be noted that, for the specific implementation of each unit in the present embodiment, reference may be made to the corresponding content in the foregoing, and details are not described here.
Referring to fig. 6, a schematic structural diagram of an electronic device according to a third embodiment of the present disclosure is provided; the electronic device may be a device capable of data processing, particularly text processing, such as a computer or a server. The technical scheme in this embodiment is mainly used to ensure that, when the abstract of a text is obtained, the obtained abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness, avoiding the situation that the obtained abstract content is one-sided.
In a specific implementation, the electronic device in this embodiment may include the following structure:
a memory 601 for storing applications and data generated by the application operation;
a processor 602 for executing an application to implement: obtaining a text to be processed, wherein the text comprises a plurality of text sentences, and the text sentences comprise at least one character; carrying out vector conversion on characters in a text statement to obtain a word vector and a position vector of the characters; encoding the word vector and the position vector by using an encoder to obtain an encoded vector; decoding the coding vector by using a decoder corresponding to the coder to obtain an initial abstract statement corresponding to the text statement; and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
According to the above scheme, with the electronic device provided in the third embodiment of the present application, after the text to be processed containing a plurality of text sentences is obtained, the characters in each text sentence are subjected to vector conversion, the word vectors and position vectors obtained by the conversion are encoded by an encoder to obtain encoded vectors, the encoded vectors are then decoded by a decoder corresponding to the encoder, and finally the target abstract sentences meeting the abstract extraction condition are extracted from the initial abstract sentences. It can be seen that, in this embodiment, before sentence extraction, vector conversion is performed on the text sentences and the initial abstract sentences are generated by encoding and decoding, so that initial abstract sentences capable of accurately expressing the content of the text sentences are obtained in a generative manner; sentence extraction is performed last, so that the obtained target abstract sentences can express the content of the text comprehensively and accurately while ensuring grammatical correctness.
It should be noted that, the specific implementation of the processor in the present embodiment may refer to the corresponding content in the foregoing, and is not described in detail here.
The technical solution of the present application is exemplified as follows:
First, in the present application, a deep learning model such as seq2seq + Transformer + bert (embedding) is trained on a sentence set to generate a set of summarizing candidate sentences, and abstract sentences are then selected using traditional machine-learning models such as ranking. The model architecture of the deep learning model is shown in fig. 7, wherein:
Part a in fig. 7 is the model architecture of the generative abstract acquisition scheme; the object generated by the trained model is reduced to a single sentence so that the semantic content of the input document is not disturbed, mainly to ensure the integrity of the sentences. Using the bert model to obtain the word vectors can accelerate model convergence and, to a certain extent, prevent the model from failing to converge. Part b of fig. 7 is the model for sentence extraction.
The following exemplifies the training process of the deep learning model in the generative abstract acquisition scheme with reference to the model architecture in fig. 7:
Step one: the training data are processed into text form, illegal characters such as garbled code are removed, the text is then split into sentences according to the sentence delimiters, and sentences that are too short are discarded;
Step two: the sentences (text sentences) in the text are processed into word lists (lists built from word roots and word endings, aiming at maximally recognizing new words), and a bert model is then used to obtain the vector representation of each word;
for example, the original sentence: '上海××股份有限公司持有公司股份70864800股' ('Shanghai ×× Co., Ltd. holds 70864800 shares of the company');
the processed sentence: the single characters of the sentence followed by the sub-word number pieces, e.g. [..., '股', '份', '70', '##86', '##480', '##0', '股'];
then, using the bert original model, a 25 x 768-dimensional word vector is obtained:
[[-0.52868992,-0.33727037,-0.25479231,...,0.82865638,0.72021463,-0.60483735],
[0.07796129,-0.72891465,-0.78627246,...,-0.37028891,-0.15304047,-0.30661304],
[-0.4783215,-0.57373971,-0.817014,...,-0.66768738,-0.25344727,-0.61553605],...,
[-0.60039644,-0.90145559,-0.23709044,...,0.90242452,-0.35032376,-0.71535249],
[-0.07113459,-0.03734476,-0.03696322,...,-0.70003593,0.43044283,-0.37690701],
[0.49816821,-0.27851639,0.86999363,...,-0.96345849,-0.18449764,-0.77752639]]。
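For reference, such word vectors can be obtained with the Hugging Face transformers library as sketched below; the bert-base-chinese checkpoint is an assumption, and the actual values will differ from the illustration above.

import torch
from transformers import BertModel, BertTokenizer

name = "bert-base-chinese"                      # assumed checkpoint
tokenizer = BertTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

inputs = tokenizer("上海××股份有限公司持有公司股份70864800股",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

word_vectors = out.last_hidden_state            # shape (1, num_tokens, 768)
print(word_vectors.shape)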
step three: and coding the position information of each word by adopting a trigonometric function to obtain a position vector.
Position vector encoding method:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{m^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{m^{2i/d_{model}}}\right)$$
wherein 2i represents an even position and 2i+1 an odd position; m represents a constant; and d_model denotes the dimension of the position-encoding vector.
For example: if the input length is 512 words and the encoding vector length is 768, then a 768-dimensional vector is encoded for each of the 512 word positions; the positions are divided into odd and even, the odd positions being encoded with the trigonometric function cos and the even positions with sin; wherein i takes values in the range 0-255, and d_model indexes the basis vectors within the channels [0, 1, …, 767].
The encoded position vector is a vector with dimensions 512 x 768:
[[0.07576533,-0.74129123,0.74230213,...,0.19788399,-0.18362323,-0.82365925],
[-0.79449879,0.54501459,-0.29112336,...,0.50878517,-0.26038884,-0.78666357],
[-0.23927395,0.17397682,0.81768419,...,0.02806539,-0.68603508,0.34588507],...,
[-0.37004809,-0.59452888,-0.54524573,...,0.44491184,0.01405959,-0.8031904],
[0.15480315,0.02079151,0.95436379,...,-0.82639238,0.79291451,0.49712922]]。
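A compact implementation of this encoding, under the standard Transformer convention that sin and cos alternate across the channels of each position, might look as follows (m = 10000 is the usual choice for the constant):

import numpy as np

def position_encoding(seq_len=512, d_model=768, m=10000):
    """Sinusoidal position vectors: one d_model-wide row per word position."""
    pos = np.arange(seq_len)[:, None]          # positions 0..511
    two_i = np.arange(0, d_model, 2)[None, :]  # even channel indices 2i
    angle = pos / np.power(m, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even channels use sin
    pe[:, 1::2] = np.cos(angle)                # odd channels use cos
    return pe

print(position_encoding().shape)               # (512, 768)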
Step four: the sentence vectors formed from the word vectors of step two are added to the position vectors of the words from step three (giving vectors with a dimension of 1 × 512 × 768), and the sum is input into the encoder for encoding;
the total number of the encoders Encoder is 6, the input of each Encoder firstly passes through a self-attention (self-attention) layer, and the self-attention layer helps the Encoder to look at other words in the input sequence in the process of coding the words; the self-attention output is then fed into a fully connected feed forward neural network; the output after encoder encoding is a vector with dimensions of 1 × 512 × 768:
[[[-0.31785784,-0.44023365,-0.98039687,...,-0.16302418,-0.38265433,0.42322826],
[0.19755334,-0.21705407,-0.77999795,...,-0.63782412,-0.53579996,0.18919797],
[-0.37845417,-0.16544959,-0.65037916,...,-0.94554656,0.90543602,-0.06130548],...,
[-0.22345377,0.04282653,0.31751285,...,0.7046039,-0.28627987,-0.9067068],
[-0.03392206,-0.79479858,-0.50932794,...,0.36760341,-0.13086321,0.97047442]]]。
step five: inputting the coding vector in the fourth step into a decoder for decoding training;
each Decoder in the Decoder group also has a hierarchy similar to the encoder, but the Attention layer will have one more encoder-Decoder Attention layer to help the Decoder focus on the word corresponding to the input sentence. The output after decoding by the decoder is a vector with dimensions of 512 x 768:
[[0.86610139,-0.47983355,0.19597517,...,-0.10808993,-0.10989231,-0.83693135],
[-0.35534804,-0.18062721,-0.84655659,...,-0.19404586,-0.50114958,-0.34054017],
[-0.35698987,0.77805902,0.40153797,...,0.67373011,-0.6615155,0.06470368],...,
[-0.31221466,-0.15682832,-0.09495165,...,-0.42604178,0.92401401,0.85389091],
[0.48835136,0.14152029,-0.61680035,...,0.63365447,0.52220171,0.80573693],
[0.28935011,0.34773844,-0.87833781,...,0.76765974,0.31630329,-0.0747671]]。
Step six: the decoder is followed by a fully connected layer and a loss-function layer; the loss is then calculated and back-propagated to update the model until training is finished.
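The training step can be pictured with the sketch below, which stands in a standard Transformer for the encoder-decoder stack described above; the head count, vocabulary size, and optimizer are assumptions, and random tensors replace the real BERT inputs.

import torch
import torch.nn as nn

d_model, vocab = 768, 21128        # 21128 = bert-base-chinese vocabulary (assumed)

model = nn.Transformer(d_model=d_model, nhead=12, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)
to_vocab = nn.Linear(d_model, vocab)   # the fully connected layer after the decoder
loss_fn = nn.CrossEntropyLoss()        # the loss-function layer
opt = torch.optim.Adam(list(model.parameters()) + list(to_vocab.parameters()))

src = torch.randn(1, 512, d_model)       # stand-in: sentence (word + position) vectors
tgt = torch.randn(1, 64, d_model)        # stand-in: shifted target summary vectors
gold = torch.randint(0, vocab, (1, 64))  # stand-in: gold summary token ids

logits = to_vocab(model(src, tgt))       # (1, 64, vocab)
loss = loss_fn(logits.reshape(-1, vocab), gold.reshape(-1))
loss.backward()                          # back-propagation updates the model
opt.step()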
The following describes a processing procedure for obtaining a summary of a document (text) to be processed after the deep learning model training is completed:
Step one: the document data are processed into text form, illegal characters such as garbled code are removed, and the text is split into sentences according to the sentence delimiters; if there are too many sentences after splitting (e.g., more than a threshold of 2000), only the first and last 3-5 sentences of each paragraph are taken; otherwise the full text is used;
Step two: the sentences are processed into word lists, the original bert model is used to obtain the vector representation of each word, and the position vector of each word is obtained with the trigonometric functions;
step three: inputting sentence vectors formed by the word vectors in the step two and position vectors of the words into an encoder for encoding;
Step four: the encoded vectors from step three are input into the decoder and decoded to generate a summarizing sentence for each sentence.
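A greedy version of this decoding loop is sketched below, reusing the model and to_vocab from the training sketch above; embed stands for a hypothetical function mapping token ids to d_model-wide (word plus position) vectors, and a production version would also pass a causal target mask.

import torch

def greedy_decode(model, to_vocab, embed, src_vec, bos_id, eos_id, max_len=64):
    """Generate one summarizing sentence token by token (greedy decoding)."""
    memory = model.encoder(src_vec)            # encode the input sentence once
    ids = [bos_id]
    for _ in range(max_len):
        tgt = embed(torch.tensor([ids]))       # embed the tokens generated so far
        out = model.decoder(tgt, memory)       # decode with encoder-decoder attention
        next_id = to_vocab(out[:, -1]).argmax(-1).item()
        ids.append(next_id)
        if next_id == eos_id:                  # stop at the end-of-sentence token
            break
    return ids[1:]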
Step five: adopting a bert original model to branch and process sentences to directly obtain a sentence vector representation set;
step six: calculating the similarity between sentences by adopting cosine distance according to the sentence vectors in the step five to obtain a sentence similarity set;
Step seven: the sentence similarity set from step six is input into TextRank as the initialization basis for TextRank ranking, and the sentence score set is obtained after TextRank ranking;
Step eight: MMR (Maximal Marginal Relevance) is used to obtain the sentence diversity scores of the top 5 sentences, and the diversity score of the other sentences is set to 0;
step nine: weighting the sentence score sets obtained in the seventh step and the eighth step to obtain final sentence scores;
step ten: and obtaining the target abstract sentences according to the rules of the target abstract length limitation, sentence sequence and the like.
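Steps five to ten can be sketched end to end as below; networkx and scikit-learn are assumed libraries, the MMR step is simplified to a rank-based diversity score, and the weights, M = 5, N, and the length limit are illustrative values.

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

def select_summary(sentences, sent_vecs, w_rank=0.7, w_div=0.3,
                   top_m=5, top_n=3, max_len=60):
    """Rank generated candidate sentences and pick the final abstract."""
    sim = cosine_similarity(sent_vecs)        # step six: sentence similarity set
    np.fill_diagonal(sim, 0.0)

    # Step seven: TextRank, realized here as PageRank over the similarity graph.
    rank = nx.pagerank(nx.from_numpy_array(sim))

    # Step eight (simplified MMR): the top-M sentences get a decaying diversity
    # score; every other sentence gets 0.
    order = sorted(rank, key=rank.get, reverse=True)
    div = {i: 1.0 / (k + 1) if k < top_m else 0.0 for k, i in enumerate(order)}

    # Step nine: weighted sum of the two score sets.
    final = {i: w_rank * rank[i] + w_div * div[i] for i in rank}

    # Step ten: apply the length limit and keep the top-N sentences in order.
    keep = [i for i in sorted(final, key=final.get, reverse=True)
            if len(sentences[i]) < max_len][:top_n]
    return [sentences[i] for i in sorted(keep)]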
In this way, the problem that generative algorithms currently cannot be used directly at the article level is solved, and the combination of deep learning and traditional machine learning makes the generated abstract more summarizing.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text abstract acquisition method comprises the following steps:
obtaining a text to be processed, wherein the text comprises a plurality of text sentences, and the text sentences comprise at least one character;
performing vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters;
encoding the word vector and the position vector by using an encoder to obtain an encoded vector;
decoding the coding vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement;
and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
2. The method of claim 1, performing vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters, comprising:
processing the text statement to obtain a word list corresponding to the text statement, wherein the word list comprises characters in the text statement;
processing the characters in the character list by using a preset bert model to obtain a character vector of the characters;
and carrying out position coding on the characters in the character list by using a preset coding function so as to obtain the position vectors of the characters.
3. The method of claim 1 or 2, encoding the word vector and the position vector with an encoder to obtain an encoded vector, comprising:
and inputting the word vector and the position vector into a plurality of encoders constructed at least based on a self-attention mechanism and a neural network so as to obtain an encoding vector corresponding to the text statement output by the encoders.
4. The method of claim 1 or 2, obtaining text to be processed, comprising:
performing form conversion on data to be processed to obtain an initial text;
sentence dividing operation is carried out on the initial text by using the sentence dividing separators to obtain a plurality of text sentences;
and carrying out sentence screening on the plurality of text sentences to obtain the text to be processed.
5. The method of claim 4, wherein the step of filtering the text sentence to obtain a text to be processed comprises:
eliminating sentences of the plurality of text sentences with the length smaller than a first threshold value;
and/or deleting other sentences except the target sentence at the target position in the plurality of text sentences if the number of sentences in the plurality of text sentences is larger than a second threshold value.
6. The method according to claim 1 or 2, wherein extracting a target abstract statement satisfying an abstract extraction condition from the plurality of initial abstract statements comprises:
scoring at least one scoring mode of the initial abstract sentences in the plurality of initial abstract sentences to obtain at least one score of the initial abstract sentences;
obtaining a sentence score of the initial abstract sentence at least according to the at least one score;
and obtaining at least one target abstract statement of which the statement score meets abstract extraction conditions from the plurality of initial abstract statements.
7. The method of claim 6, said obtaining a sentence score of the initial summary sentence based at least on the at least one score, comprising:
and carrying out weighted summation on the at least one score according to a preset weight value of the corresponding scoring mode to obtain the sentence score of the initial abstract sentence.
8. The method of claim 6, the abstract extraction condition comprising: the length of the target abstract statement is smaller than a third threshold, and/or the target abstract statement ranks in the top N when the initial abstract statements are sorted by statement score from high to low, wherein N is a positive integer greater than or equal to 1.
9. An apparatus for obtaining a summary of a text, comprising:
the text processing device comprises a text obtaining unit, a processing unit and a processing unit, wherein the text obtaining unit is used for obtaining a text to be processed, the text comprises a plurality of text sentences, and the text sentences comprise at least one character;
the vector conversion unit is used for carrying out vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters;
the vector coding unit is used for coding the word vector and the position vector by using a coder to obtain a coded vector;
a vector decoding unit, configured to decode the encoded vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement;
and the statement extraction unit is used for extracting the target abstract statement meeting the abstract extraction condition from the initial abstract statement.
10. An electronic device, comprising:
the memory is used for storing an application program and data generated by the running of the application program;
a processor for executing the application to implement: obtaining a text to be processed, wherein the text comprises a plurality of text sentences, and the text sentences comprise at least one character; performing vector conversion on characters in the text statement to obtain a word vector and a position vector of the characters; encoding the word vector and the position vector by using an encoder to obtain an encoded vector; decoding the coding vector by using a decoder corresponding to the encoder to obtain an initial abstract statement corresponding to the text statement; and extracting target abstract sentences meeting abstract extraction conditions from the initial abstract sentences.
CN202010387665.8A 2020-05-09 2020-05-09 Text abstract obtaining method and device and electronic equipment Pending CN111581374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010387665.8A CN111581374A (en) 2020-05-09 2020-05-09 Text abstract obtaining method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387665.8A CN111581374A (en) 2020-05-09 2020-05-09 Text abstract obtaining method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111581374A true CN111581374A (en) 2020-08-25

Family

ID=72113468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387665.8A Pending CN111581374A (en) 2020-05-09 2020-05-09 Text abstract obtaining method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111581374A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280112A (en) * 2017-06-22 2018-07-13 腾讯科技(深圳)有限公司 Abstraction generating method, device and computer equipment
CN108427771A (en) * 2018-04-09 2018-08-21 腾讯科技(深圳)有限公司 Summary texts generation method, device and computer equipment
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 An official-document summary generation model combining extractive and generative approaches
CN110222188A (en) * 2019-06-18 2019-09-10 深圳司南数据服务有限公司 A multi-task learning based company announcement processing method and server side

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347240A (en) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and device, readable storage medium and electronic equipment
CN112183078A (en) * 2020-10-22 2021-01-05 上海风秩科技有限公司 Text abstract determining method and device
CN112183078B (en) * 2020-10-22 2023-01-10 上海风秩科技有限公司 Text abstract determining method and device
CN112347758A (en) * 2020-11-06 2021-02-09 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN112347758B (en) * 2020-11-06 2024-05-17 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
WO2022142121A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Abstract sentence extraction method and apparatus, and server and computer-readable storage medium
CN112883721B (en) * 2021-01-14 2024-01-19 科技日报社 New word recognition method and device based on BERT pre-training model
CN112883721A (en) * 2021-01-14 2021-06-01 科技日报社 Method and device for recognizing new words based on BERT pre-training model
CN113487088A (en) * 2021-07-06 2021-10-08 哈尔滨工业大学(深圳) Traffic prediction method and device based on a dynamic spatio-temporal graph convolutional attention model
CN113434642A (en) * 2021-08-27 2021-09-24 广州云趣信息科技有限公司 Text abstract generation method and device and electronic equipment
CN114741499A (en) * 2022-06-08 2022-07-12 杭州费尔斯通科技有限公司 Text abstract generation method and system based on sentence semantic model
CN114741499B (en) * 2022-06-08 2022-09-06 杭州费尔斯通科技有限公司 Text abstract generation method and system based on sentence semantic model
CN116108163A (en) * 2023-04-04 2023-05-12 之江实验室 Text matching method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111581374A (en) Text abstract obtaining method and device and electronic equipment
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN112464993B (en) Multi-mode model training method, device, equipment and storage medium
CN109711121B (en) Text steganography method and device based on Markov model and Huffman coding
CN110163181B (en) Sign language identification method and device
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
Chitnis et al. Variable-length word encodings for neural translation models
CN109740158B (en) Text semantic parsing method and device
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN111401037B (en) Natural language generation method and device, electronic equipment and storage medium
CN112380319A (en) Model training method and related device
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
CN110942774A (en) Man-machine interaction system, and dialogue method, medium and equipment thereof
CN115908641A (en) Text-to-image generation method, device and medium based on features
CN112949255A (en) Word vector training method and device
CN111723194A (en) Abstract generation method, device and equipment
CN116663501A (en) Chinese variant text conversion method based on multi-modal sharing weight
CN115204181A (en) Text detection method and device, electronic equipment and computer readable storage medium
CN112686059B (en) Text translation method, device, electronic equipment and storage medium
CN115019137A (en) Video language event prediction method and device based on multi-scale dual-stream attention
WO2021042234A1 (en) Application introduction method, mobile terminal, and server
CN110825869A (en) Text abstract generation method using a variational generative decoder with a copy mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination