CN111581341A - Method for acquiring text abstract and language model generation method - Google Patents

Method for acquiring text abstract and language model generation method

Info

Publication number
CN111581341A
Authority
CN
China
Prior art keywords
sentence
clause
clauses
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010318584.2A
Other languages
Chinese (zh)
Inventor
陈栋
付骁弈
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd filed Critical Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202010318584.2A priority Critical patent/CN111581341A/en
Publication of CN111581341A publication Critical patent/CN111581341A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a method for acquiring a text abstract, a language model generation method, a computer storage medium, and a terminal.

Description

Method for acquiring text abstract and language model generation method
Technical Field
The present disclosure relates to, but not limited to, natural language analysis technologies, and in particular, to a method for obtaining a text abstract, a method for generating a language model, a computer storage medium, and a terminal.
Background
A text abstract summarizes one or more documents as concisely as possible while conveying their important content. A high-quality text abstract plays an important role in information retrieval; for example, using the abstract in place of the original document for indexing can effectively shorten retrieval time, reduce redundant information in retrieval results, and improve the user experience.
Automatic text summarization is an important research topic in natural language processing. According to how the abstract is produced, automatic text summarization can be divided into extractive, abstractive (generative), and compressive summarization. Extractive summarization builds the abstract by computing weights for the sentence components of the original text and extracting ready-made sentences from it, so its grammatical error rate is low and the quality of the abstract is guaranteed to a certain extent. To compute the weights of sentence components in the original text, the sentences must first be given vector expressions. Common vector expression models include the word vector model (Word2Vec) and pre-trained language models. A pre-trained language model is trained under particular linguistic hypotheses and can map sentences directly to vector expressions while accounting for word similarity and word order through mechanisms in the model (bidirectionality, attention, and the like); examples include ELMo (Embeddings from Language Models) and BERT (Bidirectional Encoder Representations from Transformers).
In the related art, after sentences are mapped to vector expressions by a pre-trained language model, the weight of each sentence in the original text is calculated from its vector expression, and sentences are extracted according to the calculated weights to obtain the text abstract. On reviewing the abstracts obtained this way, the inventors found that sentence ordering is not considered during extraction, so the resulting abstracts have ordering problems and the extraction quality remains to be improved.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
Embodiments of the invention provide a method for acquiring a text abstract, a language model generation method, a computer storage medium, and a terminal, which take sentence ordering into account during text abstract extraction and improve extraction quality.
The embodiment of the invention provides a language model generation method, which comprises the following steps:
for a training text whose clause order has been adjusted, generating a sentence feature vector for each clause according to a preset generation strategy;
processing the generated sentence feature vectors with a preset feature extractor to obtain an output vector for each clause;
determining, from the obtained output vectors, the sentence ordering information of the order-adjusted clauses;
adjusting the parameters of the feature extractor according to standard ordering information and the determined sentence ordering information to obtain a language model for vector expression;
wherein the sentence feature vector includes character-embedding feature information, feature information distinguishing adjacent clauses, and feature information identifying the word order within a clause; the standard ordering information is the number-ordering information of the training text before the clause order was adjusted, generated by numbering its clauses in sequence; and the sentence ordering information is the number-ordering information of the clauses of the order-adjusted training text, generated from the same clause numbers.
In another aspect, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by a processor, implements the language model generation method described above.
In another aspect, an embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements the language model generation method as described above.
In another aspect, an embodiment of the present invention further provides a method for obtaining a text abstract, including:
performing vector expression on each clause of a text to be processed according to a pre-generated language model;
calculating the weight of each clause in the text to be processed according to its vector expression;
extracting sentences from the text to be processed according to the calculated weights to obtain a text abstract.
In a further aspect, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by a processor, implements the method for obtaining a text abstract described above.
In another aspect, an embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having a computer program stored therein; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements a method of obtaining a text excerpt as described above.
According to the embodiments above, sentence feature vectors are generated for the clauses after the clause order is adjusted; once the sentence ordering information is determined from these feature vectors, the parameters of the feature extractor are adjusted according to the standard ordering information and the determined sentence ordering information. This yields a language model for vector expression that considers the influence of sentence ordering on sentence weight, providing technical support for improving the extraction quality of text abstracts.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a flow chart of a method for generating a language model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for obtaining a text abstract according to an embodiment of the present invention;
FIG. 3 is a block diagram of a language model generation apparatus according to an embodiment of the present invention;
fig. 4 is a block diagram of an apparatus for obtaining a text abstract according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Through analysis, the inventors of the present application found that the vector expressions obtained by pre-trained language models in the related art do not consider the influence of sentence order in the original text on sentence weight, which affects the text abstract extraction result.
Fig. 1 is a flowchart of a language model generation method according to an embodiment of the present invention, as shown in fig. 1, including:
Step 101, for a training text whose clause order has been adjusted, generating a sentence feature vector for each clause according to a preset generation strategy;
the training text in the embodiment of the invention includes, but is not limited to, the text to be processed from which a text abstract is to be extracted.
In an exemplary embodiment, for the clauses included in the training text, the sentence arrangement order may be adjusted through a preset random function;
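As a minimal illustrative sketch (not part of the patent text), adjusting the clause order with a random function can look like the following; the function and parameter names are hypothetical:

```python
import random

def shuffle_clauses(clauses, seed=None):
    """Shuffle the clauses of a training text with a random function.

    Returns the shuffled clauses together with `order`, where order[i]
    is the original (pre-shuffle) number of the clause now at position i,
    i.e. the ordering target the model is trained to recover.
    """
    rng = random.Random(seed)
    order = list(range(len(clauses)))
    rng.shuffle(order)
    return [clauses[i] for i in order], order
```

Recording `order` alongside the shuffled clauses is what later makes the standard ordering information available for the loss computation.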
in one illustrative example, the sentence feature vector includes: character-embedding feature information, feature information distinguishing adjacent clauses, and feature information identifying the word order within the clause;
in an exemplary embodiment, generating a sentence feature vector of each clause according to a preset generation strategy includes:
respectively adding preset start-stop marks to each clause of the training text with the adjusted sentence arrangement sequence;
embedding each clause added with the start-stop mark;
obtaining a sentence characteristic vector of each clause according to the embedding processing result of each clause;
wherein, start-stop sign includes: a start identity and a stop identity.
In an exemplary embodiment, the embedding process is performed on each clause added with the start-stop mark, and the embedding process includes:
performing character embedding on the clauses added with the start-stop marks;
for the clauses added with the start-stop marks, distinguishing the marks according to preset clauses and embedding the marks into segments;
embedding the in-sentence positions of the words in the clauses according to the preset in-sentence word sequencing marks for the clauses added with the start-stop marks;
wherein, the sentence distinguishing mark comprises: the marks are used for distinguishing adjacent clauses, and the clause distinguishing marks of words in the same clause are the same; the intra-sentence word ordering identification comprises the following steps: and the mark is used for distinguishing the arrangement sequence of each word in the clause.
It should be noted that the token embedding, segment embedding, and intra-sentence position embedding described above may be performed in any order.
In one illustrative example, obtaining the sentence feature vector of each clause from its embedding result includes:
summing the token embedding, segment embedding, and intra-sentence position embedding results of each clause to obtain its sentence feature vector.
Suppose the token embedding, segment embedding, and intra-sentence position embedding results of the start identifier are Token Embedding of <CLS>, Segment Embedding of <CLS>, and Inner Sentence Position Embedding of <CLS>, respectively; the embodiment of the invention may sum the three results as follows:
E1 = Token Embedding of <CLS> + Segment Embedding of <CLS> + Inner Sentence Position Embedding of <CLS>.
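The element-wise sum of the three embeddings can be sketched as follows; this is an illustrative toy implementation with hypothetical names, and the embedding tables would in practice be learned parameters:

```python
def sentence_feature_vectors(tokens, segment_id, token_emb, seg_emb, pos_emb):
    """Sum token, segment, and intra-sentence position embeddings
    element-wise for each token of a clause, mirroring
    E1 = token embedding + segment embedding + inner-sentence position embedding."""
    return [
        [t + s + p
         for t, s, p in zip(token_emb[tok], seg_emb[segment_id], pos_emb[pos])]
        for pos, tok in enumerate(tokens)
    ]
```

Every token of the same clause receives the same segment embedding, while the position embedding varies with its intra-sentence position.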
Step 102, processing the generated sentence feature vectors with a preset feature extractor to obtain an output vector for each clause;
in an exemplary embodiment, the feature extractor includes any one of the following model structures:
stacked Transformer encoders, a convolutional neural network (CNN), a recurrent neural network (RNN), or a long short-term memory network (LSTM).
In the embodiment of the invention, the input to the feature extractor is the sentence feature vectors obtained from the clauses of the order-adjusted training text, so the output vectors produced by the feature extractor contain clause-ordering information.
Step 103, determining, from the obtained output vectors, the sentence ordering information of the order-adjusted clauses;
in an exemplary embodiment, determining the sentence ordering information of the order-adjusted clauses includes:
masking the start identifiers in the obtained output vectors of the clauses;
decoding the masked output vectors with a preset decoder to obtain the sentence ordering information;
wherein the decoder includes a pointer-network decoder.
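A pointer-network decoder emits, step by step, the index of the clause it attends to most. The following greedy sketch is purely illustrative: a fixed list of scores stands in for the attention logits a trained decoder would recompute at every step, and the function name is hypothetical:

```python
def greedy_pointer_order(scores):
    """Greedily emit clause indices: at each step, pick the
    highest-scoring clause that has not yet been selected.
    This mimics the selection pattern of a pointer-network decoder,
    whose real per-step attention scores are replaced here by a
    fixed score list for illustration."""
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        order.append(best)
        remaining.remove(best)
    return order
```

The resulting index sequence is the predicted sentence ordering information compared against the standard ordering information during training.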
Step 104, adjusting the parameters of the feature extractor according to standard ordering information and the determined sentence ordering information to obtain a language model for vector expression;
in one illustrative example, the standard ordering information is the number-ordering information of the training text before the clause order was adjusted, generated by numbering its clauses in sequence; the sentence ordering information is the number-ordering information of the clauses of the order-adjusted training text, generated from the same clause numbers.
In the embodiment of the invention, sentence feature vectors are generated for the clauses after the sentence order is adjusted; once the sentence ordering information is determined from these feature vectors, the parameters of the feature extractor are adjusted according to the standard ordering information and the determined sentence ordering information, yielding a language model for vector expression that considers the influence of sentence order on sentence weight and providing technical support for improving the extraction quality of text abstracts.
In one illustrative example, the parameter adjustment of the feature extractor includes:
adjusting the parameters of the feature extractor through back-propagation, according to the loss and gradient information of the standard ordering information and the determined sentence ordering information;
and, during parameter adjustment, taking the feature extractor as the language model for vector expression once the loss between the standard ordering information and the sentence ordering information reaches a minimum and remains unchanged for a preset number of cycles.
The embodiment of the invention calculates the loss between the sentence ordering information and the standard ordering information with a cross-entropy loss function; the number of cycles can be set empirically by the skilled person, for example 3-5 cycles. The specific process is as follows: while adjusting the parameters of the feature extractor through back-propagation, determine whether the loss between the standard ordering information and the sentence ordering information in each adjustment cycle is the minimum value; when the loss is at its minimum and the losses of the subsequent preset number of adjustment cycles are all greater than or equal to that minimum, the model is judged to have converged, and the feature extractor at that point is taken as the language model for vector expression. A language model selected by this loss criterion considers the influence of sentence order on sentence weight and provides technical support for improving the extraction quality of text abstracts.
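The convergence criterion above can be sketched as a small helper; this is an illustrative sketch with hypothetical names, not the patent's implementation:

```python
def should_stop(loss_history, patience=3):
    """Return True once the minimum loss has stopped improving for
    `patience` consecutive cycles (the 'preset period', e.g. 3-5 cycles):
    the feature extractor at that point is taken as the language model."""
    if len(loss_history) <= patience:
        return False
    best = min(loss_history[:-patience])
    # converged if none of the last `patience` losses beat the earlier minimum
    return all(loss >= best for loss in loss_history[-patience:])
```

The `>=` comparison matches the text's "greater than or equal to the minimum" convergence test.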
An embodiment of the invention also provides a computer storage medium storing a computer program which, when executed by a processor, implements the language model generation method described above.
An embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having stored therein a computer program; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements the language model generation method described above.
Fig. 2 is a flowchart of a method for obtaining a text abstract according to an embodiment of the present invention; as shown in Fig. 2, the method includes:
step 201, performing vector expression on each clause of a text to be processed according to a pre-generated language model;
after the text to be processed is input into the language model, the language model processes each clause of the text to be processed, and the clause is mapped into vector expression. The vector expression obtained by the embodiment of the invention not only considers the similarity and word sequence relation among words, but also considers the sequencing information of the clauses.
Step 202, calculating the weight of each clause in the text to be processed according to the vector expression of each clause;
After the vector expression of each clause is obtained, the weight of each clause in the text to be processed can be calculated with reference to related-art principles.
And step 203, extracting sentences from the text to be processed according to the calculated weight of each clause to obtain a text abstract.
After the weight of each clause has been calculated, the text abstract can be obtained with reference to existing weight-based sentence extraction processing.
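One common form of weight-based extraction, shown here as an illustrative sketch (the patent does not fix the exact selection rule, and the names are hypothetical), picks the highest-weight clauses and then restores their original document order:

```python
def extract_summary(clauses, weights, k=2):
    """Extract the k highest-weight clauses, then emit them in their
    original document order so the abstract reads in source order."""
    top = sorted(range(len(clauses)), key=lambda i: weights[i], reverse=True)[:k]
    return [clauses[i] for i in sorted(top)]
```

Re-sorting the selected indices is what keeps the extracted abstract consistent with the sentence ordering the language model was trained to respect.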
In an exemplary embodiment, the language model for vector expression can be obtained with reference to steps 101-104 above.
In the embodiment of the invention, sentence feature vectors are generated for the clauses after the clause order is adjusted; once the sentence ordering information is determined from these feature vectors, the parameters of the feature extractor are adjusted according to the standard ordering information and the determined sentence ordering information, yielding a language model for vector expression that considers the influence of sentence ordering and inter-sentence information on sentence weight and improves the extraction quality of text abstracts.
An embodiment of the invention also provides a computer storage medium storing a computer program which, when executed by a processor, implements the method for obtaining a text abstract described above.
An embodiment of the present invention further provides a terminal, including: a memory and a processor, the memory having stored therein a computer program; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements the method for obtaining a text abstract described above.
Fig. 3 is a block diagram of a language model generation apparatus according to an embodiment of the present invention; as shown in Fig. 3, the apparatus includes: a feature vector unit, a feature extraction unit, a determination sorting unit, and a parameter adjusting unit; wherein:
the feature vector unit is configured to: for a training text whose clause order has been adjusted, generate a sentence feature vector for each clause according to a preset generation strategy;
wherein the sentence feature vector includes: character-embedding feature information, feature information distinguishing adjacent clauses, and feature information identifying the word order within the clause;
in one illustrative example, the feature vector unit is configured to:
add a preset start-stop identifier to each clause of the training text whose clause order has been adjusted;
perform embedding processing on each clause to which the start-stop identifier has been added;
obtain the sentence feature vector of each clause from its embedding result;
wherein the start-stop identifier includes a start identifier and a stop identifier.
In an exemplary embodiment, the embedding processing performed by the feature vector unit on each clause to which the start-stop identifier has been added includes:
performing token embedding on the clauses;
performing segment embedding on the clauses according to preset clause-distinguishing identifiers;
performing intra-sentence position embedding on the words of the clauses according to preset intra-sentence word-order identifiers;
wherein the clause-distinguishing identifiers distinguish adjacent clauses, with the words of the same clause sharing one identifier, and the intra-sentence word-order identifiers distinguish the order of the words within a clause.
In one illustrative example, the feature vector unit obtains the sentence feature vector of each clause from its embedding result by:
summing the token embedding, segment embedding, and intra-sentence position embedding results of each clause to obtain its sentence feature vector.
The feature extraction unit is configured to: process the generated sentence feature vectors with a preset feature extractor to obtain an output vector for each clause;
the determination sorting unit is configured to: determine, from the obtained output vectors, the sentence ordering information of the order-adjusted clauses;
in one illustrative example, the determination sorting unit is configured to:
mask the start identifiers in the obtained output vectors of the clauses;
decode the masked output vectors with a preset decoder to obtain the sentence ordering information;
wherein the decoder includes a pointer-network decoder.
The parameter adjusting unit is configured to: adjust the parameters of the feature extractor according to standard ordering information and the determined sentence ordering information to obtain a language model for vector expression;
wherein the standard ordering information is the number-ordering information of the training text before the clause order was adjusted, generated by numbering its clauses in sequence; the sentence ordering information is the number-ordering information of the clauses of the order-adjusted training text, generated from the same clause numbers.
In an exemplary embodiment, the parameter adjusting unit is configured to:
adjust the parameters of the feature extractor through back-propagation, according to the loss and gradient information of the standard ordering information and the determined sentence ordering information;
and, during parameter adjustment, take the feature extractor as the language model for vector expression once the loss between the standard ordering information and the sentence ordering information reaches a minimum and remains unchanged for a preset number of cycles.
In the embodiment of the invention, sentence feature vectors are generated for the clauses after the clause order is adjusted; once the sentence ordering information is determined from these feature vectors, the parameters of the feature extractor are adjusted according to the standard ordering information and the determined sentence ordering information, yielding a language model for vector expression that considers the influence of sentence ordering and inter-sentence information on sentence weight and providing technical support for improving the extraction quality of text abstracts.
Fig. 4 is a block diagram of an apparatus for obtaining a text abstract according to an embodiment of the present invention; as shown in Fig. 4, the apparatus includes: a feature vector unit, a feature extraction unit, a determination sorting unit, a parameter adjusting unit, a vector expression unit, a weight calculating unit, and an abstract extraction unit; wherein:
the feature vector unit is configured to: for a training text whose clause order has been adjusted, generate a sentence feature vector for each clause according to a preset generation strategy;
wherein the sentence feature vector includes: character-embedding feature information, feature information distinguishing adjacent clauses, and feature information identifying the word order within the clause;
in one illustrative example, the feature vector unit is configured to:
add a preset start-stop identifier to each clause of the training text whose clause order has been adjusted;
perform embedding processing on each clause to which the start-stop identifier has been added;
obtain the sentence feature vector of each clause from its embedding result;
wherein the start-stop identifier includes a start identifier and a stop identifier.
In an exemplary embodiment, the embedding processing performed by the feature vector unit on each clause to which the start-stop identifier has been added includes:
performing token embedding on the clauses;
performing segment embedding on the clauses according to preset clause-distinguishing identifiers;
performing intra-sentence position embedding on the words of the clauses according to preset intra-sentence word-order identifiers;
wherein the clause-distinguishing identifiers distinguish adjacent clauses, with the words of the same clause sharing one identifier, and the intra-sentence word-order identifiers distinguish the order of the words within a clause.
In one illustrative example, the feature vector unit is configured to obtain the sentence feature vector of each clause from the embedding result of that clause by:
summing the word-embedding, segment-embedding, and intra-sentence position-embedding results of each clause to obtain its sentence feature vector.
The feature extraction unit is configured to: process the generated sentence feature vectors through a preset feature extractor to obtain an output vector for each clause;
the ordering determination unit is configured to: determine, from the obtained output vectors, sentence ordering information of the clauses after the order adjustment;
in one illustrative example, the ordering determination unit is configured to:
mask the start identifiers in the obtained output vectors of the clauses;
decode the masked output vectors through a preset decoder to obtain the sentence ordering information;
wherein the decoder includes: a pointer network decoder.
The parameter adjustment unit is configured to: adjust parameters of the feature extractor according to standard ordering information and the determined sentence ordering information, to obtain a language model for vector expression;
wherein the standard ordering information includes: the numbering ordering information of all clauses of the training text before the clause order is adjusted, generated by sequentially numbering each clause; and the sentence ordering information includes: the numbering ordering information of all clauses of the training text after the clause order is adjusted, generated based on the numbers added to each clause.
In an exemplary embodiment, the parameter adjustment unit is configured to:
adjust the parameters of the feature extractor through back propagation, according to the loss and gradient information between the standard ordering information and the determined sentence ordering information;
and, during parameter adjustment, when the loss value between the standard ordering information and the sentence ordering information is judged to reach a minimum and remain unchanged for a preset period, take the feature extractor as the language model for vector expression.
The vector expression unit is configured to: perform vector expression on each clause of a text to be processed according to the pre-generated language model;
the weight calculation unit is configured to: calculate the weight of each clause in the text to be processed according to the vector expression of each clause;
the abstract extraction unit is configured to: extract sentences from the text to be processed according to the calculated weight of each clause, to obtain a text abstract.
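The weight-calculation and abstract-extraction units can be sketched as follows. This is a minimal illustration only: the cosine-similarity-to-document-mean weighting and the `extract_abstract` helper name are assumptions for demonstration, since the patent does not fix a particular weighting formula.

```python
import numpy as np

def extract_abstract(clauses, clause_vecs, k=2):
    """Toy sketch: weight each clause by cosine similarity between its
    vector expression and the mean document vector, then keep the top-k
    clauses in their original order.  The weighting scheme here is an
    illustrative assumption, not the patented formula."""
    doc = clause_vecs.mean(axis=0)
    norms = np.linalg.norm(clause_vecs, axis=1) * np.linalg.norm(doc)
    weights = clause_vecs @ doc / norms
    top = sorted(np.argsort(weights)[::-1][:k])   # restore original order
    return [clauses[i] for i in top]

# Three clauses with 2-d "vector expressions" (illustrative values):
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
summary = extract_abstract(["c1", "c2", "c3"], vecs, k=2)
```

Keeping the selected clauses in original document order (the `sorted(...)` step) is what makes the output read as an extractive abstract rather than a ranked list.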
the method and the device for extracting the text abstract generate the sentence characteristic vectors of each clause after the clause sorting is adjusted, after the sentence sorting information is determined according to the sentence characteristic vectors, the parameter adjustment is carried out on the characteristic extractor according to the standard sorting information and the determined sentence sorting information, the language model for vector expression considering the influence of the sentence sorting and the information among the sentences on the sentence weight is obtained, and the extraction quality of the text abstract is improved.
The following briefly describes an embodiment of the present invention by way of an application example, which is intended only to illustrate the present invention and not to limit its scope.
Application example
The application example splits the training text into clauses, generates standard ordering information <S1, S2, S3 … SN> from those clauses, and then adjusts the clause order according to a preset order-adjustment strategy. The order-adjustment strategy may be chosen according to the data volume of the training text; generally, the larger the data volume, the fewer times the clause order needs to be adjusted;
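The numbering and order-adjustment step can be sketched in a few lines. `make_training_pair` is a hypothetical helper name, and the uniform shuffle stands in for whichever preset order-adjustment strategy is actually chosen:

```python
import random

def make_training_pair(clauses, seed=0):
    """Number the clauses in their original order (the standard ordering
    information <S1, S2, ... SN>), then shuffle them to produce the
    order-adjusted training text.  Returns the shuffled clauses together
    with the target ordering the model must recover."""
    numbered = list(enumerate(clauses))        # [(0, S1), (1, S2), ...]
    rng = random.Random(seed)                  # deterministic for the demo
    rng.shuffle(numbered)
    shuffled = [clause for _, clause in numbered]
    # For each position in the shuffled text, the original clause number:
    target_order = [idx for idx, _ in numbered]
    return shuffled, target_order

shuffled, target = make_training_pair(["S1", "S2", "S3"], seed=1)
```

The `target_order` list is exactly the supervision signal: a permutation the ordering decoder is later trained to predict.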
after the clause order has been adjusted, a corresponding start identifier and stop identifier are added to each clause of the training text;
in one illustrative example, adding the start and stop identifiers includes: adding a start-identifier token <CLS> before each clause, and adding a stop-identifier token <SEP> at the end of each clause;
assume the training text contains clauses S1, S2, and S3, reordered to S2, S3, S1 after the order adjustment; assume clause S2 contains words W1 and W2, clause S3 contains words W3, W4, and W5, and clause S1 contains words W6, W7, and W8. After the start-stop identifiers are added, the clauses become:
S2 = [<CLS>, W1, W2, <SEP>];
S3 = [<CLS>, W3, W4, W5, <SEP>];
S1 = [<CLS>, W6, W7, W8, <SEP>];
the resulting training text after the clause-order adjustment is: [S2, S3, S1] = [<CLS>, W1, W2, <SEP>, <CLS>, W3, W4, W5, <SEP>, <CLS>, W6, W7, W8, <SEP>];
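The wrapping-and-flattening step above can be reproduced mechanically; `add_start_stop` is a hypothetical helper name:

```python
def add_start_stop(clause_tokens):
    """Wrap each clause's word list with the start identifier <CLS> and the
    stop identifier <SEP>, then flatten into a single token sequence."""
    wrapped = [["<CLS>"] + words + ["<SEP>"] for words in clause_tokens]
    flat = [tok for clause in wrapped for tok in clause]
    return wrapped, flat

# The adjusted order [S2, S3, S1] from the example above:
clauses = [["W1", "W2"], ["W3", "W4", "W5"], ["W6", "W7", "W8"]]
wrapped, flat = add_start_stop(clauses)
```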
in an exemplary embodiment, the application example generates a sentence feature vector for each clause according to a preset generation strategy;
in one illustrative example, generating the sentence feature vector of each clause includes:
performing word embedding on the clauses to which the start-stop identifiers have been added;
performing segment embedding on those clauses according to preset clause-distinguishing identifiers;
performing intra-sentence position embedding on the words of those clauses according to preset intra-sentence word-ordering identifiers;
and summing the word-embedding, segment-embedding, and intra-sentence position-embedding results of each clause to obtain its sentence feature vector.
Word embedding (Token Embedding) in this application example vectorizes each word and each start-stop identifier: taking S2 = [<CLS>, W1, W2, <SEP>] as an example, each token is looked up in a word-embedding matrix to obtain its embedding vector.
Segment embedding (Segment Embedding) in this application example is used to distinguish the individual clauses within the same text. The clause-distinguishing identifier used in segment embedding may be any identifier capable of distinguishing adjacent clauses; within the same clause, all words share the same clause-distinguishing identifier. For the adjusted text [<CLS>, W1, W2, <SEP>, <CLS>, W3, W4, W5, <SEP>, <CLS>, W6, W7, W8, <SEP>], the clause-distinguishing identifier sequence may be: [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], or, alternating between adjacent clauses, [1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]; the identifiers of different clauses may also be replaced with other numbers, letters, and the like. Once the clause-distinguishing identifier sequence is determined, it is vectorized through a segment-embedding matrix; this vectorization can be implemented by analogy with the word-embedding design.
Intra-sentence position embedding (Inner Sentence Position Embedding) in this application example expresses the position information of the words within each clause; the intra-sentence word-ordering identifier may be any identifier, chosen empirically by those skilled in the art, that distinguishes the order of the words within a clause, including but not limited to numeric and letter identifiers. Taking the adjusted text [<CLS>, W1, W2, <SEP>, <CLS>, W3, W4, W5, <SEP>, <CLS>, W6, W7, W8, <SEP>] as an example, the identifier sequence produced by numeric intra-sentence word ordering may be: [0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]; the sequence generated by the intra-sentence word-ordering identifiers is then vectorized through a position-embedding matrix, which can likewise be implemented by analogy with the word-embedding design.
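Both identifier sequences from the two examples above can be generated with one pass over the wrapped clauses; `segment_and_position_ids` is a hypothetical helper name:

```python
def segment_and_position_ids(clause_tokens):
    """For clauses already wrapped with <CLS>/<SEP>, emit the clause-
    distinguishing identifier sequence (the same id for every token of a
    clause) and the intra-sentence position sequence (restarting at 0 for
    each clause)."""
    seg_ids, pos_ids = [], []
    for seg, tokens in enumerate(clause_tokens, start=1):
        seg_ids.extend([seg] * len(tokens))
        pos_ids.extend(range(len(tokens)))
    return seg_ids, pos_ids

wrapped = [["<CLS>", "W1", "W2", "<SEP>"],
           ["<CLS>", "W3", "W4", "W5", "<SEP>"],
           ["<CLS>", "W6", "W7", "W8", "<SEP>"]]
seg_ids, pos_ids = segment_and_position_ids(wrapped)
```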
The application example sums the word-embedding, segment-embedding, and intra-sentence position-embedding results of each clause to obtain the sentence feature vector, for example:
E1 = Token Embedding of <CLS> + Segment Embedding of <CLS> + Inner Sentence Position Embedding of <CLS>.
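The three-way accumulation can be sketched with NumPy; the vocabulary, matrix sizes, and random initialization below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # embedding dimension (illustrative)
vocab = {"<CLS>": 0, "<SEP>": 1, "W1": 2, "W2": 3}
token_emb = rng.normal(size=(len(vocab), d)) # Token Embedding matrix
seg_emb = rng.normal(size=(4, d))            # Segment Embedding matrix
pos_emb = rng.normal(size=(16, d))           # Inner Sentence Position matrix

tokens = ["<CLS>", "W1", "W2", "<SEP>"]      # clause S2 from the example
seg_ids = [1, 1, 1, 1]
pos_ids = [0, 1, 2, 3]

# E = token embedding + segment embedding + intra-sentence position embedding
E = np.stack([token_emb[vocab[t]] + seg_emb[s] + pos_emb[p]
              for t, s, p in zip(tokens, seg_ids, pos_ids)])
```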
the application example inputs the generated sentence feature vectors into a preset feature extractor to obtain output vectors;
the model structure of the feature extractor of the application example can adopt an overlapped translation encoder (transformer encoder), and the number of overlapped layers can be customized according to task complexity. In addition, the feature extractor can also adopt other structures such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a long-short term memory network (LSTM) and the like; the input and output of the characteristic extractor are in one-to-one correspondence;
After masking the start identifiers in the output vectors, the application example obtains sentence ordering information through a preset decoder; the start-identifier mask (Sentence CLS Masking) filters the <CLS> identifiers in the model output vectors: <CLS> is the start identifier of a single clause and also serves as the vector representation of that clause.
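Selecting the <CLS>-position outputs as clause representations can be sketched as follows (the dummy `outputs` array is illustrative):

```python
import numpy as np

def clause_representations(outputs, tokens):
    """Keep only the feature-extractor outputs at <CLS> positions; each
    <CLS> output vector serves as the representation of its clause."""
    cls_positions = [i for i, tok in enumerate(tokens) if tok == "<CLS>"]
    return outputs[cls_positions]

tokens = ["<CLS>", "W1", "W2", "<SEP>",
          "<CLS>", "W3", "W4", "W5", "<SEP>",
          "<CLS>", "W6", "W7", "W8", "<SEP>"]
outputs = np.arange(14 * 4).reshape(14, 4).astype(float)  # dummy outputs
reps = clause_representations(outputs, tokens)
```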
The decoder in this application example includes a Pointer Network Decoder, which processes the model output vectors to obtain the sentence ordering information.
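A greedy, untrained sketch of pointer-network-style decoding over the clause representations: at each step it scores the not-yet-selected clauses and points at the highest-scoring one, producing a predicted ordering. The mean-vector initial state and dot-product scoring are illustrative assumptions; a real pointer network uses learned attention.

```python
import numpy as np

def pointer_decode(sentence_reps):
    """Greedily emit a permutation of clause indices by repeatedly
    'pointing' at the highest-scoring remaining clause representation."""
    n, _ = sentence_reps.shape
    query = sentence_reps.mean(axis=0)          # toy initial decoder state
    order, remaining = [], set(range(n))
    for _ in range(n):
        scores = sentence_reps @ query
        for i in range(n):                      # mask clauses already chosen
            if i not in remaining:
                scores[i] = -np.inf
        pick = int(np.argmax(scores))
        order.append(pick)
        remaining.remove(pick)
        query = sentence_reps[pick]             # condition next step on pick
    return order

reps = np.random.default_rng(0).normal(size=(3, 8))
predicted_order = pointer_decode(reps)
```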
The loss and gradient information between the sentence ordering information and the standard ordering information is calculated to adjust the parameters of the feature extractor; during parameter adjustment, when the loss between the sentence ordering information and the standard ordering information reaches a minimum and remains at that minimum for a preset period, the feature extractor is taken as the language model for vector expression;
in an illustrative example, the application example calculates the loss and gradient information between the sentence ordering information and the standard ordering information through a cross-entropy loss function (Cross Entropy), and continually adjusts the parameters of the feature extractor through back propagation to obtain a language model for vector expression; the language model obtained by this embodiment of the invention can be used for the downstream task of automatic text summarization. The application example generates a sentence feature vector for each clause after the clause order has been adjusted; once sentence ordering information is determined from the sentence feature vectors, the parameters of the feature extractor are adjusted according to the standard ordering information and the determined sentence ordering information, yielding a language model for vector expression that accounts for the influence of sentence ordering and inter-sentence information on sentence weight, and providing technical support for improving the quality of extracted text abstracts.
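The cross-entropy between the decoder's per-step distributions over clauses and the standard ordering can be sketched as follows; the step-wise logit matrix is an illustrative assumption standing in for the pointer decoder's scores:

```python
import numpy as np

def ordering_cross_entropy(logits, target_order):
    """Mean cross-entropy between per-step score distributions over clauses
    (rows of `logits`) and the standard ordering (the true original clause
    index for each decoding step)."""
    logits = logits - logits.max(axis=1, keepdims=True)     # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    steps = np.arange(len(target_order))
    return -log_probs[steps, target_order].mean()

logits = np.array([[2.0, 0.1, 0.3],     # step 1 should point at clause 0
                   [0.2, 1.5, 0.1],     # step 2 at clause 1
                   [0.0, 0.3, 2.2]])    # step 3 at clause 2
loss = ordering_cross_entropy(logits, [0, 1, 2])
```

A correct ordering yields a smaller loss than an incorrect one, which is the signal back propagation pushes through the feature extractor.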
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as is well known to those skilled in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

Claims (12)

1. A language model generation method, comprising:
generating, for a training text whose clause order has been adjusted, a sentence feature vector for each clause according to a preset generation strategy;
processing the generated sentence feature vectors through a preset feature extractor to obtain an output vector for each clause;
determining, from the obtained output vectors, sentence ordering information of the clauses after the order adjustment;
adjusting parameters of the feature extractor according to standard ordering information and the determined sentence ordering information, to obtain a language model for vector expression;
wherein the sentence feature vector comprises: character-embedding feature information, feature information for distinguishing adjacent clauses, and feature information identifying the ordering of words within each clause; the standard ordering information comprises: the numbering ordering information of all clauses of the training text before the clause order is adjusted, generated by sequentially numbering each clause; and the sentence ordering information comprises: the numbering ordering information of all clauses of the training text after the clause order is adjusted, generated based on the numbers added to each clause.
2. The language model generation method of claim 1, wherein generating the sentence feature vector of each clause according to the preset generation strategy comprises:
adding a preset start-stop identifier to each clause of the training text whose clause order has been adjusted;
performing embedding processing on each clause to which the start-stop identifier has been added;
obtaining the sentence feature vector of each clause from the embedding result of that clause;
wherein the start-stop identifier comprises: a start identifier and a stop identifier.
3. The language model generation method of claim 2, wherein the embedding processing on each clause to which the start-stop identifier has been added comprises:
performing word embedding on the clauses to which the start-stop identifiers have been added;
performing segment embedding on those clauses according to preset clause-distinguishing identifiers;
performing intra-sentence position embedding on the words of those clauses according to preset intra-sentence word-ordering identifiers;
wherein the clause-distinguishing identifier comprises: an identifier for distinguishing adjacent clauses, the clause-distinguishing identifiers of all words within the same clause being identical; and the intra-sentence word-ordering identifier comprises: an identifier for distinguishing the order of the words within a clause.
4. The language model generation method of claim 3, wherein obtaining the sentence feature vector of each clause from the embedding result comprises:
summing the word-embedding, segment-embedding, and intra-sentence position-embedding results of each clause to obtain the sentence feature vector of that clause.
5. The language model generation method of any one of claims 2 to 4, wherein determining the sentence ordering information of the order-adjusted clauses comprises:
masking the start identifiers in the output vectors of the clauses;
decoding the masked output vectors through a preset decoder to obtain the sentence ordering information;
wherein the decoder comprises: a pointer network decoder.
6. The language model generation method of any one of claims 1 to 4, wherein the parameter adjustment of the feature extractor comprises:
adjusting the parameters of the feature extractor through back propagation, according to the loss and gradient information between the standard ordering information and the determined sentence ordering information;
and, during parameter adjustment, when the loss value between the standard ordering information and the sentence ordering information is judged to reach a minimum and remain unchanged for a preset period, taking the feature extractor as the language model for vector expression.
7. A method for acquiring a text abstract, comprising:
performing vector expression on each clause of a text to be processed according to a pre-generated language model;
calculating the weight of each clause in the text to be processed according to the vector expression of each clause;
and extracting sentences from the text to be processed according to the calculated weight of each clause, to obtain a text abstract.
8. The method of claim 7, wherein the language model comprises a model obtained by:
generating, for a training text whose clause order has been adjusted, a sentence feature vector for each clause according to a preset generation strategy;
processing the generated sentence feature vectors through a preset feature extractor to obtain an output vector for each clause;
determining, from the obtained output vectors, sentence ordering information of the clauses after the order adjustment;
adjusting parameters of the feature extractor according to standard ordering information and the determined sentence ordering information, to obtain a language model for vector expression;
wherein the sentence feature vector comprises: character-embedding feature information, feature information for distinguishing adjacent clauses, and feature information identifying the ordering of words within each clause; the standard ordering information comprises: the numbering ordering information of all clauses of the training text before the clause order is adjusted, generated by sequentially numbering each clause; and the sentence ordering information comprises: the numbering ordering information of all clauses of the training text after the clause order is adjusted, generated based on the numbers added to each clause.
9. A computer storage medium having stored therein a computer program which, when executed by a processor, implements a language model generation method as claimed in any one of claims 1 to 6.
10. A terminal, comprising: a memory and a processor, the memory having a computer program stored therein; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements a language model generation method as recited in any of claims 1-6.
11. A computer storage medium having stored therein a computer program which, when executed by a processor, implements the method for acquiring a text abstract of claim 7 or 8.
12. A terminal, comprising: a memory and a processor, the memory having a computer program stored therein; wherein:
the processor is configured to execute the computer program in the memory;
the computer program, when executed by the processor, implements the method for acquiring a text abstract of claim 7 or 8.
CN202010318584.2A 2020-04-21 2020-04-21 Method for acquiring text abstract and language model generation method Withdrawn CN111581341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010318584.2A CN111581341A (en) 2020-04-21 2020-04-21 Method for acquiring text abstract and language model generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010318584.2A CN111581341A (en) 2020-04-21 2020-04-21 Method for acquiring text abstract and language model generation method

Publications (1)

Publication Number Publication Date
CN111581341A true CN111581341A (en) 2020-08-25

Family

ID=72124503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010318584.2A Withdrawn CN111581341A (en) 2020-04-21 2020-04-21 Method for acquiring text abstract and language model generation method

Country Status (1)

Country Link
CN (1) CN111581341A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113204956B (en) * 2021-07-06 2021-10-08 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN114218923A (en) * 2021-12-20 2022-03-22 北京中科闻歌科技股份有限公司 Text abstract extraction method, device, equipment and storage medium
CN114218923B (en) * 2021-12-20 2022-08-30 北京中科闻歌科技股份有限公司 Text abstract extraction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109918680B (en) Entity identification method and device and computer equipment
US10824874B2 (en) Method and apparatus for processing video
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN112287670A (en) Text error correction method, system, computer device and readable storage medium
CN110163181B (en) Sign language identification method and device
CN110164435A (en) Audio recognition method, device, equipment and computer readable storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111339283A (en) Method and device for providing customer service answers aiming at user questions
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN111428485A (en) Method and device for classifying judicial literature paragraphs, computer equipment and storage medium
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN112131881A (en) Information extraction method and device, electronic equipment and storage medium
CN111581341A (en) Method for acquiring text abstract and language model generation method
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN111160026B (en) Model training method and device, and text processing method and device
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN117668181A (en) Information processing method, device, terminal equipment and storage medium
CN111782789A (en) Intelligent question and answer method and system
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN114637852B (en) Entity relation extraction method, device, equipment and storage medium of medical text
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN115129843A (en) Dialog text abstract extraction method and device
CN115759048A (en) Script text processing method and device
CN116129883A (en) Speech recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200825