CN113505218B - Text extraction method, text extraction system, electronic device and storage device - Google Patents

Text extraction method, text extraction system, electronic device and storage device

Info

Publication number
CN113505218B
CN113505218B
Authority
CN
China
Prior art keywords
processed
text
sentence
sentences
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111042292.1A
Other languages
Chinese (zh)
Other versions
CN113505218A (en)
Inventor
李直旭
郑新
支洪平
王佳安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Suzhou Technology Co Ltd
Original Assignee
Iflytek Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Suzhou Technology Co Ltd filed Critical Iflytek Suzhou Technology Co Ltd
Priority to CN202111042292.1A priority Critical patent/CN113505218B/en
Publication of CN113505218A publication Critical patent/CN113505218A/en
Application granted granted Critical
Publication of CN113505218B publication Critical patent/CN113505218B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text extraction method, a text extraction system, an electronic device, and a storage device. The method includes: encoding a text to be processed based on a self-attention mechanism to generate an encoded first vector, where during the self-attention-based encoding one part of the characters in the text to be processed interacts with all characters in the text while another part interacts with only some of the characters; decoding the first vector to obtain an initial extracted text; and filtering the initial extracted text to obtain a target extracted text. This scheme improves the efficiency with which the self-attention mechanism processes the text to be processed and saves the human resources otherwise needed to extract the target extracted text from it.

Description

Text extraction method, text extraction system, electronic device and storage device
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a text extraction method, a text extraction system, an electronic device, and a storage device.
Background
With the arrival of the information age, people place ever higher demands on information acquisition. Extracting the important content from massive amounts of information requires substantial human effort to produce concise, readable texts. Take the broadcasting of sports events as an example: there are roughly thirty thousand football matches at home and abroad every year, yet fewer than 30% of them receive related news coverage, and a large number of matches have only live commentary text. When that commentary text must be converted into a news text for readers, journalists have to screen and distill it, which consumes considerable human resources and increases cost.
In the prior art, a seq2seq model is used to rewrite the text to be processed in some application scenarios, but the self-attention mechanism of the seq2seq model is unsuitable for inputs of more than 512 characters. Taking sports events as an example, the commentary text of a match usually exceeds 512 characters, so the seq2seq model cannot be applied to scenarios with a large number of characters and is severely limited. In view of this, how to improve the efficiency with which the self-attention mechanism processes the text to be processed, and how to save the human resources spent extracting the target extracted text from it, has become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a text extraction method, a text extraction system, an electronic device, and a storage device that improve the efficiency with which the self-attention mechanism processes a text to be processed and save the human resources needed to extract a target extracted text from it.
In order to solve the above technical problem, a first aspect of the present application provides a text extraction method, including: encoding a text to be processed based on a self-attention mechanism to generate an encoded first vector, where during the self-attention-based encoding one part of the characters in the text to be processed interacts with all characters in the text while another part interacts with only some of the characters; decoding the first vector to obtain an initial extracted text; and filtering the initial extracted text to obtain a target extracted text.
In order to solve the above technical problem, a second aspect of the present application provides a text extraction system, including: an encoding module for encoding a text to be processed based on a self-attention mechanism to generate an encoded first vector, where during the self-attention-based encoding one part of the characters in the text to be processed interacts with all characters in the text while another part interacts with only some of the characters; a decoding module for decoding the first vector to obtain an initial extracted text; and a filtering module for filtering the initial extracted text to obtain a target extracted text.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor, which are coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the text extraction method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a storage device, where the storage device stores program instructions capable of being executed by a processor, and the program instructions are used to implement the text extraction method in the first aspect.
This scheme improves the self-attention mechanism so that, during self-attention-based encoding, one part of the characters in the text to be processed interacts with all characters in the text while the remaining characters interact with only some of them. This reduces the complexity of self-attention-based encoding and improves the efficiency with which the self-attention mechanism processes the text to be processed, allowing the text to be encoded into a first vector. The first vector is then decoded to obtain an initial extracted text, which is filtered to make it more fluent, yielding the target extracted text. The readability of the target text is thereby improved, and the human resources needed to extract it from the text to be processed are saved.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort. Wherein:
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a text extraction method of the present application;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a text extraction method according to the present application;
FIG. 3a is a schematic diagram of an application scenario of seq2seq model in a self-attention mechanism operation;
FIG. 3b is a schematic diagram of an application scenario of the improved model self-attention mechanism operation of the present application;
FIG. 4 is a schematic flow chart illustrating an embodiment of obtaining training text for training a first text classification model according to the present application;
FIG. 5 is a block diagram of an embodiment of a text extraction system;
FIG. 6 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a block diagram of an embodiment of a memory device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
It should be noted that although the seq2seq model has strong learning ability, its self-attention mechanism is too complex to be applied to texts of more than 512 characters. The present application therefore improves on the seq2seq model to obtain an improved model, which can be used to implement the method in any embodiment of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a text extraction method according to the present application. Specifically, the method may include the steps of:
Step S11: encoding the text to be processed based on a self-attention mechanism to generate an encoded first vector, where during the self-attention-based encoding one part of the characters in the text to be processed interacts with all characters in the text while another part interacts with only some of the characters.
Specifically, the text to be processed is encoded such that, in the improved model's self-attention mechanism, only some characters of the text interact with all characters while the remaining characters interact with only some characters. This greatly reduces the processing complexity of the self-attention mechanism, allows the improved model to handle application scenarios whose inputs far exceed 512 characters, and broadens the scenarios to which the improved model applies.
In one implementation scenario, a binary classification model is trained in advance to judge the importance of sentences in the text to be processed. The text is input into the binary classification model, which outputs first sentences to be processed, composed of the sentences of relatively high importance, and second sentences to be processed, of relatively low importance. The text is then fed into the encoder, where its characters are converted into vectors of a preset dimension. During the self-attention-based encoding, each character in the first sentences interacts with all characters in the text, while each character in the second sentences interacts only with characters within a preset nearby range. After a feed-forward neural network, the encoder outputs the first vector corresponding to the text to be processed.
In a specific application scenario, the text to be processed is the commentary text of a sports event. The binary classification model is trained on commentary texts that have corresponding news texts: commentary sentences that closely match the news text serve as first training data of relatively high importance, and the remaining commentary sentences serve as second training data of relatively low importance, so that the model learns to classify commentary text. The trained model then classifies the text to be processed into first sentences to be processed, of relatively high importance, and second sentences to be processed, of relatively low importance. When the text is encoded with self-attention, each character in the first sentences interacts with all characters in the text, each character in the second sentences interacts with only some characters, and the encoder outputs the first vector corresponding to the text after encoding.
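As a concrete illustration of this interaction pattern, the following is a minimal sketch of a sparse attention mask in PyTorch; the patent publishes no implementation, so the `char_is_global` flags (marking the characters of the first sentences to be processed) and the window size are assumptions:

```python
import torch

def build_sparse_attention_mask(char_is_global: torch.Tensor, window: int = 5) -> torch.Tensor:
    """Boolean [n, n] mask: True where character i may attend to character j.

    char_is_global[i] is True when character i belongs to a first (important)
    sentence; such characters interact with all characters. All other
    characters interact only with neighbors within `window` positions.
    """
    n = char_is_global.numel()
    idx = torch.arange(n)
    # Local band: |i - j| <= window for every character.
    mask = (idx[:, None] - idx[None, :]).abs() <= window
    # Characters of important sentences attend everywhere, and every
    # character attends to them (the grey rows and columns of fig. 3b).
    mask |= char_is_global[:, None]
    mask |= char_is_global[None, :]
    return mask

# Toy usage: a 12-character text whose characters 3..5 form an important sentence.
flags = torch.zeros(12, dtype=torch.bool)
flags[3:6] = True
print(build_sparse_attention_mask(flags, window=2).int())
```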
Step S12: the first vector is decoded to obtain an initial extracted text.
Specifically, the first vector is decoded to obtain an initial extracted text corresponding to the first vector.
In one implementation scenario, the first vector is decoded by a decoder trained in advance to decode the first vectors produced by the encoder, converting the first vector into the initial extracted text. The encoder and decoder are trained together: encoding condenses the text to be processed into the first vector, and the decoder decodes the first vector into the initial extracted text.
Step S13: and filtering the initial extracted text to obtain a target extracted text.
Specifically, some sentences of the initial extracted text are filtered: disfluent sentences are discarded and sentences with grammatical errors are corrected, yielding a target extracted text that reads smoothly and whose readability is improved.
In one implementation scenario, a filter is trained in advance on training data so that it can find and discard disfluent sentences and can find and correct grammatically incorrect sentences based on grammar rules. The trained filter then filters the initial extracted text to obtain the target extracted text.
In a specific application scenario, the text to be processed is the commentary text of a sports event, and the filter is trained from existing news texts. Add, delete, and modify operations are applied to a news text, adding some sentences, deleting some sentences, and modifying some characters, to obtain training data for the filter. The filter is optimized based on the difference between its output and the original news text, and the trained filter is then used to filter the initial extracted text.
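As a hedged illustration of this training-data construction, the sketch below applies the three named operations (add, delete, modify) to a clean news text; the operation rates and the pool of irrelevant sentences are assumptions:

```python
import random

def corrupt(sentences: list[str], noise_pool: list[str], p: float = 0.2) -> list[str]:
    """Perturb a clean news text so that (corrupted, clean) pairs can train
    the filter. Implements the three operations named above: add a sentence,
    delete a sentence, modify characters within a sentence."""
    out = []
    for s in sentences:
        r = random.random()
        if r < p and noise_pool:              # add: splice in an extra sentence
            out.append(random.choice(noise_pool))
            out.append(s)
        elif r < 2 * p:                       # delete: drop this sentence
            continue
        elif r < 3 * p and len(s) > 1:        # modify: replace one character
            i, j = random.randrange(len(s)), random.randrange(len(s))
            out.append(s[:i] + s[j] + s[i + 1:])
        else:                                 # keep unchanged
            out.append(s)
    return out
```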
This scheme improves the self-attention mechanism so that, during self-attention-based encoding, one part of the characters in the text to be processed interacts with all characters in the text while the remaining characters interact with only some of them. This reduces the complexity of self-attention-based encoding and improves the efficiency with which the self-attention mechanism processes the text to be processed, allowing the text to be encoded into a first vector. The first vector is then decoded to obtain an initial extracted text, which is filtered to make it more fluent, yielding the target extracted text. The readability of the target text is thereby improved, and the human resources needed to extract it from the text to be processed are saved.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating another embodiment of the text extraction method of the present application. Specifically, the method may include the steps of:
s21: and grading the sentences to be processed in the text to be processed, classifying the sentences to be processed into a first sentence to be processed and a second sentence to be processed based on the graded values, and coding the sentences to be processed in the text to be processed based on a self-attention mechanism.
Specifically, each sentence to be processed is scored according to its importance, and the sentences are classified into first and second sentences to be processed based on the scores; the sentences of the text are then encoded with self-attention. During this encoding, each character in the first sentences to be processed interacts with all characters in the text, and each character in the second sentences to be processed interacts with characters within a preset nearby range. The preset range around each character may be a preset number of characters before and after it; the preset number may be any integer from 5 to 10, which the present application does not specifically limit.
Please refer to fig. 3a and fig. 3b, where fig. 3a is a schematic view of the seq2seq model's self-attention operation and fig. 3b is a schematic view of the improved model's self-attention operation. As the grey squares in fig. 3a show, in the self-attention mechanism of the seq2seq model every input character must interact with all characters, so when the text to be processed contains n characters the processing complexity of the seq2seq self-attention is n². As the grey squares in fig. 3b show, in the self-attention mechanism of the improved model only the characters of the first sentences to be processed interact with all characters in the text (the horizontal and vertical grey squares in fig. 3b), while each character of the second sentences to be processed interacts only with the characters in its preset nearby range (the diagonal grey squares in fig. 3b). Compared with the seq2seq self-attention, the improved model therefore leaves many blank squares, and the processing complexity of the improved self-attention mechanism is much lower than that of the seq2seq model, so the processing efficiency is higher and the performance requirements on the CPU and GPU are lower.
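A back-of-the-envelope comparison of the two interaction counts, assuming g characters belong to first sentences to be processed and each remaining character keeps a local window of w characters on either side (overlapping cells are not deduplicated, so the sparse count is a slight overestimate):

```python
def interaction_counts(n: int, g: int, w: int) -> tuple[int, int]:
    """Full self-attention vs. the sparse pattern of fig. 3b (estimate)."""
    full = n * n
    # Global characters attend everywhere and are attended to by everyone
    # (2*g*n - g*g cells); the other n - g characters keep a local band
    # of width 2*w + 1 around the diagonal.
    sparse = 2 * g * n - g * g + (n - g) * (2 * w + 1)
    return full, sparse

# E.g. a 3000-character commentary, 100 "important" characters, window 7:
print(interaction_counts(3000, 100, 7))  # (9000000, 633500)
```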
In one implementation scenario, a pre-trained first text classification model scores the sentences to be processed in the text, where the score is positively correlated with the importance of the sentence.
Specifically, the first text classification model is trained in advance to score the sentences to be processed according to their importance, so that the score a sentence receives from the model is positively correlated with its importance: the more important the sentence, the larger its score. The importance of each sentence can therefore be read off the model's output, which makes it convenient to select the first sentences to be processed that will interact with all characters.
Further, the step of classifying the sentences into first and second sentences to be processed based on their scores includes: sorting the sentences by score, selecting a predetermined number of the top-ranked sentences as first sentences to be processed, and taking the remaining sentences as second sentences to be processed.
Specifically, after the sentences' scores are obtained, the sentences are sorted in descending order of score; a predetermined number of the top-ranked sentences become the first sentences to be processed, and the rest become the second sentences to be processed. In this way all sentences of the text are classified into the two groups, with the first sentences to be processed corresponding to the sentences of high importance. Only the characters of the first sentences interact with all characters of the text, so on top of reducing the processing complexity of the self-attention mechanism, the characters of high importance still interact with all characters as far as possible. This makes the character interaction more reasonable and allows a more accurate text to be extracted later.
In a specific application scenario, after the sentences' scores are obtained, the sentences are sorted in descending order of score and the top three are selected as the first sentences to be processed.
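A minimal sketch of this score-and-select step; `score` is a hypothetical callable standing in for the first text classification model:

```python
from typing import Callable

def split_sentences(sentences: list[str],
                    score: Callable[[str], float],
                    top_k: int = 3) -> tuple[list[str], list[str]]:
    """Rank sentences by importance score (descending); the top_k become the
    first sentences to be processed, the rest the second."""
    order = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                   reverse=True)
    chosen = set(order[:top_k])
    first = [sentences[i] for i in order[:top_k]]
    second = [sentences[i] for i in range(len(sentences)) if i not in chosen]
    return first, second
```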
Optionally, before the step of scoring the sentences to be processed, the method further includes: training the first text classification model with training text, where the training sentences in the training text are classified into first training sentences of relatively high importance and second training sentences of relatively low importance.
Specifically, the first text classification model is trained with training text whose training sentences are classified into first training sentences of relatively high importance and second training sentences of relatively low importance, yielding a trained first text classification model. The model is thus able to distinguish sentences of relatively high importance from those of relatively low importance and to assign scores matching their importance.
Further, each training text corresponds to a target text, and the importance of the first and second training sentences is positively correlated with their similarity to the target sentences in the target text. The target text corresponding to a training text is the text formed by extracting and refining sentences from the much larger training text, so the more similar a training sentence is to a target sentence, the more important it is. The trained first text classification model can therefore classify a text to be processed into first and second sentences to be processed such that the first sentences are the ones most similar to the text one ultimately expects to extract, improving the reasonableness of the first sentences and hence of the target extracted text.
In a specific application scenario, the training text is the commentary text of a sports event and the corresponding target text is the news text of that event; the importance of the first and second training sentences is positively correlated with their similarity to the sentences of the news text. The trained model can then classify an input commentary text and produce first sentences to be processed that are highly similar to the news text one ultimately expects.
Further, the training process of the first text classification model includes: setting initial scores for the first and second training sentences in the training text; classifying the training text with the first text classification model to obtain first predicted sentences corresponding to the first training sentences and second predicted sentences corresponding to the second training sentences; and optimizing the model based on the differences between the first training sentences (and their scores) and the first predicted sentences, and between the second training sentences (and their scores) and the second predicted sentences, to obtain the trained first text classification model.
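A hedged sketch of one such training step; the patent does not specify the model architecture or the loss, so the sentence scorer and the binary cross-entropy objective below are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Hypothetical stand-in for the first text classification model:
    maps a pooled sentence embedding to an importance score in (0, 1)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(sent_emb)).squeeze(-1)

model = SentenceScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

def train_step(sent_emb: torch.Tensor, labels: torch.Tensor) -> float:
    # labels: 1.0 for first (important) training sentences, 0.0 for second.
    opt.zero_grad()
    loss = loss_fn(model(sent_emb), labels)
    loss.backward()
    opt.step()
    return loss.item()
```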
In an implementation scenario, please refer to fig. 4, where fig. 4 is a schematic flowchart of an embodiment of obtaining a training text for training a first text classification model according to the present application, and before the step of training the first text classification model by using the training text, the method further includes:
s41: the target text is divided into a plurality of target sentences based on separators in the target text.
Specifically, in the target text, the target text is divided into a plurality of target sentences according to separators for comparison with the sentences in the training text.
In a specific application scenario, the target text is a news text, and the news text is divided into a plurality of target sentences according to punctuations in the news text.
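A minimal sketch of this split, assuming sentence-final Chinese punctuation marks as the separators:

```python
import re

def split_into_sentences(target_text: str) -> list[str]:
    """Split a target (news) text into target sentences on sentence-final
    punctuation; the exact separator set is an assumption."""
    parts = re.split(r"(?<=[。！？；])", target_text)
    return [p.strip() for p in parts if p.strip()]

print(split_into_sentences("托雷斯进球。西班牙1-0小胜！"))
```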
S42: and acquiring the similarity between each target sentence and all the training sentences in the training text.
Specifically, each target sentence is compared with all the training sentences in the training text to obtain the similarity between every training sentence and every target sentence.
Optionally, the step of obtaining the similarity between each target sentence and all the training sentences in the training text includes: obtaining the semantic similarity and the character similarity between the target sentence and the training sentence, and performing a weighted summation of the two.
Specifically, the semantic similarity between the target and training sentences at the level of deep semantics, and the character similarity at the level of surface characters, are obtained; a first weight and a second weight are set for them respectively, and the two are summed with these weights to obtain the similarity between each training sentence and each target sentence. The process is formulated as follows:
S(x, y) = λ · BERTScore(x, y) + (1 − λ) · ROUGE(x, y)    (1)
where x denotes a target sentence, y a training sentence, λ the first weight and 1 − λ the second weight; BERTScore is the semantic similarity computed over deep semantics after the sentences are converted into word vectors, and ROUGE is the character similarity computed over the sentences' surface characters. Considering both the deep semantic similarity and the surface character similarity yields a similarity assessed from multiple angles at multiple levels, making the similarity between target and training sentences more accurate.
Further, the first weight corresponding to the semantic similarity is greater than the second weight corresponding to the character similarity, and the sum of the first weight and the second weight is 1.
Since the first weight corresponding to the semantic similarity is greater than the second weight corresponding to the character similarity, the semantic similarity serves as the higher-weighted reference when computing the similarity between a target sentence and a training sentence, improving the accuracy of the similarity. Taking a football match as an example, suppose the training sentence is "Torres strikes the ball, and Spain takes the lead" and the target sentence is "Torres scores the only goal of the match with a placed shot, and Spain edges a 1-0 win". The only surface characters the two sentences share are "Torres" and "Spain", and those words also appear frequently in other training sentences, so many training sentences could show high character similarity to this target sentence. At the level of deep semantics, however, "strikes the ball" in the training sentence and "placed shot" in the target sentence both describe a player driving the ball at goal, so deep semantics capture the similarity between the two sentences far better. The similarity obtained is therefore more accurate when the first weight is greater than the second weight and the two sum to 1.
In one specific application scenario, the semantic similarity and the character similarity both take values between 0 and 1, the first weight is 0.7, and the second weight is 0.3. In other specific application scenarios, the first weight may be any value between 0.5 and 1 and the second weight any value between 0 and 0.5, with the two weights summing to 1.
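A minimal sketch of equation (1); the set-based character overlap below is a simplified stand-in for ROUGE, and `bert_score` is assumed to be a callable returning a semantic similarity in [0, 1]:

```python
def rouge_like(x: str, y: str) -> float:
    """Simplified surface-character overlap F1 (a stand-in for ROUGE)."""
    cx, cy = set(x), set(y)
    if not cx or not cy or not (cx & cy):
        return 0.0
    p, r = len(cx & cy) / len(cy), len(cx & cy) / len(cx)
    return 2 * p * r / (p + r)

def similarity(x: str, y: str, bert_score, lam: float = 0.7) -> float:
    """Equation (1): S(x, y) = lam * BERTScore(x, y) + (1 - lam) * ROUGE(x, y),
    with lam = 0.7 as in the application scenario above."""
    return lam * bert_score(x, y) + (1 - lam) * rouge_like(x, y)
```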
S43: the training sentences are classified into first training sentences and second training sentences based on the similarity.
Specifically, after the similarity between each target sentence and all training sentences is obtained, the training sentences are ranked by similarity; for each target sentence, a preset number of the most similar commentary sentences form the first training sentences, and the remaining training sentences form the second training sentences. Sentences highly similar to a target sentence are treated as highly important and dissimilar ones as unimportant, so the importance of the first and second training sentences is positively correlated with their similarity to the target sentences of the target text.
S22: the first vector is decoded to obtain an initial extracted text.
Specifically, after the text to be processed is encoded, the first vector is decoded by a decoder and converted into the initial extracted text. The decoder may be a multi-layer Transformer decoder whose number of layers equals that of the encoder.
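A minimal sketch of such a decoder using PyTorch's built-in Transformer modules; the layer count, width, and head count are illustrative:

```python
import torch
import torch.nn as nn

d_model, num_layers = 512, 6  # decoder depth mirrors the encoder's

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=num_layers,
)

# memory: the encoded first vector(s); tgt: embeddings generated so far.
memory = torch.randn(1, 300, d_model)   # [batch, src_len, d_model]
tgt = torch.randn(1, 40, d_model)       # [batch, tgt_len, d_model]
out = decoder(tgt=tgt, memory=memory)   # [1, 40, 512], fed to a vocab head
```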
S23: and filtering the initial extracted text to obtain a target extracted text.
Specifically, a second text classification model trained in advance filters the initial extracted text based on the fluency of its sentences, yielding the target extracted text.
In one implementation scenario, the target sentences used when training the first text classification model are obtained, and add, delete, and modify operations are applied to their sentences and/or characters to produce training samples. The samples are filtered with the second text classification model to obtain third predicted sentences, and the model is optimized based on the difference between the target sentences and the third predicted sentences, yielding a trained second text classification model. When the model receives the decoded initial extracted text, if that text contains redundant characters, omitted characters, or disfluent sentences, the model can locate and correct sentences with grammatical and/or character errors based on sentence fluency, producing the final target extracted text. This improves the fluency and readability of the target extracted text, and the whole pipeline from the input text to be processed to the output target extracted text runs automatically, saving the human resources otherwise needed to extract the target text.
In this embodiment, a pre-trained first text classification model scores the text to be processed, and the sentences are classified into first and second sentences to be processed based on the scores. The training text of the first text classification model corresponds to a target text, and the first and second training sentences are distinguished by the similarity between the target sentences of the target text and the training sentences, so that the importance of the first and second training sentences is positively correlated with that similarity. Consequently, the first sentences obtained when a text to be processed passes through the model are of high importance, and only their characters interact with all characters of the text. On top of reducing the processing complexity of the self-attention mechanism, the characters of high importance still interact with all characters as far as possible, which makes the character interaction more reasonable and allows a more accurate target extracted text to be extracted subsequently.
Referring to fig. 5, fig. 5 is a schematic diagram of the framework of an embodiment of the text extraction system of the present application. The text extraction system 50 includes an encoding module 51, a decoding module 52, and a filtering module 53. The encoding module 51 encodes the text to be processed based on a self-attention mechanism to generate an encoded first vector; during this encoding, one part of the characters in the text interacts with all characters while another part interacts with only some of the characters. The decoding module 52 decodes the first vector to obtain an initial extracted text, and the filtering module 53 filters the initial extracted text to obtain a target extracted text.
In the above scheme, the encoding module 51 improves the self-attention mechanism so that, during self-attention-based encoding, one part of the characters in the text to be processed interacts with all characters while the remaining characters interact with only some of them. This reduces the complexity of self-attention-based encoding and improves the efficiency with which the mechanism processes the text, allowing the text to be encoded into a first vector. The decoding module 52 decodes the first vector into an initial extracted text, and the filtering module 53 filters it to make it more fluent, yielding the target extracted text. The readability of the target text is thereby improved, and the human resources needed to extract it from the text to be processed are saved.
In some embodiments, the encoding module 51 is further configured to score the sentences to be processed in the text, classify them into first and second sentences to be processed based on the scores, and encode the sentences with self-attention; during this encoding, each character in the first sentences interacts with all characters of the text, and each character in the second sentences interacts with characters within a preset nearby range.
Unlike the foregoing embodiment, the first and second sentences to be processed are obtained from the scores: only the characters of the first sentences interact with all characters, while those of the second sentences interact only with characters in the preset range. The processing complexity of the improved self-attention mechanism is therefore far lower than that of the seq2seq model, so processing is faster and the performance demands on the CPU and GPU are lower.
In some embodiments, the encoding module 51 is further configured to score the sentences to be processed in the text using the pre-trained first text classification model, where the score is positively correlated with the importance of the sentence.
Unlike the foregoing embodiment, the importance of each sentence to be processed can be read off the output of the first text classification model, which makes it convenient to select the first sentences to be processed that interact with all characters.
In some embodiments, the encoding module 51 is further configured to sort the sentences to be processed according to the scores, select a predetermined number of the top-ranked sentences as the first sentences to be processed, and take the remaining sentences as the second sentences to be processed.
Unlike the foregoing embodiment, the first sentences to be processed correspond to the sentences of high importance, and only their characters interact with all characters of the text. On top of reducing the processing complexity of self-attention, characters of high importance still interact with all characters as far as possible, making character interaction more reasonable and enabling a more accurate text to be extracted later.
In some embodiments, the filtering module 53 is further configured to filter the initially extracted text based on the fluency of the sentence using a second text classification model trained in advance.
In contrast to the foregoing embodiment, the second text classification model can find disfluent sentences in the initial extracted text based on sentence fluency and discard them, and can find and correct sentences with grammatical and/or character errors, yielding the final target extracted text and improving its fluency and readability.
Referring to fig. 6, fig. 6 is a schematic frame diagram of an embodiment of an electronic device according to the present application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the memory 61 stores program instructions, and the processor 62 is configured to execute the program instructions to implement the steps in any of the above-described embodiments of the text extraction method. For details, please refer to the method in any of the above embodiments, which is not described herein again.
It should be noted that the processor 62 is configured to control itself and the memory 61 to implement the steps in any of the above embodiments of the text extraction method. The processor 62 may also be referred to as a CPU (central processing unit). The processor 62 may be an integrated circuit chip with signal-processing capability. The processor 62 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be jointly implemented by a plurality of integrated circuit chips.
This scheme improves the self-attention mechanism so that, during self-attention-based encoding, one part of the characters in the text to be processed interacts with all characters in the text while the remaining characters interact with only some of them. This reduces the complexity of self-attention-based encoding and improves the efficiency with which the self-attention mechanism processes the text to be processed, allowing the text to be encoded into a first vector. The first vector is then decoded to obtain an initial extracted text, which is filtered to make it more fluent, yielding the target extracted text. The readability of the target text is thereby improved, and the human resources needed to extract it from the text to be processed are saved.
Referring to fig. 7, fig. 7 is a schematic diagram of a memory device according to an embodiment of the present application. The memory device 70 stores program instructions 700 capable of being executed by the processor, the program instructions 700 being for implementing the steps in any of the above-described embodiments of the text extraction method.
With the above scheme, the efficiency with which the self-attention mechanism processes the text to be processed can be improved, and the human resources for extracting the target extracted text from the text to be processed are saved.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present application that contributes beyond the prior art, or all or part of the technical solution, may be embodied in a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (9)

1. A method of text extraction, the method comprising:
performing self-attention mechanism-based encoding on a text to be processed to generate an encoded first vector, wherein in the self-attention mechanism-based encoding process, a first sentence to be processed in the text to be processed interacts with all characters in the text to be processed, and a second sentence to be processed in the text to be processed interacts with only part of characters in the text to be processed;
decoding the first vector to obtain an initial extracted text;
filtering the initial extracted text to obtain a target extracted text;
the step of encoding the text to be processed based on the self-attention mechanism comprises the following steps:
scoring the sentences to be processed in the text to be processed, classifying the sentences to be processed into the first sentence to be processed and the second sentence to be processed based on the scores, and encoding the sentences to be processed in the text to be processed based on the self-attention mechanism;
in the encoding process based on the self-attention mechanism, each character in the first sentence to be processed interacts with all characters in the text to be processed, and each character in the second sentence to be processed interacts with characters within a preset range.
2. The method of claim 1, wherein the step of scoring the sentence in the text to be processed comprises:
and scoring the sentences to be processed in the texts to be processed by utilizing a pre-trained first text classification model, wherein the score of the scoring is positively correlated with the importance degree of the sentences to be processed.
3. The method according to claim 2, wherein the step of classifying the sentence to be processed into a first sentence to be processed and a second sentence to be processed based on the score of the score comprises:
sorting the sentences to be processed according to the scores, selecting a predetermined number of the top-ranked sentences to be processed as the first sentence to be processed, and taking the remaining sentences to be processed as the second sentence to be processed.
4. The method of claim 2, wherein the step of scoring the sentence in the text to be processed is preceded by the step of:
training the first text classification model by using a training text, wherein training sentences in the training text are classified into first training sentences with relatively high importance degree and second training sentences with relatively low importance degree;
the training texts correspond to target texts, and the importance degrees of the first training sentences and the second training sentences positively correlate to the similarity between the first training sentences and the target sentences in the target texts.
5. The method of claim 4, wherein the step of training the first text classification model using training text is preceded by the step of:
dividing the target text into a plurality of target sentences based on separators in the target text;
acquiring the similarity between each target sentence and all training sentences in the training text;
classifying the training sentence into a first training sentence and a second training sentence based on the similarity.
6. The method of claim 5, wherein the step of obtaining the similarity between each target sentence and all training sentences in the training text comprises:
obtaining semantic similarity between the target sentence and the training sentence and character similarity between the target sentence and the training sentence;
and carrying out weighted summation on the semantic similarity and the character similarity.
7. A text extraction system, the system comprising:
the encoding module is used for encoding a text to be processed based on a self-attention mechanism so as to generate an encoded first vector, and in the encoding process based on the self-attention mechanism, a first sentence to be processed in the text to be processed interacts with all characters in the text to be processed, and a second sentence to be processed in the text to be processed interacts with only part of characters in the text to be processed;
the decoding module is used for decoding the first vector to obtain an initial extracted text;
the filtering module is used for filtering the initial extracted text to obtain a target extracted text;
wherein, the step of encoding the text to be processed based on the self-attention mechanism comprises the following steps:
scoring the sentences to be processed in the text to be processed, classifying the sentences to be processed into the first sentence to be processed and the second sentence to be processed based on the scores, and encoding the sentences to be processed in the text to be processed based on the self-attention mechanism;
in the encoding process based on the self-attention mechanism, each character in the first sentence to be processed interacts with all characters in the text to be processed, and each character in the second sentence to be processed interacts with characters within a preset range.
8. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the text extraction method of any one of claims 1 to 6.
9. A storage device storing program instructions executable by a processor to implement the text extraction method of any one of claims 1 to 6.
CN202111042292.1A 2021-09-07 2021-09-07 Text extraction method, text extraction system, electronic device and storage device Active CN113505218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111042292.1A CN113505218B (en) 2021-09-07 2021-09-07 Text extraction method, text extraction system, electronic device and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111042292.1A CN113505218B (en) 2021-09-07 2021-09-07 Text extraction method, text extraction system, electronic device and storage device

Publications (2)

Publication Number Publication Date
CN113505218A CN113505218A (en) 2021-10-15
CN113505218B true CN113505218B (en) 2021-12-21

Family

ID=78016831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111042292.1A Active CN113505218B (en) 2021-09-07 2021-09-07 Text extraction method, text extraction system, electronic device and storage device

Country Status (1)

Country Link
CN (1) CN113505218B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114742035B (en) * 2022-05-19 2023-07-07 北京百度网讯科技有限公司 Text processing method and network model training method based on attention mechanism optimization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825845A (en) * 2019-10-23 2020-02-21 中南大学 Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN111611346A (en) * 2020-05-09 2020-09-01 迟殿委 Text matching method and device based on dynamic semantic coding and double attention

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232183B (en) * 2018-12-07 2022-05-27 腾讯科技(深圳)有限公司 Keyword extraction model training method, keyword extraction device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825845A (en) * 2019-10-23 2020-02-21 中南大学 Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN111611346A (en) * 2020-05-09 2020-09-01 迟殿委 Text matching method and device based on dynamic semantic coding and double attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Abstractive social media text summarization using selective reinforced Seq2Seq attention model; Zeyu Liang et al.; Neurocomputing; 2020-05-12; pp. 432-44 *
Self-Attention Encoding and Pooling for Speaker Recognition; Pooyan Safari et al.; arXiv; 2020-08-03; pp. 1-5 *
Reinforced automatic summarization model with convolutional self-attention encoding and filtering; Xu Ruyang et al.; Journal of Chinese Computer Systems; 2020-02-28; Vol. 41, No. 2; pp. 271-277 *

Also Published As

Publication number Publication date
CN113505218A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
CN106328147B (en) Speech recognition method and device
CN108287858B (en) Semantic extraction method and device for natural language
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
WO2021179701A1 (en) Multilingual speech recognition method and apparatus, and electronic device
CN110008309B (en) Phrase mining method and device
WO2020199595A1 (en) Long text classification method and device employing bag-of-words model, computer apparatus, and storage medium
CN113766314B (en) Video segmentation method, device, equipment, system and storage medium
KR20150037924A (en) Information classification based on product recognition
CN111581374A (en) Text abstract obtaining method and device and electronic equipment
US20220300708A1 (en) Method and device for presenting prompt information and storage medium
EP3757874A1 (en) Action recognition method and apparatus
CN109325109A (en) Attention encoder-based extraction type news abstract generating device
WO2016095645A1 (en) Stroke input method, device and system
CN110019776A (en) Article classification method and device, storage medium
CN113505218B (en) Text extraction method, text extraction system, electronic device and storage device
CN114780672A (en) Medical question and answer processing method and device based on network resources
CN104199813B (en) Pseudo-feedback-based personalized machine translation system and method
CN116320607A (en) Intelligent video generation method, device, equipment and medium
KR101944274B1 (en) Appratus and method for classfying situation based on text
CN114598933A (en) Video content processing method, system, terminal and storage medium
CN116245102B (en) Multi-mode emotion recognition method based on multi-head attention and graph neural network
CN115495578A (en) Text pre-training model backdoor elimination method, system and medium based on maximum entropy loss
EP4113383A1 (en) Method and device for presenting prompt information, and storage medium
CN114818644B (en) Text template generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant