WO2023029354A1

WO2023029354A1 - Text information extraction method and apparatus, and storage medium and computer device

Info

Publication number: WO2023029354A1
Application number: PCT/CN2022/071444
Authority: WO
Inventors: 谯轶轩; 陈浩
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-08-30
Filing date: 2022-01-11
Publication date: 2023-03-09
Also published as: CN113722436A

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed are a text information extraction method and apparatus, and a storage medium and a computer device, which can improve the accuracy of text information extraction. The method comprises: performing sentence recognition on a text paragraph to be subjected to extraction, so as to obtain an initial word vector group of said text paragraph; predicting, by using a pre-trained text extraction network model, a plurality of start word vectors to be determined, which are used for representing a text extraction start position, in the initial word vector group; according to the plurality of said start word vectors and the initial word vector group, predicting a plurality of end word vectors to be determined which correspond to said start word vectors in the initial word vector group; and according to the plurality of said start word vectors obtained by means of prediction, and the plurality of said end word vectors corresponding to said start word vectors, determining target text to be extracted. The present application is applicable to the extraction of target text in a data set.

Description

Text information extraction method, device, storage medium and computer equipment

This application claims priority to Chinese patent application No. 202111007458.6, entitled "text information extraction method, device, computer equipment and storage medium" and filed with the China Patent Office on August 30, 2021, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the technical field of artificial intelligence, in particular to a text information extraction method, device, storage medium and computer equipment.

Background art

Text information extraction is a technology for extracting specific information from text data. With the development of artificial intelligence and related disciplines, it is moving toward digitization, intelligence, and semantic understanding, and it plays an increasingly important role in social knowledge management. Widely used text information extraction methods currently include: regular rule methods, which extract text using regular expressions or manually written filtering or matching rules; methods that set up an extraction task for a named entity recognition (NER) model; and other mainstream methods that predict individual words in the text.

The inventor has realized that, in the prior art, the regular rule method depends on manually written rules and cannot extract text information completely when faced with complex sentence contexts or semantically incomplete text; NER model recognition is prone to overfitting, so extraction accuracy drops significantly on text containing new corpus information; and the other mainstream methods extract isolated words from the text, resulting in low accuracy of text information extraction.

Summary of the invention

In view of this, the present application provides a text information extraction method, device, storage medium and computer equipment.

According to one aspect of the present application, a method for extracting text information is provided, the method comprising:

Performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;

Using a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors representing a text extraction start position;

Predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

Determining the target text to be extracted according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

According to another aspect of the present application, a text information extraction device is provided, the device comprising:

A sentence recognition module, configured to perform sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;

A first position prediction module, configured to use a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors representing a text extraction start position;

A second position prediction module, configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

A determination module, configured to determine the target text to be extracted according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

According to another aspect of the present application, a storage medium is provided, on which computer-readable instructions are stored; when the instructions are executed by a processor, the above text information extraction method is implemented.

According to still another aspect of the present application, a computer device is provided, including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor; when the processor executes the instructions, the above text information extraction method is implemented.

By means of the above technical solution, the accuracy of text information extraction is effectively improved.

Brief description of the drawings

The drawings described here provide a further understanding of the application and constitute a part of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

FIG. 1 shows a schematic flow chart of a method for extracting text information provided by an embodiment of the present application;

FIG. 2 shows a schematic flow diagram of another text information extraction method provided by the embodiment of the present application;

FIG. 3 shows a schematic diagram of the text extraction network model architecture in the training phase provided by the embodiment of the present application;

FIG. 4 shows a schematic structural diagram of a text information extraction device provided by an embodiment of the present application;

FIG. 5 shows a schematic structural diagram of another apparatus for extracting text information provided by an embodiment of the present application.

Detailed description

Hereinafter, the present application will be described in detail with reference to the drawings and embodiments. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.

Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The prior-art regular rule method, NER model recognition method, and other mainstream methods all suffer from low accuracy of text information extraction. Take the regular rule method in the context of data set references as an example: citations of data sets in other texts frequently contain terms such as "Survey", "Data", "Study", "Database", and "Statistics", and the words used begin with a capital letter. The regular rule method extracts text information by filtering the matched reference information, but it is too simple: its extraction performance depends on the manually specified rules, and its extraction results are relatively poor. Based on this, this embodiment provides a method for extracting text information. As shown in FIG. 1, the method is described as applied to a computer device such as a server. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms, for example an intelligent medical system or a digital medical platform. The method comprises the following steps:

Step S101. Obtain an initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.

In this embodiment, to make it easier for the text extraction network model to process text information, word segmentation is performed on the text paragraph to be extracted, the segmented text paragraph is divided according to a preset sequence length to obtain one or more initial data sequences containing complete sentences, and word vector conversion is performed on the initial data sequences to obtain the initial word vector group. Specifically, the text paragraph is divided in units of sentences, and sequences shorter than the preset sequence length are padded to that length.

In an exemplary embodiment of the present application, dividing the segmented text paragraph into sequences of 512 words enhances the ability of the text extraction network model to extract long text. Further, dividing the text paragraph in units of sentences prevents a complete sentence from being split across different data sequences during division, which would reduce the accuracy with which the text extraction network model extracts contextual semantics.
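The sentence-unit division described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_into_sequences`, the padding token `[PAD]`, and the choice to raise an error for a single sentence longer than the sequence length are all assumptions, since the description does not specify them.

```python
from typing import List

PAD = "[PAD]"  # hypothetical padding token; the description does not name one


def split_into_sequences(sentences: List[List[str]],
                         seq_len: int = 512) -> List[List[str]]:
    """Pack whole sentences into sequences of at most seq_len words, then pad."""
    sequences: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        if len(sentence) > seq_len:
            # Assumption: the source does not say how to treat such a sentence.
            raise ValueError("a single sentence exceeds the sequence length")
        # Start a new sequence instead of splitting a sentence across two.
        if len(current) + len(sentence) > seq_len:
            sequences.append(current)
            current = []
        current = current + sentence
    if current:
        sequences.append(current)
    # Pad every sequence up to the preset sequence length.
    return [seq + [PAD] * (seq_len - len(seq)) for seq in sequences]
```

With `seq_len=512` each returned sequence has exactly 512 entries, and no sentence is ever divided between two sequences.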

Depending on the application scenario, for example when responding to an encyclopedia question-and-answer (Baike Q&A) event, the question information entered by the user is spliced with the obtained target text paragraph to obtain a text paragraph to be extracted that contains the question information. Sentence recognition is then performed on this text paragraph to obtain its initial word vector group, from which the positions of the start word vector and the end word vector in the text paragraph are predicted. Because the user's question information is added, the extracted text information is more accurate.

Step S102. Using the pre-trained text extraction network model, predict a plurality of to-be-determined start word vectors in the initial word vector group that represent the text extraction start position.

In this embodiment, the pre-training module (GPT: Generative Pre-Training) in the pre-trained text extraction network model lets each word vector in the initial word vector group learn the semantic information of the other word vectors, yielding a first word vector group that contains contextual semantic information. The first position prediction module then obtains the start position prediction probability value of each word vector in the first word vector group, and the K to-be-determined start word vectors with the largest start position prediction probability values are determined by traversing the first word vector group.

In an exemplary embodiment of the present application, the pre-training model GPT adopts a multi-layer Transformer architecture. Through the self-attention mechanism, each word vector, after multi-layer learning, extracts grammatical, syntactic, and other deep semantic information beyond its own features. This establishes the contextual connections among the word vectors in the initial word vector group and thereby improves the accuracy of text information extraction by the text extraction network model.
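The start-position step of S102 can be illustrated with a small sketch. It assumes the contextual word vectors have already been produced by the pre-training module, and it uses the scoring form p = S(Wx + b) given later in this description with a single weight vector and scalar bias; the function names and that parameter shape are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function S(z) = 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def position_probs(vectors: Sequence[Sequence[float]],
                   weight: Sequence[float], bias: float) -> List[float]:
    """Score every word vector x with p = S(W.x + b)."""
    return [sigmoid(sum(w * x for w, x in zip(weight, vec)) + bias)
            for vec in vectors]


def top_k(probs: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (position, probability) pairs with the largest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]
```

Calling `top_k(position_probs(h, W, b), K)` yields the K to-be-determined start positions together with their start position prediction probability values.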

Step S103. According to the plurality of to-be-determined start word vectors and the initial word vector group, predict a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.

In this embodiment, to further improve the accuracy of start and end position prediction, the end position is predicted using the predicted start position information. Specifically, each of the K to-be-determined start word vectors is spliced with the initial word vector group, yielding K spliced word vector groups that are input into the second position prediction module. The second position prediction module obtains the end position prediction probability values in the spliced word vector group corresponding to each to-be-determined start word vector, and the N to-be-determined end word vectors with the largest end position prediction probability values for each to-be-determined start word vector are determined by traversing the spliced word vector group. K and N may be set to equal or different values according to the requirements of the application scenario.
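The splicing and end-position prediction of S103 might look like the sketch below for one to-be-determined start word vector. It assumes that "splicing" means concatenating the start vector with each word vector, so the end scorer sees vectors of twice the original dimension, and that the scorer has the same sigmoid form as the start scorer; the name `end_candidates` and the parameter shapes are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def end_candidates(vectors: Sequence[Sequence[float]], start_index: int,
                   weight: Sequence[float], bias: float,
                   n: int) -> List[Tuple[int, float]]:
    """Return the n most probable end positions for one start position."""
    start_vec = list(vectors[start_index])
    # Spliced group: the start vector concatenated with every word vector.
    spliced = [start_vec + list(vec) for vec in vectors]
    probs = [sigmoid(sum(w * x for w, x in zip(weight, vec)) + bias)
             for vec in spliced]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:n]]
```

Running this once per to-be-determined start position gives K lists of N end candidates, which can be set independently of K as the paragraph above notes.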

Step S104: Determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

In this embodiment, each to-be-determined start word vector is combined with its N corresponding to-be-determined end word vectors to obtain K*N initial extracted text combinations, and the to-be-determined extracted text combinations that satisfy a preset condition are selected from them. For each to-be-determined extracted text combination, the start position prediction probability value of its to-be-determined start word vector is multiplied by the end position prediction probability value of its to-be-determined end word vector. The start word vector of the combination with the largest product is the target start word vector, and the corresponding end word vector is the target end word vector, which gives the target extracted text.
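As an illustration of S104, the following sketch enumerates the K*N combinations and keeps the one with the largest probability product. The data layout (a list of start candidates and a dictionary of end candidates keyed by start position) and the caller-supplied `is_valid` predicate standing in for the preset condition are assumptions, not details taken from the application.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[int, float]  # (position number, predicted probability)


def select_span(starts: List[Candidate],
                ends_by_start: Dict[int, List[Candidate]],
                is_valid: Callable[[int, int], bool]
                ) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, product) with the largest product, or None."""
    best: Optional[Tuple[int, int, float]] = None
    for s_pos, s_prob in starts:
        for e_pos, e_prob in ends_by_start.get(s_pos, []):
            if not is_valid(s_pos, e_pos):
                continue  # combination does not meet the preset condition
            score = s_prob * e_prob
            if best is None or score > best[2]:
                best = (s_pos, e_pos, score)
    return best
```

If the position numbers are zero-based indices into the word list `tokens`, the target extracted text would then be `tokens[start:end + 1]`.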

In this embodiment, according to the above scheme, the initial word vector group of the text paragraph to be extracted, obtained by sentence recognition, is input into the pre-trained text extraction network model; a plurality of to-be-determined start word vectors in the initial word vector group are predicted; a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted according to the to-be-determined start word vectors and the initial word vector group; and the target text to be extracted is determined from the predicted to-be-determined start word vectors and their corresponding to-be-determined end word vectors. Compared with the existing regular rule method, NER model recognition method, and other mainstream technical solutions, this embodiment improves the extraction accuracy of the text extraction network model by predicting end positions from the to-be-determined start word vectors and the initial word vector group, and thus extracts the target text in the text paragraph more accurately.

Further, as a refinement and extension of the specific implementation of the above embodiment, and in order to fully describe its implementation process, another text information extraction method is provided. As shown in FIG. 2, the method includes:

Step S201, training an initial text extraction network model.

Specifically, as shown in FIG. 3, to improve the accuracy of text information extraction, the constructed initial text extraction network model includes a first position prediction module and a second position prediction module connected in series (the span modules), which predict the start position and end position of the target extracted text. A pre-training module GPT is added at the input of the first position prediction module to obtain the contextual semantic information of each word vector. In the model training stage, a correction module is added so that the model parameter updates during training of the text extraction network model are more stable.

To illustrate the specific implementation of step S201, as a preferred embodiment, step S201 may specifically include: training the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training sample; obtaining a first-stage text extraction network model when the current loss value of the first loss function in the initial text extraction network model is observed to drop to a preset percentage of the initial loss value; and performing secondary training on the first-stage text extraction network model, using the first loss function and a second loss function corresponding to the preset correction module, on training samples whose position labels are ignored, to obtain the trained text extraction network model. The correction module assists in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.

The specific steps for training the constructed initial text extraction network model include:

1) Obtain a text paragraph as a training sample. The labeled text sequence to be extracted from the training sample is preset as [w100, w101, w102, w103], where w100 corresponds to the start position of the referenced target and w103 corresponds to the end position of the referenced target.

2) Perform word segmentation on the training sample to obtain a segmented text paragraph. English text is segmented on spaces, and Chinese text is segmented with the open-source word segmentation tool jieba.

3) Divide the segmented text paragraph according to the preset sequence length to obtain one or more initial data sequences containing complete sentences. Specifically, the sequence length is set to 512 words. A text paragraph shorter than 512 words is padded to form one initial data sequence containing complete sentences. A text paragraph longer than 512 words is truncated at a complete-sentence boundary, and the truncated part is padded up to 512 words to form one initial data sequence containing complete sentences; the text remaining after truncation is treated as a new text paragraph and divided in the same way until division ends, giving multiple initial data sequences containing complete sentences.

4) Use the trained word2vec or GloVe word vector module to convert each word in the initial data sequence into a word vector, and obtain the initial word vector group, expressed as [w1, w2, ... w512].

5) Use the GPT module in the initial text extraction network model to extract the semantic features of each word in the initial word vector group, obtaining the first word vector group containing contextual semantics, expressed as [h1, h2, ... h512]. Specifically, the GPT model adopts a multi-layer Transformer architecture, and each Transformer layer contains a self-attention mechanism. Self-attention lets each word in [w1, w2, ... w512] extract feature information from the words at other positions and use it to update its own vector, capturing the deep relationships between the other words and itself. After multi-layer learning, each word vector in the initial word vector group therefore contains grammatical, syntactic, and other deep semantic information from all other positions in the group, which gives the first word vector group [h1, h2, ... h512] containing contextual semantics.

6) Input the first word vector group [h1, h2, ... h512] into the first position prediction module (span module) in the initial text extraction network model, and output the start position prediction probability value of the start word vector h100 at the target start position.

7) Splice the start word vector h100 at the target start position with each word vector in the first word vector group to obtain a spliced word vector group, expressed as [h100+h1, h100+h2, ... h100+h512].

8) Input the spliced word vector group into the second position prediction module in the initial text extraction network model, and output the end position prediction probability value of the end word vector h103 at the target end position, where the first position prediction module and the second position prediction module have the same structure. Optionally, the first and second position prediction modules may be two separate position prediction modules, or a single position prediction module that outputs the position prediction probability values of the start word vector and the end word vector in turn; the position prediction module is not specifically limited here.

9) During training, the objective is to maximize the product of the position prediction probability values of the target start position and the target end position. When the current loss value of the first loss function L1 of the position prediction module is observed to drop to 30% of the initial loss value, the target start position and target end position are set to empty, and secondary training is performed with a multi-task learning framework to obtain the trained text extraction network model. In the multi-task learning framework, the correction module assists in training the position prediction module: when the current loss value L_m of the first loss function L1 drops to 30% of the initial loss value, the first-stage text extraction network model is obtained, and it continues to be trained according to the second loss function L2 corresponding to the correction module together with the first loss function L1, giving the second-stage text extraction network model as the trained text extraction network model. Specifically:

① Input the first word vector group [h1, h2, ... h512] containing contextual semantics again into the first position prediction module in the first-stage text extraction network model, and output the start position prediction probability value of each first word vector; take the K word vectors with the largest start position prediction probability values as the to-be-determined start word vectors. Splice each to-be-determined start word vector with each word vector in the first word vector group to obtain a spliced word vector group. Through the second position prediction module in the first-stage text extraction network model, obtain the end position prediction probability value of each word vector for each to-be-determined start word vector, and take the N word vectors with the largest end position prediction probability values as the to-be-determined end word vectors. This gives K*N initial extracted text combinations.

The position prediction module calculates the position prediction probability value of a word vector as:

p = S(Wx + b)

where W is the weight and b is the bias, both network model parameters that are continuously updated through model training, and S denotes the sigmoid function:

S(z) = 1 / (1 + e^{-z})

② At the same time, input the first word vector group [h1, h2, ... h512] containing contextual semantics into the correction module to extract text information directly, that is, to achieve long text extraction by continuously predicting the next word. The correction module consists of a fully connected layer followed by a softmax layer, that is, P = softmax(wx + b), where x is the word vector group [h1, h2, ... h512] containing contextual semantics. The softmax layer predicts the probability of each word being at the next position and outputs a numeric vector whose entries sum to 1.

③ Iteratively train on the outputs of ① and ② according to the target loss function L until training ends, giving the trained text extraction network model. Specifically, the maximum number of training iterations is N rounds, where N defaults to 10000 and can be customized by the user. The target loss function is defined as L = L1 + L2. The loss function L1 is the negative logarithm of the position prediction probability values of the target start position and the target end position, and the loss function L2 is the loss of the correction module. The formulas are as follows:

L1 = -log(P_{start position}) - log(P_{end position})

L2 = -Σ_c y_{hc} log(p_{hc})

where P_{start position} is the position prediction probability value of the word vector at the target start position, output by the first position prediction module; P_{end position} is the position prediction probability value of the word vector at the target end position, output by the second position prediction module; the sum in L2 runs over the indices c of the preset vocabulary; M is the size of the preset vocabulary, set to 50000 words; y_{hc} is the label of the current word vector h, whose value is 1 at index c and 0 at all other indices, with 0<c<M; and p_{hc} is the predicted probability of the current word vector h at index c, that is, the value of the c-th dimension of the numeric vector output by the softmax layer described above.

Adding the correction module achieves multi-task training that is closer to the actual text extraction scenario. Emptying the vector labels at the text extraction positions during training causes the model's loss value to rise sharply, which increases the learning difficulty and can ultimately prevent convergence; adding the correction module to assist the first loss function makes the model parameter updates more stable. It should be noted that, in actual text extraction applications, the trained text extraction network model does not include the correction module; the correction module is used only to further optimize the model parameters of the position prediction module.

Under the PyTorch framework, with the goal of minimizing the loss function L, the stochastic gradient descent (SGD) algorithm is used to iteratively update the network model parameters W and b in the initial text extraction network model, giving the trained text extraction network model. Specifically, during training, if the difference between the loss values L_m and L_{m+1} obtained in two adjacent training rounds is less than a set value, that is, L_m - L_{m+1} < 0.01, the model is considered to have converged, training ends, and the trained text extraction network model is obtained.
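The loss terms and stopping rules of the training procedure can be sketched in plain Python as below. The sketch assumes that L1 is the negative log of the product of the target start and end probabilities (as the text describes) and that L2 is a one-hot cross-entropy over the vocabulary (as the definitions of y_hc and p_hc suggest); the function names are hypothetical, and gradient updates with SGD in PyTorch are deliberately left out.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Correction-module output: probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(p_start: float, p_end: float) -> float:
    """L1: negative log of the target start and end probabilities."""
    return -(math.log(p_start) + math.log(p_end))


def correction_loss(probs: Sequence[float], target_index: int) -> float:
    """L2 (assumed form): cross-entropy against a one-hot next-word label."""
    return -math.log(probs[target_index])


def total_loss(p_start: float, p_end: float,
               probs: Sequence[float], target_index: int) -> float:
    """Target loss L = L1 + L2 used in the second training stage."""
    return span_loss(p_start, p_end) + correction_loss(probs, target_index)


def reached_second_stage(initial_l1: float, current_l1: float,
                         ratio: float = 0.3) -> bool:
    """True once L1 has dropped to the preset share of its initial value."""
    return current_l1 <= ratio * initial_l1


def has_converged(prev_loss: float, curr_loss: float,
                  tol: float = 0.01) -> bool:
    """Literal form of the rule L_m - L_{m+1} < 0.01 from the text."""
    return prev_loss - curr_loss < tol
```

Minimizing `span_loss` is equivalent to maximizing the product of the two probabilities, which matches the training objective stated in step 9). Note that the convergence rule, taken literally, also fires when the loss increases between rounds; a real training loop may want an additional guard.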

Step S202, perform word segmentation processing on the text paragraph to be extracted, and obtain a text paragraph after word segmentation processing.

Step S203, dividing the segmented text paragraph according to the preset sequence length to obtain an initial data sequence containing complete sentences.

Step S204, performing word vector conversion processing on the initial data sequence to obtain an initial word vector group.

Step S205, according to the initial word vector group, using the pre-training module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information.

Step S206, using the first position prediction module in the pre-trained text extraction network model to obtain, according to the start position prediction probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the text extraction start position.

Step S207, for each to-be-determined start word vector, splicing the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group.

To illustrate the specific implementation of step S207, as a preferred embodiment, step S207 may specifically include: splicing the to-be-determined start word vector with each word vector in the initial word vector group to obtain a spliced word vector group.

Step S208, using the second position prediction module in the pre-trained text extraction network model to obtain, according to the end position prediction probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the text extraction end position.

Step S209: Determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

Step S210, obtaining the to-be-determined extracted text combinations satisfying the preset conditions among the initial extracted text combinations. Wherein, the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start-position number of the to-be-determined start word vector is greater than a set threshold.

In this embodiment, unlike in the training process of the text extraction network model, after the K*N initial extracted text combinations are obtained, the to-be-determined extracted text combinations that satisfy the preset condition are selected from them. The preset condition is that, in a to-be-determined extracted text combination, the end position number of the to-be-determined end word vector is greater than the start position number of the to-be-determined start word vector, and the difference between the end position number and the start position number is greater than a set threshold (for example, 2). The preset condition is not specifically limited here.
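The preset condition of step S210 amounts to a simple filter over (start, end) position pairs. The sketch below is one way to write it, with the example threshold of 2 as the default; the function name `filter_spans` is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start position number, end position number)


def filter_spans(spans: List[Span], threshold: int = 2) -> List[Span]:
    """Keep spans whose end follows the start by more than the threshold."""
    return [(start, end) for start, end in spans
            if end > start and end - start > threshold]
```

For a non-negative threshold the second test already implies the first; both are kept here only to mirror the wording of the condition.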

Step S211: determine the target start word vector and its corresponding target end word vector according to the probability product value of the predicted start-position probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the predicted end-position probability value of each of the plurality of to-be-determined end word vectors corresponding to that start word vector.

In this embodiment, the predicted start-position probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations is multiplied by the predicted end-position probability value of each of its N corresponding to-be-determined end word vectors. All probability product values are then traversed, and the extracted text combination with the largest probability product value determines the target start word vector and its corresponding target end word vector.
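
The selection by largest probability product can be sketched as below. The data layout is an assumption made for illustration: start probabilities are keyed by start position number, and end probabilities are keyed by (start, end) pair because the end prediction is conditioned on the start. The name `best_span` and all values are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start position number, end position number)


def best_span(
    start_probs: Dict[int, float],
    end_probs: Dict[Span, float],
    pending: List[Span],
) -> Tuple[Span, float]:
    """Return the pending span with the largest start*end probability product."""
    scored = [(span, start_probs[span[0]] * end_probs[span]) for span in pending]
    return max(scored, key=lambda item: item[1])


# Made-up probabilities for two spans that passed the preset condition.
start_probs = {3: 0.7, 7: 0.2}
end_probs = {(3, 9): 0.5, (7, 12): 0.9}
target_span, score = best_span(start_probs, end_probs, [(3, 9), (7, 12)])
```

In this toy case the span (3, 9) wins with a product of 0.35, even though (7, 12) has the higher end probability, because its start probability is much lower.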

Step S212: obtain the target extracted text according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector.
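
As a small illustration of step S212, the target text can be recovered by slicing the segmented words between the two position numbers. The sketch assumes 0-based position numbers and an inclusive end position, neither of which is stated in the text; `extract_text` and the sample words are hypothetical.

```python
from typing import List


def extract_text(tokens: List[str], start: int, end: int, sep: str = " ") -> str:
    """Join the tokens from position `start` to position `end`."""
    # Assumption: position numbers are 0-based and the end is inclusive.
    return sep.join(tokens[start : end + 1])


tokens = ["He", "said", "the", "plan", "will", "work", "today"]
target_text = extract_text(tokens, start=2, end=5)
```

For languages written without spaces between words, `sep=""` would be the natural choice.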

By applying the technical solution of this embodiment, the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model, and a plurality of to-be-determined start word vectors in the initial word vector group are predicted. According to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted, and the target extracted text is determined from the predicted start word vectors and their corresponding end word vectors. The pre-trained text extraction network model thus avoids the following technical problems: the existing regular-rule methods depend strongly on manually written rules and cannot completely extract complex or incomplete text information; NER model recognition is prone to overfitting and has low extraction accuracy on text containing new corpus information; and other mainstream methods extract the words of a text in isolation, which lowers extraction accuracy. The accuracy of text information extraction is thereby effectively improved.

Further, as a specific implementation of the method in FIG. 1, an embodiment of the present application provides a text information extraction device. As shown in FIG. 4, the device includes: a sentence recognition module 32, a first position prediction module 33, a second position prediction module 34, and a determining module 35.

The sentence recognition module 32 can be used to obtain the initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.

The first position prediction module 33 may be configured to use a pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.

The second position prediction module 34 may be configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each to-be-determined start word vector.

The determining module 35 may be configured to determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

In a specific application scenario, as shown in FIG. 5 , a model training module 31 is also included.

In a specific application scenario, the sentence recognition module 32 includes a word segmentation processing unit 321, a grouping division unit 322, and a word vector conversion unit 323.

The word segmentation processing unit 321 may be configured to perform word segmentation processing on the to-be-extracted text paragraph to obtain a text paragraph after word segmentation processing.

The grouping division unit 322 may be configured to obtain, according to a preset sequence length, an initial data sequence containing complete sentences.

The word vector conversion unit 323 may be configured to perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
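
The grouping step can be sketched as packing whole sentences into sequences that do not exceed the preset sequence length. This is only one plausible reading of the text: whitespace splitting stands in for real word segmentation, the greedy packing strategy is an assumption, and `pack_sentences` is a hypothetical name. Word vector conversion is not shown.

```python
from typing import List


def pack_sentences(sentences: List[str], max_len: int) -> List[List[str]]:
    """Greedily pack complete sentences into sequences of at most max_len tokens."""
    sequences: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        tokens = sentence.split()  # stand-in for word segmentation
        # Start a new sequence rather than cutting a sentence in half.
        if current and len(current) + len(tokens) > max_len:
            sequences.append(current)
            current = []
        current.extend(tokens)
    if current:
        sequences.append(current)
    return sequences


sentences = [
    "The model reads text .",
    "It predicts a start .",
    "Then it predicts an end .",
]
sequences = pack_sentences(sentences, max_len=10)
```

Note that in this sketch a single sentence longer than `max_len` is kept whole in its own sequence; how the original method handles that case is not described.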

In a specific application scenario, the first position prediction module 33 includes a pre-training unit 331 and a starting position prediction unit 332.

The pre-training unit 331 may be configured to use the pre-trained module in the pre-trained text extraction network model according to the initial word vector group to obtain a first word vector group containing contextual semantic information.

The starting position prediction unit 332 may be configured to use the first position prediction module in the pre-trained text extraction network model to obtain, according to the predicted start-position probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.

In a specific application scenario, the second position prediction module 34 includes a vector splicing unit 341 and an end position prediction unit 342.

The vector splicing unit 341 may be configured to, for each to-be-determined start word vector, splice that start word vector with the initial word vector group to obtain a spliced word vector group.

The end position prediction unit 342 may be configured to use the second position prediction module in the pre-trained text extraction network model to obtain, according to the predicted end-position probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the end position of text extraction.

In a specific application scenario, the determining module 35 includes a combination determination unit 351, a preset condition unit 352, a probability value determination unit 353, and a text extraction unit 354.

The combination determining unit 351 may be configured to determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.

The preset condition unit 352 may be configured to obtain the to-be-determined extracted text combinations that satisfy the preset condition among the initial extracted text combinations, wherein the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.

The probability value determination unit 353 may be configured to determine the target start word vector and its corresponding target end word vector according to the probability product value of the predicted start-position probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the predicted end-position probability value of each of the plurality of to-be-determined end word vectors corresponding to that start word vector.

The text extraction unit 354 may be configured to obtain the target extracted text according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector.

In a specific application scenario, the model training module 31 can be used to train the initial text extraction network model. The model training module 31 includes a first-stage training unit 311, a training monitoring unit 312, and a second-stage training unit 313.

The first-stage training unit 311 may be configured to train the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples.

The training monitoring unit 312 can be used to obtain the first-stage text extraction network model when it is monitored that the current loss value of the first loss function in the initial text extraction network model drops to a preset percentage of the initial loss value.

The second-stage training unit 313 may be configured to use the first loss function and a second loss function corresponding to a preset correction module to perform secondary training on the first-stage text extraction network model according to the training samples with the position labels ignored, so as to obtain the trained text extraction network model.
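
The monitoring rule that separates the two training stages can be sketched as a simple check on the first loss. The 30% figure and the loss values below are made up for illustration, since the text only speaks of "a preset percentage"; `reached_stage_two` is a hypothetical name, and how the two loss functions are combined in the second stage is not shown because the text does not specify it.

```python
from typing import List, Optional


def reached_stage_two(initial_loss: float, current_loss: float, percent: float) -> bool:
    """True once the first loss has dropped to `percent` % of its initial value."""
    return current_loss <= initial_loss * percent / 100.0


def find_switch_step(losses: List[float], percent: float) -> Optional[int]:
    """Return the first step at which training would move to the second stage."""
    for step, loss in enumerate(losses):
        if reached_stage_two(losses[0], loss, percent):
            return step
    return None  # the threshold was never reached


# Made-up first-loss values recorded once per training step.
losses = [2.0, 1.4, 0.9, 0.5, 0.3]
switch_step = find_switch_step(losses, percent=30.0)
```

With these values the threshold is 0.6, so the model obtained at step 3 (loss 0.5) would be taken as the first-stage text extraction network model.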

In a specific application scenario, the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.

It should be noted that for other corresponding descriptions of the functional units involved in a text information extraction device provided in the embodiment of the present application, reference may be made to the corresponding descriptions in FIG. 1 and FIG. 2 , and details are not repeated here.

Based on the methods shown in FIG. 1 and FIG. 2 above, an embodiment of the present application correspondingly provides a storage medium on which computer-readable instructions are stored. When the readable instructions are executed by a processor, the text information extraction method shown in FIG. 1 and FIG. 2 is implemented.

Based on this understanding, the technical solution of the present application can be embodied in the form of a software product, which can be stored in a storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in each implementation scenario of this application.

Based on the methods shown in FIG. 1 and FIG. 2 above and the virtual device embodiments shown in FIG. 4 and FIG. 5, in order to achieve the above purpose, an embodiment of the present application further provides a computer device, which may be a personal computer, a server, a network device, or the like. The physical device includes a storage medium and a processor; the storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the text information extraction method shown in FIG. 1 and FIG. 2.

Optionally, the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like. The user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the like, and optional user interfaces may also include a USB interface, a card reader interface, and the like. Optionally, the network interface may include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface) and the like.

Those skilled in the art can understand that the structure of the computer device provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine some components, or use a different arrangement of components.

The storage medium may also include an operating system and a network communication module. An operating system is a program that manages the hardware and software resources of a computer device and supports the operation of information processing programs and other software and/or programs. The network communication module is used to realize the communication between various components inside the storage medium, and communicate with other hardware and software in the physical device.

Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Compared with the existing text information extraction schemes based on regular rules, this embodiment uses the trained text extraction network model and thereby avoids the low accuracy and low efficiency of existing schemes that rely on manually written rules. It also solves the problem that existing methods only predict whether each word in the article is a quotation and cannot establish the necessary connection between words. The flexibility and adaptability of text extraction are thereby improved, and the accuracy of text information extraction is effectively improved.

Those skilled in the art can understand that the accompanying drawing is only a schematic diagram of a preferred implementation scenario, and the modules or processes in the accompanying drawings are not necessarily necessary for implementing the present application. Those skilled in the art can understand that the modules in the devices in the implementation scenario can be distributed among the devices in the implementation scenario according to the description of the implementation scenario, or can be located in one or more devices different from the implementation scenario according to corresponding changes. The modules of the above implementation scenarios can be combined into one module, or can be further split into multiple sub-modules.

The above serial numbers of the present application are for description only and do not represent the merits of the implementation scenarios. The above disclosures are only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any changes conceivable by those skilled in the art shall fall within the protection scope of the present application.

Claims

A text information extraction method, including:

By performing sentence recognition on the text paragraph to be extracted, the initial word vector group of the text paragraph to be extracted is obtained;

Using the pre-trained text extraction network model, predicting a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction;

Predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

According to the predicted multiple to-be-determined start word vectors and multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors, the target text to be extracted is determined.
The method according to claim 1, wherein, the initial word vector group of the text to be extracted is obtained by performing sentence recognition on the text paragraph to be extracted, specifically comprising:

Carrying out word segmentation processing on the text paragraph to be extracted, to obtain the text paragraph after word segmentation processing;

According to the preset sequence length, an initial data sequence containing a complete sentence is obtained;

Perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
The method according to claim 1, wherein using the pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction specifically comprises:

According to the initial word vector group, use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information;

Using the first position prediction module in the pre-trained text extraction network model, obtaining, according to the predicted start-position probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
The method according to claim 1 or 3, wherein predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors specifically comprises:

For each to-be-determined start word vector, splicing the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group;

Using the second position prediction module in the pre-trained text extraction network model, obtaining, according to the predicted end-position probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the end position of text extraction;

Wherein splicing the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group specifically comprises:

The to-be-determined starting word vector is spliced with each word vector in the initial word vector group to obtain a spliced word vector group.
The method according to claim 1 or 4, wherein determining the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors specifically comprises:

According to the predicted multiple to-be-determined start word vectors, and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors, determine the initial extraction text combination;

Acquiring the to-be-determined extracted text combinations that meet the preset conditions in the initial extracted text combinations;

Determining the target start word vector and its corresponding target end word vector according to the probability product value of the predicted start-position probability value of each of the to-be-determined start word vectors in the to-be-determined extracted text combinations and the predicted end-position probability value of each of the plurality of to-be-determined end word vectors corresponding to that start word vector;

Obtain the target extraction text according to the starting position serial number corresponding to the target starting word vector and the ending position serial number corresponding to the target ending word vector;

Wherein, the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start-position number of the to-be-determined start word vector is greater than a set threshold.
The method according to claim 1, further comprising:

Train the initial text extraction network model, including:

According to the position label corresponding to the start position sequence number and the end position sequence number in the training sample, train the initial text extraction network model;

When it is monitored that the current loss value of the first loss function in the initial text extraction network model drops to a preset percentage of the initial loss value, the first stage text extraction network model is obtained;

Using the first loss function and a second loss function corresponding to a preset correction module, performing secondary training on the first-stage text extraction network model according to the training samples with the position labels ignored, to obtain the trained text extraction network model.
The method according to claim 6, wherein the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
A text information extraction device, including:

The sentence recognition module is used to obtain the initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted;

The first position prediction module is used to predict, using the pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction;

The second position prediction module is used to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

The determination module is configured to determine the target text to be extracted according to the multiple to-be-determined start word vectors obtained through prediction, and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
The device according to claim 8, wherein the sentence recognition module specifically comprises:

A word segmentation processing unit, configured to perform word segmentation processing on the text paragraph to be extracted, to obtain a text paragraph after word segmentation processing;

A grouping division unit, configured to obtain an initial data sequence containing a complete sentence according to a preset sequence length;

The word vector conversion unit is configured to perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
The device according to claim 8, wherein the first position prediction module specifically comprises:

A pre-training unit, configured to use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information according to the initial word vector group;

The starting position prediction unit is used to obtain, using the first position prediction module in the pre-trained text extraction network model and according to the predicted start-position probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
The device according to claim 8 or 10, wherein the second position prediction module specifically includes:

A vector splicing unit, configured to, for each to-be-determined start word vector, splice the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group;

The end position prediction unit is used to obtain, using the second position prediction module in the pre-trained text extraction network model and according to the predicted end-position probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the end position of text extraction;

Wherein splicing the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group specifically comprises:

The to-be-determined starting word vector is spliced with each word vector in the initial word vector group to obtain a spliced word vector group.
The device according to claim 8 or 11, wherein the determining module specifically comprises:

A combination determination unit, configured to determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors;

A preset condition unit, configured to acquire the to-be-determined extracted text combinations that meet the preset conditions in the initial extracted text combinations;

A probability value determination unit, configured to determine the target start word vector and its corresponding target end word vector according to the probability product value of the predicted start-position probability value of each of the to-be-determined start word vectors in the to-be-determined extracted text combinations and the predicted end-position probability value of each of the plurality of to-be-determined end word vectors corresponding to that start word vector;

The text extraction unit is used to obtain the target extracted text according to the starting position serial number corresponding to the target starting word vector and the ending position serial number corresponding to the target ending word vector;

Wherein, the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start-position number of the to-be-determined start word vector is greater than a set threshold.
The device according to claim 8, further comprising:

The model training module is used to train the initial text extraction network model, specifically including:

The first-stage training unit is used to train the initial text extraction network model according to the position labels corresponding to the starting position serial number and the ending position serial number in the training sample;

The training monitoring unit is used to obtain the first-stage text extraction network model when the current loss value of the first loss function in the initial text extraction network model is monitored to drop to a preset percentage of the initial loss value;

The second-stage training unit is configured to use the first loss function and a second loss function corresponding to a preset correction module to perform secondary training on the first-stage text extraction network model according to the training samples with the position labels ignored, to obtain the trained text extraction network model.
The device according to claim 13, wherein the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
A storage medium, on which computer-readable instructions are stored, wherein, when the readable instructions are executed by a processor, a method for extracting text information is realized, including:

By performing sentence recognition on the text paragraph to be extracted, the initial word vector group of the text paragraph to be extracted is obtained;

Using the pre-trained text extraction network model, predicting a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction;

Predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

According to the predicted multiple to-be-determined start word vectors and multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors, the target text to be extracted is determined.
The storage medium according to claim 15, wherein the initial word vector group of the text to be extracted is obtained by performing sentence recognition on the text paragraph to be extracted, specifically comprising:

Carrying out word segmentation processing on the text paragraph to be extracted, to obtain the text paragraph after word segmentation processing;

According to the preset sequence length, an initial data sequence containing a complete sentence is obtained;

Perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
The storage medium according to claim 15, wherein using the pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction specifically comprises:

According to the initial word vector group, use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information;

Using the first position prediction module in the pre-trained text extraction network model, obtaining, according to the predicted start-position probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
A computer device, comprising a storage medium, a processor, and computer-readable instructions stored on the storage medium and operable on the processor, wherein, when the processor executes the readable instructions, a method for extracting text information is implemented, including:

By performing sentence recognition on the text paragraph to be extracted, the initial word vector group of the text paragraph to be extracted is obtained;

Using the pre-trained text extraction network model, predicting a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction;

Predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;

According to the predicted multiple to-be-determined start word vectors and multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors, the target text to be extracted is determined.
The computer device according to claim 18, wherein the initial word vector group of the text to be extracted is obtained by performing sentence recognition on the text paragraph to be extracted, specifically comprising:

Carrying out word segmentation processing on the text paragraph to be extracted, to obtain the text paragraph after word segmentation processing;

According to the preset sequence length, an initial data sequence containing a complete sentence is obtained;

Perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
The computer device according to claim 18, wherein using the pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction specifically comprises:

According to the initial word vector group, use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information;

Using the first position prediction module in the pre-trained text extraction network model, obtaining, according to the predicted start-position probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.