CN113722436A - Text information extraction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113722436A
Authority
CN
China
Prior art keywords
word vector
text
determined
initial
word
Prior art date
Legal status
Pending
Application number
CN202111007458.6A
Other languages
Chinese (zh)
Inventor
谯轶轩
陈浩
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111007458.6A priority Critical patent/CN113722436A/en
Publication of CN113722436A publication Critical patent/CN113722436A/en
Priority to PCT/CN2022/071444 priority patent/WO2023029354A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/3344 - Query execution using natural language analysis
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text information extraction method and apparatus, a computer device, and a storage medium, relating to the technical field of artificial intelligence and capable of improving the accuracy of text information extraction. The method comprises the following steps: performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group for the paragraph; predicting, in the initial word vector group using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors representing candidate text extraction starting positions; predicting, according to the to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector; and determining the target extracted text from the predicted start word vectors and the end word vectors corresponding to each of them. The method and apparatus are suitable for extracting target text from a data set.

Description

Text information extraction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text information extraction method and apparatus, a computer device, and a storage medium.
Background
As a technique for extracting specific information from text data, text information extraction has, with the development of disciplines such as artificial intelligence, advanced in the direction of digitization, intelligence, and semantization, and plays an increasingly large role in social knowledge management. The text information extraction methods in wide use today include the regular-rule approach, which extracts text through manually established filtering or matching rules based on regular expressions; named entity recognition (NER) models configured to handle the extraction task; and other mainstream approaches that predict words in the text.
In the prior art, the regular-rule approach depends on hand-crafted rules and cannot completely extract text information when faced with complex sentence environments or text with incomplete semantics; NER model recognition is prone to overfitting, and its extraction accuracy drops sharply on text containing new corpus information; and the other mainstream approaches predict words from the text in isolation, resulting in low accuracy of text information extraction.
Disclosure of Invention
In view of the above, the present application provides a text information extraction method and apparatus, a computer device, and a storage medium. The main aim is to solve the following technical problems in the prior art: the regular-rule approach depends on hand-crafted rules, limiting text extraction in complex sentence environments and for text with incomplete semantics; NER model recognition is prone to overfitting, yielding low extraction accuracy on text containing new corpus information; and extracting words from isolated text results in low text extraction accuracy.
According to an aspect of the present application, there is provided a text information extraction method, including:
performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
predicting, in the initial word vector group using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors representing candidate text extraction starting positions;
predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector; and
determining a target extracted text according to the predicted start word vectors and the end word vectors corresponding to each of them.
According to another aspect of the present application, there is provided a text information extracting apparatus including:
a sentence recognition module, configured to perform sentence recognition on the text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
a first position prediction module, configured to predict, in the initial word vector group using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors representing candidate text extraction starting positions;
a second position prediction module, configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector; and
a determining module, configured to determine the target extracted text according to the predicted start word vectors and the end word vectors corresponding to each of them.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text information extraction method.
According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the text information extraction method when executing the program.
By means of the above technical solution, compared with existing text information extraction schemes based on mainstream approaches such as regular rules and NER model recognition, the text information extraction method and apparatus, computer device, and storage medium provided by the present application perform sentence recognition on the text paragraph to be extracted to obtain its initial word vector group; predict, using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent candidate text extraction starting positions; predict, according to these start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each start word vector; and determine the target extracted text from the predicted start word vectors and the end word vectors corresponding to each of them. The pre-trained text extraction network model thereby avoids three problems: the strong dependence of the existing regular-rule approach on hand-crafted rules and its inability to completely extract complex or semantically incomplete text; the tendency of NER model recognition to overfit, with low extraction accuracy on text containing new corpus information; and the low accuracy of other mainstream approaches that predict words from isolated text. The accuracy of text information extraction is thus effectively improved.
The foregoing description is only an overview of the technical solution of the present application. To make the technical means of the present application clearer, so that it can be implemented according to the content of the specification, and to make the above and other objects, features, and advantages of the present application more readily understandable, a detailed description of the present application follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart illustrating a text information extraction method provided in an embodiment of the present application;
FIG. 2 is a flow chart illustrating another text information extraction method provided in the embodiment of the present application;
FIG. 3 is a schematic diagram of a text extraction network model architecture in a training phase according to an embodiment of the present application;
fig. 4 is a schematic structural diagram illustrating a text information extraction apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram illustrating another text information extraction apparatus according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence base technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. AI software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Regarding the technical problem of low text information extraction accuracy in the prior-art regular-rule approach, NER model recognition approach, and other mainstream approaches: taking the regular-rule approach as an example, in the context of data set citations, words such as "Survey", "Data", "Study", "Database", and "Statistics" usually appear with high frequency in citations of other text information, and these words begin with a capital letter. The regular-rule approach extracts text information by filtering for such matched citation information; however, the approach is overly simple, its extraction performance depends on the specification of hand-crafted rules, and its text extraction effect is relatively poor. On this basis, the present embodiment provides a text information extraction method, as shown in fig. 1, described here as applied to a server or other computer device. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms, for example an intelligent medical system or a digital medical platform. The method comprises the following steps:
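As a concrete illustration of the regular-rule baseline discussed above, the following is a minimal sketch; the pattern, cue words, and function name are illustrative assumptions, not the patent's actual rules:

```python
import re

# Hypothetical regex rule: flag sentences that cite a data set by matching
# high-frequency capitalized cue words such as "Survey", "Data", "Study".
CUE_PATTERN = re.compile(r'\b(Survey|Data|Study|Database|Statistics)\b')

def extract_dataset_references(paragraph):
    """Return sentences that a simple regular rule would flag as citations."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    return [s for s in sentences if CUE_PATTERN.search(s)]

refs = extract_dataset_references(
    "We train on MNIST. Data were drawn from the National Health Survey. "
    "The model converges quickly."
)
```

Such a rule misses citations that avoid the cue words and matches unrelated sentences that happen to contain them, which is exactly the brittleness described above.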
step S101, performing sentence recognition on the text paragraphs to be extracted to obtain an initial word vector group of the text paragraphs to be extracted.
In this embodiment, to facilitate processing of text information by the text extraction network model, word segmentation is performed on the text paragraph to be extracted; the segmented paragraph is divided according to a preset sequence length into one or more initial data sequences each containing complete sentences, and word vector conversion is performed on the initial data sequences to obtain the initial word vector group. Specifically, the text paragraph is divided in units of sentences, and sequences shorter than the preset sequence length are padded.
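A minimal sketch of this pre-processing, assuming naive whitespace tokenization, a period-based sentence split, and a toy sequence length (the embodiment uses 512 words and a trained word-vector module for the final conversion step):

```python
SEQ_LEN = 8   # toy value; the embodiment uses 512
PAD = "<pad>"

def sentence_split(paragraph):
    """Split a paragraph into sentences (naive period-based split)."""
    return [s.strip() + "." for s in paragraph.split(".") if s.strip()]

def build_sequence(paragraph):
    """Tokenize by whitespace and pad the word list up to SEQ_LEN.
    Naive truncation here; the embodiment cuts only at sentence boundaries."""
    tokens = []
    for sent in sentence_split(paragraph):
        tokens.extend(sent.split())
    tokens = tokens[:SEQ_LEN]
    return tokens + [PAD] * (SEQ_LEN - len(tokens))

seq = build_sequence("Cats sleep. Dogs bark.")
```

Each padded token sequence would then be mapped to word vectors to form the initial word vector group.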
In an exemplary embodiment of the present application, the segmented text paragraph is divided into sequences of 512 words, which strengthens the text extraction network model's ability to handle long text. Dividing the paragraph in units of sentences further avoids the problem that, during division, a complete sentence could be split across different data sequences, which would impair the model's ability to extract contextual semantics.
Depending on the requirements of the practical application scenario, for example a response event in encyclopedic question answering, the question information input by the user is spliced with the obtained target text paragraph to produce a text paragraph to be extracted that contains the question information; sentence recognition is then performed on this paragraph to obtain its initial word vector group, from which the positions of the start and end word vectors in the paragraph are further predicted.
Step S102, predicting, in the initial word vector group using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors representing candidate text extraction starting positions.
In this embodiment, a pre-training module (GPT) in the pre-trained text extraction network model lets each word vector in the initial word vector group learn the semantic information of the other word vectors, yielding a first word vector group containing contextual semantic information. A first position prediction module then obtains a start-position prediction probability value for each word vector in the first word vector group, and the K word vectors with the largest start-position prediction probability values are determined by traversal as the to-be-determined start word vectors.
In an exemplary embodiment of the present application, the pre-trained GPT model adopts a multi-layer Transformer architecture, whose self-attention mechanism enables each word vector, after multi-layer learning, to capture deep semantic information such as grammar and syntax beyond its own features, establishing the contextual relationship of each word vector within the initial word vector group and thereby improving the accuracy of text information extraction.
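The start-position prediction of step S102 can be sketched as follows in pure Python; the toy vectors, weights, and K are illustrative assumptions, with the context encoding assumed to have already been done by the GPT module:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_start_positions(word_vectors, w, b, k):
    """Score each word vector with p = sigmoid(w . x + b) and keep the
    top-K indices as to-be-determined start positions."""
    probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
             for x in word_vectors]
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return top_k, probs

# Toy first word vector group (assumed already context-encoded).
H = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.1], [0.7, 0.9]]
starts, start_probs = predict_start_positions(H, w=[1.0, 1.0], b=0.0, k=2)
```

Keeping K candidates rather than a single argmax is what lets the later steps compare multiple start/end combinations.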
Step S103, predicting a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector in the initial word vector group according to the plurality of to-be-determined start word vectors and the initial word vector group.
In this embodiment, to further improve the accuracy of start- and end-position prediction, the predicted start-position information is used when predicting the end positions. Specifically, each of the K to-be-determined start word vectors is vector-spliced with the initial word vector group, obtaining K spliced word vector groups that are input to the second position prediction module. The second position prediction module obtains end-position prediction probability values over each spliced word vector group, and for each to-be-determined start word vector the N word vectors with the largest end-position prediction probability values are determined by traversal as the to-be-determined end word vectors. K and N may be set equal or unequal according to the requirements of the actual application scenario.
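Step S103, splicing a to-be-determined start vector with every vector of the group and scoring end positions, can be sketched as follows; the weights and N are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_end_positions(start_vec, word_vectors, w, b, n):
    """Concatenate the start word vector with each word vector, score the
    spliced vectors with a sigmoid unit, and keep the top-N end indices."""
    probs = []
    for x in word_vectors:
        spliced = start_vec + x  # vector concatenation [h_start ; h_i]
        probs.append(sigmoid(sum(wi * xi for wi, xi in zip(w, spliced)) + b))
    top_n = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    return top_n, probs

H = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.1], [0.7, 0.9]]
start_vec = H[1]  # one predicted to-be-determined start word vector
ends, end_probs = predict_end_positions(
    start_vec, H, w=[0.5, 0.5, 1.0, 1.0], b=0.0, n=2)
```

Because the start vector is part of every spliced input, the end-position scores are conditioned on the chosen start position, which is the point of the two-stage design.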
Step S104, determining a target extracted text according to the plurality of predicted to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each start word vector.
In this embodiment, K × N initial extracted text combinations are obtained from each to-be-determined start word vector and its N to-be-determined end word vectors with the largest end-position prediction probability values. Among these K × N combinations, the to-be-determined extracted text combinations satisfying a preset condition are determined. For each such combination, the product of the start-position prediction probability value and the end-position prediction probability values of its corresponding end word vectors is computed; the start word vector in the combination with the largest product is taken as the target start word vector and its corresponding end word vector as the target end word vector, yielding the target extracted text.
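The selection rule of step S104 can be sketched as follows; taking "end must not precede start" as the preset condition is an illustrative assumption, as are the probability values:

```python
def select_target_span(start_probs, end_probs_per_start):
    """start_probs: {start_index: probability};
    end_probs_per_start: {start_index: {end_index: probability}}.
    Return the (start, end) pair maximizing the probability product,
    keeping only combinations where the end does not precede the start."""
    best, best_score = None, -1.0
    for s, p_s in start_probs.items():
        for e, p_e in end_probs_per_start[s].items():
            if e < s:  # discard invalid spans (assumed preset condition)
                continue
            score = p_s * p_e
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

span, score = select_target_span(
    {2: 0.9, 5: 0.6},
    {2: {4: 0.8, 1: 0.99}, 5: {7: 0.95, 6: 0.5}},
)
```

Here the pair (2, 4) wins because its product 0.9 × 0.8 beats every other valid combination, while the higher-probability end index 1 is discarded for preceding its start.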
In this embodiment, according to the above scheme, the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model; a plurality of to-be-determined start word vectors in the initial word vector group are predicted; a plurality of to-be-determined end word vectors corresponding to each start word vector are predicted from the start word vectors and the initial word vector group; and the target extracted text is determined from the predicted start word vectors and their corresponding end word vectors. Compared with the technical solutions of the existing regular-rule approach, NER model recognition approach, and other mainstream approaches, this scheme improves the extraction accuracy of the text extraction network model based on the start word vectors and the initial word vector group, so that the target text in a text paragraph is extracted more accurately.
Further, as a refinement and an extension of the specific implementation of the above embodiment, in order to fully describe the specific implementation process of the embodiment, another text information extraction method is provided, as shown in fig. 2, and the method includes:
step S201, training an initial text extraction network model.
Specifically, as shown in fig. 3, to improve the accuracy of text information extraction, the constructed initial text extraction network model includes a first position prediction module and a second position prediction module in series (i.e., span modules), used respectively to predict the start position and the end position of the target extracted text; a pre-training module GPT is added at the input of the first position prediction module to acquire contextual semantic information for each word vector; and in the model training stage, a correction module is additionally provided so that the updates of the model parameters obtained by training tend to stabilize more easily.
To explain a specific implementation of step S201, as a preferred embodiment, step S201 may specifically include: training the initial text extraction network model according to the position labels corresponding to the start-position sequence number and the end-position sequence number in the training samples; obtaining a first-stage text extraction network model when the current loss value of a first loss function in the initial model is observed to have fallen to a preset percentage of the initial loss value; and performing secondary training on the first-stage model, using the first loss function together with a preset second loss function corresponding to the correction module, on training samples with the position labels ignored, to obtain the trained text extraction network model. The correction module is used to assist in training the first and second position prediction modules of the first-stage model.
The specific steps of training the constructed initial text extraction network model comprise:
1) A text paragraph is obtained as a training sample, and the labeled text sequence to be extracted in the training sample is preset as [w100, w101, w102, w103], where w100 corresponds to the target start position of the citation and w103 corresponds to its target end position.
2) Word segmentation is performed on the training sample: English is segmented on whitespace, and Chinese is segmented with the open-source word segmentation tool jieba, yielding the segmented text paragraph.
3) The segmented text paragraph is divided according to the preset sequence length to obtain one or more initial data sequences containing complete sentences. Specifically, the sequence length is set to 512 words. Paragraphs shorter than 512 words are padded to establish one initial data sequence containing complete sentences. Paragraphs longer than 512 words are truncated at a complete-sentence boundary, the truncated part is padded up to 512 words to establish one initial data sequence, and the remainder of the paragraph is treated as a new paragraph and divided in the same way until division finishes, yielding multiple initial data sequences each containing complete sentences.
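The division rule of step 3), packing whole sentences into fixed-length sequences and padding the remainder, can be sketched with a toy limit; this assumes no single sentence exceeds the sequence length:

```python
SEQ_LEN = 6   # toy value; the embodiment uses 512
PAD = "<pad>"

def divide_paragraph(sentences):
    """Greedily pack whole sentences (lists of words) into sequences of at
    most SEQ_LEN words, padding each finished sequence up to SEQ_LEN, so
    no sentence is ever split across two sequences."""
    sequences, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > SEQ_LEN:
            sequences.append(current + [PAD] * (SEQ_LEN - len(current)))
            current = []
        current = current + sent
    if current:
        sequences.append(current + [PAD] * (SEQ_LEN - len(current)))
    return sequences

seqs = divide_paragraph([["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]])
```

The third sentence does not fit into the first sequence, so the first sequence is padded and the sentence starts a new one, matching the sentence-boundary truncation described above.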
4) Each word in the initial data sequences is converted into a word vector using a trained word2vec or GloVe word vector module, obtaining an initial word vector group denoted [w1, w2, … w512].
5) Semantic feature extraction is performed on each word in the initial word vector group using the GPT module in the initial text extraction network model, obtaining a first word vector group containing contextual semantics, denoted [h1, h2, … h512]. Specifically, the GPT model adopts a multi-layer Transformer architecture in which each layer includes a self-attention mechanism. This mechanism lets each word in [w1, w2, … w512] extract feature information from the words at other positions and use it to update its own vector, capturing the deep relationships between itself and the other words. After multi-layer learning, each word vector in the initial word vector group incorporates grammar, syntax, and other deep semantic information from all other word positions, yielding the first word vector group [h1, h2, … h512] containing contextual semantics.
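The self-attention update described in step 5) can be sketched in NumPy; this is a single head with no learned projections and full (non-causal) attention, purely illustrative of how each vector mixes in information from every position:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over word vectors X of shape (n, d).
    Each output row is a weighted mix of all rows of X, which is how a
    Transformer layer injects context into each word vector."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                       # (n, n) similarity
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ X                                  # context-aware vectors

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = self_attention(X)
```

A real GPT layer additionally applies learned query/key/value projections, a feed-forward block, and residual connections; the mixing step shown here is the core idea.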
6) The first word vector group [h1, h2, … h512] is input to the first position prediction module (a span module) in the initial text extraction network model, which outputs the start-position prediction probability value of the start word vector h100 at the target start position.
7) The first word vector h100 representing the target start position is spliced with each word vector in the first word vector group, yielding a spliced word vector group denoted [h100+h1, h100+h2, … h100+h512].
8) The spliced word vector group is input to the second position prediction module in the initial text extraction network model, which outputs the end-position prediction probability value of the end word vector h103 at the target end position; the first and second position prediction modules have the same structure. Optionally, they may be realized as two separate position prediction modules or as a single position prediction module that outputs the position prediction probability values of the start word vector and the end word vector in turn; the position prediction module is not specifically limited here.
9) During training, the objective is to maximize the product of the position prediction probability values of the target start and end positions. When the current loss value of the first loss function L1 of the position prediction modules has fallen to 30% of the initial loss value, the target start and end positions are set to null and secondary training is performed with a multi-task learning framework, obtaining the trained text extraction network model. In the multi-task learning framework, the correction module assists in training the position prediction modules: when the current loss value Lm of the first loss function L1 has fallen to 30% of the initial loss value, the first-stage text extraction network model is obtained; training then continues on the first-stage model with the second loss function L2 corresponding to the correction module together with the first loss function L1, obtaining the second-stage text extraction network model, which is used as the trained text extraction network model. The steps are specifically as follows:
First, the first word vector group [h1, h2, … h512] containing contextual semantics is input again to the first position prediction module of the first-stage text extraction network model, which outputs a start-position prediction probability value for each first word vector; the K word vectors with the largest start-position prediction probability values are taken as the to-be-determined start word vectors. Each to-be-determined start word vector is spliced with each word vector in the first word vector group to obtain a spliced word vector group. The second position prediction module of the first-stage model then obtains, for each to-be-determined start word vector, an end-position prediction probability value for each word vector, and the N word vectors with the largest end-position prediction probability values are taken as the to-be-determined end word vectors. K × N initial extracted text combinations are established from the K to-be-determined start word vectors and the N to-be-determined end word vectors corresponding to each.
The position prediction module is used for calculating the position prediction probability value of the word vector, and the calculation formula is as follows:
p=S(Wx+b)
wherein W is the weight and b is the bias value, both being network model parameters continuously updated through model training and learning, and S represents the sigmoid function, whose expression is as follows:
S(x) = 1 / (1 + e^(-x))
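A minimal sketch of the position prediction formula p = S(Wx + b), with illustrative (not learned) values standing in for the trained parameters W and b:

```python
import math

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x)), mapping any real score into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def position_probability(x, W, b):
    """p = S(Wx + b) for one word vector x.

    W and b are the network model parameters that training updates;
    the values used below are illustrative placeholders.
    """
    z = sum(w_i * x_i for w_i, x_i in zip(W, x)) + b
    return sigmoid(z)

p = position_probability([0.5, -1.0, 2.0], W=[0.3, 0.2, 0.1], b=0.0)
```

Because the sigmoid scores each position independently in (0, 1), the per-position outputs need not sum to 1, unlike the softmax used in the correction module below.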
Secondly, the first word vector group [h1, h2, … h512] containing context semantics is synchronously input into the correction module to directly extract text information, namely, long text extraction is realized by continuously predicting the word at the next position. Specifically, a softmax layer is added after a full connection layer, namely P = softmax(Wx + b), wherein x is the word vector group [h1, h2, … h512] containing the context semantics; the softmax layer predicts the position probability value of each next-position word and outputs a numeric vector whose elements sum to 1.
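The correction module's softmax layer, P = softmax(Wx + b), can be sketched as below; the logits are illustrative, and the output is a numeric vector summing to 1 from which the next-position word is read off:

```python
import math

def softmax(logits):
    """Numerically stable softmax: positive outputs that sum to 1."""
    m = max(logits)                          # subtract max to avoid overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (Wx + b) over a tiny 4-word vocabulary.
probs = softmax([2.0, 1.0, 0.1, -1.0])
next_word_index = probs.index(max(probs))    # predicted next-position word
```

In the patent's setting the vocabulary has M = 50000 entries rather than 4, but the mechanics are the same: the highest-probability dimension identifies the predicted next word.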
And thirdly, performing iterative training on the output results of the first and second steps according to the target loss function L until the training is finished, so as to obtain the trained text extraction network model. Specifically, the maximum number of model training iterations is N rounds, where N defaults to 10000 and can be customized by the user. The target loss function is defined as L = L1 + L2, where the loss function L1 calculates the negative logarithm of the prediction probabilities of the target starting position and the target ending position, as follows:
L1 = -log(P_start) - log(P_end)

L2 = -Σ (c = 1 to M) y_hc · log(p_hc)
wherein P_start represents the position prediction probability value of the word vector corresponding to the target starting position output by the first position prediction module; P_end represents the position prediction probability value of the word vector corresponding to the target ending position output by the second position prediction module; M is the size of the preset vocabulary, set to 50000 word vectors; y_hc indicates that dimension c of the one-hot encoding of the current word vector h has the value 1 and all other dimensions are 0, with 0 < c < M; and p_hc represents the probability that the current word vector h is word c, namely the value of dimension c of the numeric vector output by the softmax layer.
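The two terms of the target loss L = L1 + L2 can be sketched as follows; the probability values passed in are illustrative:

```python
import math

def loss_l1(p_start, p_end):
    """L1 = -(log P_start + log P_end): negative log-likelihood of the
    target starting and ending positions."""
    return -(math.log(p_start) + math.log(p_end))

def loss_l2(y, p):
    """L2 = -sum_c y_hc * log(p_hc): cross-entropy between the one-hot
    target y over the vocabulary and the softmax output p."""
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p) if y_c > 0)

# Illustrative values: confident, correct predictions give a small loss.
total_loss = loss_l1(0.8, 0.9) + loss_l2([0, 1, 0], [0.2, 0.7, 0.1])
```

Both terms go to 0 as the predicted probabilities of the correct positions and words approach 1, and grow without bound as they approach 0, which is what drives the gradient updates described below.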
The multi-task training achieved by means of the supplementary correction module brings training closer to the actual text extraction scenario. Meanwhile, because the vector label at the position of an extracted empty text causes the loss value of the model to increase suddenly during training, increasing the learning difficulty so that the model ultimately cannot converge, the supplementary correction module assists in correcting the first loss function of the model, making the updating of the model parameters more stable. It should be noted that, in the actual application of text extraction, the trained text extraction network model does not include the correction module; the correction module is only used for further optimizing the model parameters in the position prediction module.
In a PyTorch framework, the loss function L is minimized, and the network model parameters W and b in the initial text extraction network model are iteratively updated by using the stochastic gradient descent algorithm (SGD) to obtain the trained text extraction network model. Specifically, in the model training process, if the difference between the loss values Lm and Lm+1 obtained in two adjacent training iterations is less than a set value, i.e. Lm - Lm+1 < 0.01, the model is considered to have converged, the training is judged to be finished, and the trained text extraction network model is obtained.
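A minimal sketch of this training loop with the stopping rule Lm - Lm+1 < 0.01; a one-parameter quadratic loss stands in for the full model, and the PyTorch/SGD machinery is reduced to a plain gradient step:

```python
def train(loss_fn, grad_fn, w, lr=0.1, max_rounds=10000, tol=0.01):
    """Gradient-descent loop with the convergence criterion described
    above: stop when the loss drop between two adjacent rounds is
    below tol, or after max_rounds iterations."""
    prev = loss_fn(w)
    cur = prev
    for _ in range(max_rounds):
        w = w - lr * grad_fn(w)      # update the model parameters (W, b)
        cur = loss_fn(w)
        if prev - cur < tol:         # L_m - L_{m+1} < 0.01: converged
            break
        prev = cur
    return w, cur

# Stand-in loss (w - 3)^2 with gradient 2(w - 3); minimum at w = 3.
w_final, loss_final = train(lambda w: (w - 3.0) ** 2,
                            lambda w: 2.0 * (w - 3.0),
                            w=0.0)
```

In the real model, `loss_fn` would be L = L1 + L2 evaluated over a training batch and the gradient step would be performed by `torch.optim.SGD`; only the stopping rule is specific to this scheme.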
Step S202, performing word segmentation processing on the text paragraphs to be extracted to obtain the text paragraphs after word segmentation processing.
Step S203, obtaining an initial data sequence containing the complete sentence according to the preset sequence length.
And step S204, performing word vector conversion processing on the initial data sequence to obtain an initial word vector group.
And S205, extracting a pre-training module in the network model by using the pre-trained text according to the initial word vector group to obtain a first word vector group containing context semantic information.
Step S206, obtaining, by using the first position prediction module in the pre-trained text extraction network model, a plurality of starting word vectors to be determined for representing the text extraction starting position in the initial word vector group according to the starting position prediction probability value of each word vector in the first word vector group.
Step S207, aiming at each starting word vector to be determined, splicing the starting word vector to be determined and the initial word vector group to obtain a spliced word vector group.
For illustrating the specific implementation of step S207, as a preferred embodiment, step S207 may specifically include: respectively splicing the starting word vector to be determined with each word vector in the initial word vector group to obtain a spliced word vector group.
And S208, obtaining, by using the second position prediction module in the pre-trained text extraction network model, a plurality of ending word vectors to be determined for representing the text extraction ending position in the spliced word vector group according to the ending position prediction probability value of each word vector in the spliced word vector group.
Step S209, determining an initial extracted text combination according to the plurality of predicted starting word vectors to be determined and the plurality of ending word vectors to be determined corresponding to each starting word vector to be determined.
And step S210, acquiring the extracted text combination to be determined which meets the preset conditions in the initial extracted text combination. Wherein the preset conditions at least include: and the difference value between the sequence number of the end position corresponding to the end word vector to be determined and the sequence number of the start position of the start word vector to be determined is greater than the set threshold value.
In this embodiment, different from the text extraction network model training process, after the K × N initial extracted text combinations are obtained, the extracted text combinations to be determined that satisfy a preset condition are selected from the K × N initial extracted text combinations. The preset condition is that the ending position sequence number corresponding to the ending word vector to be determined in the extracted text combination is greater than the starting position sequence number of the starting word vector to be determined, and the difference between the two is greater than a set threshold value (e.g., 2); the preset condition is not specifically limited here.
Step S211, determining a target starting word vector and the target ending word vector corresponding to it according to the probability product values of the starting position prediction probability value of each starting word vector to be determined in the extracted text combinations to be determined and the ending position prediction probability values of the plurality of ending word vectors to be determined corresponding to each starting word vector to be determined.
In this embodiment, the starting position prediction probability value of each starting word vector to be determined in the extracted text combinations to be determined is multiplied by the ending position prediction probability values of the N ending word vectors to be determined corresponding to it; each probability product value is then traversed, and the extracted text combination with the largest probability product value determines the target starting word vector and the target ending word vector corresponding to it.
Step S212, a target extraction text is obtained according to the starting position sequence number corresponding to the target starting word vector and the ending position sequence number corresponding to the target ending word vector.
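Steps S210 to S212 can be sketched together as follows; the token list and the candidate (start, end, probability product) triples are illustrative:

```python
def extract_target(tokens, candidates, threshold=2):
    """Keep candidate spans whose ending sequence number exceeds the
    starting sequence number by more than the threshold (step S210),
    pick the span with the largest probability product value (S211),
    and slice out the target extracted text (S212)."""
    valid = [(s, e, p) for s, e, p in candidates if e - s > threshold]
    if not valid:
        return None
    s, e, _ = max(valid, key=lambda c: c[2])
    return tokens[s:e + 1]  # ending position is inclusive

tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
cands = [(0, 1, 0.90),   # rejected: 1 - 0 <= threshold
         (1, 4, 0.42),   # valid, largest product value
         (2, 5, 0.30)]   # valid
target = extract_target(tokens, cands)
```

Note how the highest-probability candidate span (0, 1) is discarded by the preset condition before the product values are compared, which is exactly the filtering the embodiment describes.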
By applying the technical scheme of this embodiment, the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model; a plurality of starting word vectors to be determined in the initial word vector group are predicted, and a plurality of ending word vectors to be determined corresponding to each starting word vector to be determined are predicted according to the plurality of starting word vectors to be determined and the initial word vector group, so that the target extracted text is determined according to the predicted starting word vectors to be determined and the ending word vectors to be determined corresponding to each of them. Therefore, the pre-trained text information extraction network model effectively avoids the problems that the existing regular-rule approach depends strongly on artificial rules and cannot completely extract complex or incomplete text information; that NER model recognition easily overfits and yields low extraction accuracy when facing text containing new corpus information; and that other mainstream approaches extract words from isolated text, resulting in low extraction accuracy. The accuracy of text information extraction is thereby effectively improved.
Further, as a specific implementation of the method in fig. 1, an embodiment of the present application provides a text information extraction apparatus, as shown in fig. 4, the apparatus includes: sentence recognition module 32, first position prediction module 33, second position prediction module 34, and determination module 35.
The sentence recognition module 32 may be configured to perform sentence recognition on the text passage to be extracted to obtain an initial word vector group of the text passage to be extracted.
The first position prediction module 33 may be configured to predict, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group, which are used to represent a text extraction start position.
The second position prediction module 34 may be configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.
The determining module 35 may be configured to determine the target extracted text according to a plurality of to-be-determined start word vectors obtained through prediction and a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
In a specific application scenario, as shown in fig. 5, a model training module 31 is further included.
In a specific application scenario, the sentence recognition module 32 includes a word segmentation processing unit 321, a grouping division unit 322, and a word vector conversion unit 323.
The word segmentation processing unit 321 may be configured to perform word segmentation processing on the text paragraphs to be extracted to obtain text paragraphs after word segmentation processing.
The packet dividing unit 322 may be configured to obtain an initial data sequence including a complete sentence according to a preset sequence length.
The word vector conversion unit 323 may be configured to perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
In a specific application scenario, the first position prediction module 33 includes a pre-training unit 331 and a starting position prediction unit 332.
The pre-training unit 331 may be configured to obtain, according to the initial word vector group, a first word vector group including context semantic information by using a pre-training module in the pre-trained text extraction network model.
The starting position predicting unit 332 may be configured to obtain, by using the first position prediction module in the pre-trained text extraction network model, a plurality of starting word vectors to be determined for representing the text extraction starting position in the initial word vector group according to the starting position prediction probability value of each word vector in the first word vector group.
In a specific application scenario, the second position prediction module 34 includes a vector stitching unit 341 and an end position prediction unit 342.
The vector splicing unit 341 may be configured to splice, for each to-be-determined start word vector, the to-be-determined start word vector and the initial word vector group to obtain a spliced word vector group.
The ending position predicting unit 342 may be configured to obtain, by using the second position prediction module in the pre-trained text extraction network model, a plurality of ending word vectors to be determined for representing the text extraction ending position in the spliced word vector group according to the ending position prediction probability value of each word vector in the spliced word vector group.
In a specific application scenario, the determining module 35 includes a combination determining unit 351, a preset condition unit 352, a probability determining unit 353, and a text extracting unit 354.
The combination determining unit 351 may be configured to determine an initial extracted text combination according to the plurality of predicted starting word vectors to be determined and the plurality of ending word vectors to be determined corresponding to each starting word vector to be determined.
A preset condition unit 352, configured to obtain an extracted text combination to be determined, which meets a preset condition, from the initial extracted text combinations; wherein the preset conditions at least include: and the difference value between the sequence number of the end position corresponding to the end word vector to be determined and the sequence number of the start position of the start word vector to be determined is greater than the set threshold value.
The probability determining unit 353 may be configured to determine a target starting word vector and the target ending word vector corresponding to it according to the probability product values of the starting position prediction probability value of each starting word vector to be determined in the extracted text combinations to be determined and the ending position prediction probability values of the plurality of ending word vectors to be determined corresponding to each starting word vector to be determined.
The text extracting unit 354 may be configured to obtain a target extracted text according to the starting position sequence number corresponding to the target starting word vector and the ending position sequence number corresponding to the target ending word vector.
In a specific application scenario, the model training module 31 may be configured to train an initial text extraction network model. The model training module 31 includes a first stage training unit 311, a training monitoring unit 312, and a second stage training unit 313.
The first stage training unit 311 may be configured to train the initial text extraction network model according to the position labels corresponding to the starting position sequence number and the ending position sequence number in the training sample.
The training monitoring unit 312 may be configured to obtain a first-stage text extraction network model when it is monitored that the current loss value of the first loss function in the initial text extraction network model decreases to a preset percentage of the initial loss value.
The second-stage training unit 313 may be configured to perform secondary training on the first-stage text extraction network model according to a training sample in which the position tag is omitted, using the first loss function and a second loss function corresponding to a preset correction module, to obtain a trained text extraction network model.
In a specific application scenario, the modification module is used for assisting in training a first position prediction module and a second position prediction module in the first-stage text extraction network model.
It should be noted that other corresponding descriptions of the functional units related to the text information extraction device provided in the embodiment of the present application may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.
Based on the method shown in fig. 1 and fig. 2, correspondingly, the embodiment of the present application further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the text information extraction method shown in fig. 1 and fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium (such as a CD-ROM, a USB flash disk, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the implementation scenarios of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 4 and fig. 5, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the entity device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the above-described text information extraction method as shown in fig. 1 and 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, a sensor, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be understood by those skilled in the art that the computer device structure provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or adopt a different arrangement of components.
The storage medium may further include an operating system and a network communication module. An operating system is a program that manages the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. By applying the technical scheme of the present application, compared with the existing text information extraction scheme based on regular rules, the trained text extraction network model effectively avoids the technical problems that the existing schemes depend on artificial rules and have low accuracy and efficiency, and meanwhile solves the problem that predicting only whether each individual character in an article is quoted or not fails to establish the necessary connections between characters, thereby improving the flexibility and adaptability of text extraction and effectively improving the accuracy of text information extraction.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A text information extraction method is characterized by comprising the following steps:
performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
predicting a plurality of starting word vectors to be determined, which are used for representing the text extraction starting position, in the initial word vector group by using a pre-trained text extraction network model;
predicting a plurality of to-be-determined end word vectors corresponding to each to-be-determined starting word vector in the initial word vector group according to the plurality of to-be-determined starting word vectors and the initial word vector group;
and determining a target extraction text according to a plurality of to-be-determined starting word vectors obtained through prediction and a plurality of to-be-determined ending word vectors corresponding to each to-be-determined starting word vector.
2. The method according to claim 1, wherein the obtaining of the initial word vector group of the text to be extracted by performing sentence recognition on the text passage to be extracted specifically comprises:
performing word segmentation processing on the text paragraphs to be extracted to obtain the text paragraphs after the word segmentation processing;
obtaining an initial data sequence containing a complete statement according to a preset sequence length;
and performing word vector conversion processing on the initial data sequence to obtain an initial word vector group.
3. The method according to claim 1, wherein the predicting, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group for characterizing a text extraction start position specifically includes:
according to the initial word vector group, utilizing the pre-trained text to extract a pre-training module in a network model to obtain a first word vector group containing context semantic information;
and obtaining, by using a first position prediction module in the pre-trained text extraction network model, a plurality of starting word vectors to be determined for representing the text extraction starting position in the initial word vector group according to the starting position prediction probability value of each word vector in the first word vector group.
4. The method according to claim 1 or 3, wherein the predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group specifically includes:
for each starting word vector to be determined, splicing the starting word vector to be determined and the initial word vector group to obtain a spliced word vector group;
a second position prediction module in the pre-trained text extraction network model is used for obtaining a plurality of end word vectors to be determined for representing the text extraction end positions in the spliced word vector group according to the end position prediction probability value of each word vector in the spliced word vector group;
wherein the splicing processing performed on the starting word vector to be determined and the initial word vector group to obtain a spliced word vector group specifically includes:
respectively splicing the starting word vector to be determined with each word vector in the initial word vector group to obtain a spliced word vector group.
5. The method according to claim 1 or 4, wherein the determining a target extracted text according to a plurality of to-be-determined start word vectors obtained by prediction and a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors specifically includes:
determining an initial extracted text combination according to a plurality of pre-measured starting word vectors to be determined and a plurality of ending word vectors to be determined corresponding to each starting word vector to be determined;
acquiring an extracted text combination to be determined which meets a preset condition in the initial extracted text combination;
determining a target starting word vector and a target ending word vector corresponding to the target starting word vector according to the starting position prediction probability value of each starting word vector to be determined in the text combination to be determined and the probability product value of the ending position prediction probability values of a plurality of ending word vectors to be determined corresponding to each starting word vector to be determined;
obtaining a target extraction text according to the initial position sequence number corresponding to the target initial word vector and the end position sequence number corresponding to the target end word vector;
wherein the preset conditions at least include: and the difference value between the sequence number of the end position corresponding to the end word vector to be determined and the sequence number of the start position of the start word vector to be determined is greater than the set threshold value.
6. The method of claim 1, further comprising:
training an initial text extraction network model specifically comprises the following steps:
training the initial text extraction network model according to the position labels corresponding to the initial position sequence number and the end position sequence number in the training sample;
when it is monitored that the current loss value of a first loss function in the initial text extraction network model is reduced to a preset percentage of the initial loss value, a first-stage text extraction network model is obtained;
and performing secondary training on the first-stage text extraction network model by using the first loss function and a preset second loss function corresponding to the correction module according to the training sample neglecting the position label to obtain a trained text extraction network model.
7. The method of claim 6, wherein the modification module is used to assist in training a first location prediction module and a second location prediction module in the first stage text extraction network model.
8. A text information extraction device characterized by comprising:
the sentence recognition module is used for performing sentence recognition on the text paragraphs to be extracted to obtain an initial word vector group of the text paragraphs to be extracted;
the first position prediction module is used for predicting a plurality of initial word vectors to be determined, which are used for representing the text extraction starting position, in the initial word vector group by utilizing a pre-trained text extraction network model;
the second position prediction module is used for predicting a plurality of to-be-determined end word vectors corresponding to each to-be-determined starting word vector in the initial word vector group according to the plurality of to-be-determined starting word vectors and the initial word vector group;
and the determining module is used for determining the target extraction text according to the plurality of starting word vectors to be determined obtained through prediction and the plurality of ending word vectors to be determined corresponding to each starting word vector to be determined.
9. A computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the text information extraction method according to any one of claims 1 to 7 when executing the program.
10. A storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the text information extraction method of any one of claims 1 to 7.
CN202111007458.6A 2021-08-30 2021-08-30 Text information extraction method and device, computer equipment and storage medium Pending CN113722436A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111007458.6A CN113722436A (en) 2021-08-30 2021-08-30 Text information extraction method and device, computer equipment and storage medium
PCT/CN2022/071444 WO2023029354A1 (en) 2021-08-30 2022-01-11 Text information extraction method and apparatus, and storage medium and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111007458.6A CN113722436A (en) 2021-08-30 2021-08-30 Text information extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113722436A true CN113722436A (en) 2021-11-30

Family

ID=78679376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007458.6A Pending CN113722436A (en) 2021-08-30 2021-08-30 Text information extraction method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113722436A (en)
WO (1) WO2023029354A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023029354A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Text information extraction method and apparatus, and storage medium and computer device

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116016416B (en) * 2023-03-24 2023-08-04 深圳市明源云科技有限公司 Junk mail identification method, device, equipment and computer readable storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN110674271A (en) * 2019-08-27 2020-01-10 Tencent Technology (Shenzhen) Co., Ltd. Question and answer processing method and device
CN111597314A (en) * 2020-04-20 2020-08-28 iFLYTEK Co., Ltd. Reasoning question-answering method, device and equipment
CN112446216A (en) * 2021-02-01 2021-03-05 East China Jiaotong University Method and device for identifying nested named entities by fusing core word information
CN112464656A (en) * 2020-11-30 2021-03-09 iFLYTEK Co., Ltd. Keyword extraction method and device, electronic equipment and storage medium
CN112685548A (en) * 2020-12-31 2021-04-20 Zhongke Xunfei Internet (Beijing) Information Technology Co., Ltd. Question answering method, electronic device and storage device
CN113255327A (en) * 2021-06-10 2021-08-13 Tencent Technology (Shenzhen) Co., Ltd. Text processing method and device, electronic equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259663B (en) * 2020-01-14 2023-05-26 北京百度网讯科技有限公司 Information processing method and device
CN112464641B (en) * 2020-10-29 2023-01-03 平安科技(深圳)有限公司 BERT-based machine reading understanding method, device, equipment and storage medium
CN113051926B (en) * 2021-03-01 2023-06-23 北京百度网讯科技有限公司 Text extraction method, apparatus and storage medium
CN113268571A (en) * 2021-07-21 2021-08-17 北京明略软件系统有限公司 Method, device, equipment and medium for determining correct answer position in paragraph
CN113722436A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Text information extraction method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2023029354A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN107168952B (en) Information generation method and device based on artificial intelligence
CN109284399B (en) Similarity prediction model training method and device and computer readable storage medium
CN107066449A (en) Information-pushing method and device
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN113127624B (en) Question-answer model training method and device
CN109086303A (en) The Intelligent dialogue method, apparatus understood, terminal are read based on machine
CN110795913B (en) Text encoding method, device, storage medium and terminal
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
EP4131076A1 (en) Serialized data processing method and device, and text processing method and device
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN115309877B (en) Dialogue generation method, dialogue model training method and device
WO2023029354A1 (en) Text information extraction method and apparatus, and storage medium and computer device
CN110347802A (en) A kind of text analyzing method and device
CN109933216B (en) Word association prompting method, device and equipment for intelligent input and computer storage medium
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
CN115309915B (en) Knowledge graph construction method, device, equipment and storage medium
CN113836268A (en) Document understanding method and device, electronic equipment and medium
CN110489744B (en) Corpus processing method and device, electronic equipment and storage medium
CN113343692A (en) Search intention recognition method, model training method, device, medium and equipment
CN110807097A (en) Method and device for analyzing data
CN117556005A (en) Training method of quality evaluation model, multi-round dialogue quality evaluation method and device
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
CN114611529A (en) Intention recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination