CN113657104A - Text extraction method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN113657104A (application CN202111015728.8A)
- Authority
- CN
- China
- Prior art keywords: text, vector, processed, extraction model, coding
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/279: Electric digital data processing; handling natural language data; natural language analysis; recognition of textual entities
- G06F40/30: Electric digital data processing; handling natural language data; semantic analysis
- G06N3/044: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; recurrent networks, e.g. Hopfield networks
- G06N3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
The embodiments of this application belong to the field of artificial intelligence and are applied to the medical field. They relate to a text extraction method comprising: obtaining a text to be processed and a preset extraction model, inputting the text to be processed into the preset extraction model, and encoding the text to be processed with the model's coding layer to obtain a target coding vector; inputting the target coding vector into a first network layer and calculating a first network vector; inputting the first network vector into a second network layer, calculating a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector; and inputting the target feature vector into a discrimination layer, calculating an optimal labeling sequence, and obtaining the entity information corresponding to the optimal labeling sequence as the target extracted text. The application also provides a text extraction device, a computer device, and a storage medium, and further relates to blockchain technology: the target extracted text can be stored on a blockchain. The method and device achieve accurate extraction of text.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text extraction method and apparatus, a computer device, and a storage medium.
Background
Consider a piece of medical text such as a drug instruction sheet, from which the indications corresponding to the drug must be extracted. For example: "This product is a white tablet, a light blue or light green tablet with added colorant, or a film-coated tablet. It can be used to treat duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. For duodenal ulcer or pathological hypersecretion states, take 0.2-0.4 g at a time, 4 times a day after meals and before sleep, or 0.8 g once before sleep." From this text, 5 indications should be extracted: duodenal ulcer; gastric ulcer; reflux esophagitis; stress ulcer; Zollinger-Ellison syndrome.
At present, however, drug indications are identified mainly by collecting indication names into an indication dictionary, reading the drug instruction sheet, matching it against the indication dictionary under a maximum-matching rule, and outputting the matched indications. With this method only indications already present in the dictionary can be identified; new indications cannot be recognized, so the generalization ability is too poor and the recognition ability is low.
Disclosure of Invention
The embodiments of this application aim to provide a text extraction method and device, a computer device, and a storage medium, so as to solve the technical problem that text extraction is not accurate enough.
In order to solve the above technical problem, an embodiment of the present application provides a text extraction method, which adopts the following technical solutions:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
Further, the step of encoding the text to be processed according to the coding layer of the preset extraction model includes:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
Further, the step of obtaining the label information of the text to be processed and encoding the label information according to the coding layer to obtain a first vector includes:
acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
coding the pinyin text, the radical text, and the back-translation text according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
Further, the step of inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed includes:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed, and calculating a forward hidden vector;
inputting the target coding vector into the backward long short-term memory network in the reverse order of the text to be processed, and calculating a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
Further, before the step of obtaining the text to be processed and the preset extraction model, the method further comprises:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
Further, the step of labeling the real extracted text and the corpus text to obtain a labeled text includes:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
Further, the step of inputting the labeled text into the basic extraction model and calculating the loss function includes:
inputting the labeled text into the basic extraction model to obtain a predicted extracted text of the labeled text;
and calculating the loss function of the basic extraction model according to the predicted extracted text and the real extracted text.
In order to solve the above technical problem, an embodiment of the present application further provides a text extraction device, which adopts the following technical solutions:
the acquisition module is used for acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
the first calculation module is used for inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed;
the second calculation module is used for inputting the first network vector into a second network layer of the preset extraction model, calculating a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
and the confirming module is used for inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device comprising a memory and a processor, where the processor, when executing computer readable instructions stored in the memory, implements the following steps:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium storing computer readable instructions which, when executed by a processor, implement the following steps:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
According to the above scheme, a text to be processed and a preset extraction model are obtained, the text to be processed is input into the preset extraction model, and the coding layer of the model encodes it into a target coding vector, so the text to be processed can be expressed accurately by the target coding vector. The target coding vector is then input into the first network layer of the preset extraction model, which calculates the first network vector corresponding to the text to be processed; the first network vector extracts and expresses the contextual feature information of the text. Next, the first network vector is input into the second network layer of the preset extraction model, which calculates the target feature matrix, and self-attention over the target feature matrix yields the target feature vector. Finally, the target feature vector is input into the discrimination layer of the preset extraction model, the optimal labeling sequence corresponding to the text to be processed is calculated, and the entity information corresponding to that sequence is determined as the target extracted text. This improves the generalization ability of the model: texts that do not appear in a dictionary can be identified and extracted, the improved model extracts text according to contextual semantic information, the recall rate and accuracy of the model are improved, and the dictionary no longer needs continuous maintenance, which saves resources.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text extraction method according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a text extraction device according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: the text extraction device 300, the acquisition module 301, the first calculation module 302, the second calculation module 303 and the confirmation module 304.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the text extraction method provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the text extraction apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a method of text extraction according to the present application is shown. The text extraction method comprises the following steps:
step S201, acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
In this embodiment, the text to be processed is the text from which information must be recognized and extracted. For drug-indication extraction, for example, the indications corresponding to a drug must be extracted from the drug instruction sheet, and that instruction sheet is the text to be processed. The preset extraction model is an extraction model trained in advance; it comprises a coding layer, a first network layer, a second network layer, and a discrimination layer. When the text to be processed is obtained, it is encoded by the coding layer of the preset extraction model to obtain the corresponding target coding vector. The target coding vector can be obtained with the encoder structure of BERT (a pre-trained language model): specifically, position coding, classification coding, and embedding coding are performed on the text to be processed to obtain a first, a second, and a third coding result, and the three coding results are spliced to obtain the target coding vector.
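A minimal sketch of this encoding step in PyTorch may help. The class name, vocabulary size, and dimensions below are illustrative assumptions; the description only fixes that position coding, classification coding, and embedding coding are produced and then spliced into the target coding vector.

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    """Hypothetical coding layer: three encodings of the input, spliced."""
    def __init__(self, vocab_size=21128, dim=256, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)  # embedding coding
        self.pos_emb = nn.Embedding(max_len, dim)       # position coding
        self.seg_emb = nn.Embedding(2, dim)             # classification coding

    def forward(self, token_ids, seg_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        pos_ids = pos_ids.unsqueeze(0).expand_as(token_ids)
        # Splice the first, second, and third coding results.
        return torch.cat([self.token_emb(token_ids),
                          self.pos_emb(pos_ids),
                          self.seg_emb(seg_ids)], dim=-1)
```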
Step S202, inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
In this embodiment, the first network layer adopts a bidirectional long short-term memory (BiLSTM) network composed of a forward long short-term memory network and a backward long short-term memory network. The target coding vector is fed into the forward network in the forward order of the text to be processed to obtain a forward hidden vector, and into the backward network in the reverse order of the text to be processed to obtain a backward hidden vector; the forward and backward hidden vectors are spliced to obtain the first network vector. The first network vector contains all forward and backward information, so the semantic features of the text context can be represented accurately.
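As a sketch (assuming PyTorch, with illustrative sizes), the whole first network layer reduces to one bidirectional LSTM call: nn.LSTM with bidirectional=True runs the forward and backward passes and concatenates the two hidden vectors at each position, matching the splicing step described above.

```python
import torch
import torch.nn as nn

coding_dim, hidden_dim = 768, 256                  # illustrative sizes
bilstm = nn.LSTM(coding_dim, hidden_dim, batch_first=True, bidirectional=True)

target_coding = torch.randn(1, 50, coding_dim)     # (batch, seq_len, dim)
first_network_vector, _ = bilstm(target_coding)    # (1, 50, 2 * hidden_dim)
```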
Step S203, inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
In this embodiment, when the first network vector is obtained, it is input into the second network layer, which performs feature extraction on it to obtain the target feature matrix. Specifically, the second network layer may use a neural network such as a convolutional neural network or a gated convolutional network. Taking a gated convolutional network (GCNN) as an example: the first network vector is passed through two parallel convolutions to obtain a first convolution result and a second convolution result; one convolution result is fed into a preset activation function (a sigmoid) to obtain a gating result; the gating result is then multiplied with the other convolution result, which did not pass through the activation function, to obtain the target feature matrix. Once the target feature matrix is obtained, self-attention is calculated over it to obtain the target feature vector. When drug indications are extracted from a drug instruction sheet, the target feature vector is the feature vector corresponding to that instruction sheet.
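The following sketch shows the gated convolution and the self-attention step under stated assumptions: the kernel size and dimensions are illustrative, and only the gating formula (a sigmoid branch multiplied element-wise with an ungated branch) and scaled dot-product self-attention follow the description above.

```python
import math
import torch
import torch.nn as nn

class GatedConvSelfAttention(nn.Module):
    def __init__(self, dim=512, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv_a = nn.Conv1d(dim, dim, kernel, padding=pad)  # first convolution
        self.conv_b = nn.Conv1d(dim, dim, kernel, padding=pad)  # second convolution

    def forward(self, x):                      # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                  # Conv1d expects (batch, dim, seq_len)
        gate = torch.sigmoid(self.conv_a(h))   # gating result via the activation
        feat = (gate * self.conv_b(h)).transpose(1, 2)  # target feature matrix
        # Self-attention over the target feature matrix.
        scores = feat @ feat.transpose(1, 2) / math.sqrt(feat.size(-1))
        return torch.softmax(scores, dim=-1) @ feat     # target feature vector
```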
Step S204, inputting the target feature vector into the discrimination layer of the preset extraction model, calculating the optimal labeling sequence corresponding to the text to be processed, acquiring the entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In this embodiment, when the target feature vector is obtained, it is input into the discrimination layer of the preset extraction model, which may adopt a discriminative model such as a conditional random field (CRF) or a hidden Markov model (HMM). Taking the CRF, an undirected graphical model, as an example: the target feature vector is scored by the CRF, and for each word the label with the highest score is selected; these labels combine into the optimal labeling sequence. For example, for w0 the label "B-drug" with a score of 1.5 is selected, 1.5 being w0's highest score and "B-drug" therefore its highest-scoring label. When the optimal labeling sequence is obtained, the entity information corresponding to it is collected and taken as the target extracted text of the text to be processed.
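A compact Viterbi decode illustrates how such an optimal labeling sequence can be computed, assuming the discrimination layer is a CRF with per-word emission scores (the target feature vector projected to tag scores) and a learned transition matrix; both input shapes are assumptions the text does not fix.

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) scores; transitions: (n_tags, n_tags)."""
    seq_len, n_tags = emissions.shape
    score = emissions[0]              # best score ending in each tag so far
    history = []
    for t in range(1, seq_len):
        # score[i] + transitions[i][j] + emissions[t][j], maximized over i
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        history.append(best_prev)
    path = [int(score.argmax())]
    for best_prev in reversed(history):          # backtrack the best path
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))                  # optimal labeling sequence
```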
It should be emphasized that, to further ensure the privacy and security of the target extracted text, it may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The method and device provided here improve the generalization ability of the model: texts that do not appear in a dictionary can also be recognized and extracted, and the improved model can recognize and extract text according to contextual semantic information, improving the recall rate and accuracy of the model; the dictionary no longer needs continuous maintenance, which saves resources.
In some optional implementation manners of this embodiment, the step of encoding the text to be processed according to the coding layer of the preset extraction model includes:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
In this embodiment, when the text to be processed is encoded by the coding layer, its label information is obtained. Label information is an alternative representation of the text to be processed, such as its pinyin or its radicals. The label information is encoded by the coding layer to obtain the first vector, and the words of the text to be processed are encoded by the coding layer to obtain the second vector. The first and second vectors are then spliced into the target coding vector.
In this embodiment, encoding both the label information and the words of the text to be processed expresses the text accurately, so the calculated target coding vector reflects the text's features more precisely.
In some optional implementation manners of this embodiment, the step of obtaining the label information of the text to be processed and encoding the label information according to the coding layer to obtain the first vector includes:
acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
coding the pinyin text, the radical text, and the back-translation text according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
In this embodiment, to encode and represent the text to be processed more accurately, the coding layer obtains its pinyin text, radical text, and back-translation text, which together form the label information of the text to be processed. The pinyin text, radical text, and back-translation text are encoded separately by the coding layer into the corresponding pinyin codes, radical codes, and back-translation codes, and self-attention is then calculated over these codes to obtain the first vector. Specifically, the pinyin codes, radical codes, and back-translation codes are spliced into one vector; self-attention over the spliced vector calculates the similarities between the vectors and expresses them as a probability distribution; finally, the probability distribution is multiplied with the coding matrices to obtain the first vector.
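A sketch of this fusion follows, assuming each code is a (seq_len, dim) matrix produced by the coding layer; the softmax similarity and the final multiplication follow the description above, while the exact shapes and scaling are assumptions.

```python
import math
import torch

def fuse_label_codes(pinyin_code, radical_code, backtrans_code):
    # Splice the three codes into one vector per position.
    spliced = torch.cat([pinyin_code, radical_code, backtrans_code], dim=-1)
    # Pairwise similarities, expressed as a probability distribution.
    scores = spliced @ spliced.transpose(0, 1) / math.sqrt(spliced.size(-1))
    probs = torch.softmax(scores, dim=-1)
    return probs @ spliced                       # the first vector
```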
In this embodiment, encoding the pinyin text, the radical text, and the back-translation text of the text to be processed expresses the text fully, further improving the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of inputting the target encoding vector to the first network layer of the preset extraction model and calculating to obtain the first network vector corresponding to the text to be processed includes:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed, and calculating a forward hidden vector;
inputting the target coding vector into the backward long short-term memory network in the reverse order of the text to be processed, and calculating a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
In this embodiment, the first network layer includes a forward long short-term memory network and a backward long short-term memory network, each containing three gate structures: a forget gate, a memory gate, and an output gate. The target coding vector is fed, in the forward order of the text to be processed, into the forget gate of the forward network, whose output is calculated from the input target coding vector and the hidden state of the previous moment; the memory gate combines the temporary cell state with the cell state of the previous moment to obtain the cell state of the current moment, which is passed to the output gate to calculate the forward hidden vector. Likewise, the target coding vector is fed into the backward network in the reverse order of the text to be processed, and the backward hidden vector is calculated in the same way as in the forward network. When the forward and backward hidden vectors are obtained, the vectors at their corresponding positions are spliced to obtain the first network vector.
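Written out explicitly, one time step of the gate arithmetic described above might look as follows; the stacked weight layout is an assumption, and in practice nn.LSTM encapsulates all of this.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked weights of the four gates
    (an assumed layout): W (d_in, 4h), U (h, 4h), b (4h,)."""
    gates = x @ W + h_prev @ U + b
    f, i, o, g = gates.chunk(4, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)  # forget, memory, output gates
    c_tmp = torch.tanh(g)              # temporary cell state
    c = f * c_prev + i * c_tmp         # cell state at the current moment
    h = o * torch.tanh(c)              # hidden vector passed to the next step
    return h, c
```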
In this embodiment, the forward and backward long short-term memory networks perform feature calculation on the target coding vector, so the semantic features of the text context are represented accurately and the extracted text is more precise.
In some optional implementation manners of this embodiment, before the step of obtaining the text to be processed and the preset extraction model, the method further includes:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
In this embodiment, before the preset extraction model is obtained, a basic extraction model must be constructed and trained to produce the preset extraction model. Specifically, several groups of corpus texts and the real extracted texts corresponding to them are collected and labeled, and the labeled corpus texts and real extracted texts together serve as the labeled text. A basic extraction model is constructed with the same structure as the preset extraction model, i.e., both comprise a coding layer, a first network layer, a second network layer, and a discrimination layer, and they differ only in their parameters. The labeled text is input into the basic extraction model and its loss function is calculated; the parameters of the basic extraction model are adjusted according to the loss function until the loss calculated after some adjustment converges. The basic extraction model is then considered trained, and the trained model is the preset extraction model.
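An illustrative training loop under these rules might look as follows; the optimizer choice and the convergence test (loss change below a tolerance) are assumptions, since the text only states that training stops once the loss function converges.

```python
import torch

def train(model, batches, loss_fn, epochs=50, tol=1e-4):
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev = float("inf")
    for _ in range(epochs):
        total = 0.0
        for labeled_text, real_extraction in batches:
            optim.zero_grad()
            loss = loss_fn(model(labeled_text), real_extraction)
            loss.backward()              # adjust parameters via the loss
            optim.step()
            total += float(loss)
        if abs(prev - total) < tol:      # loss has converged
            break
        prev = total
    return model                         # the preset extraction model
```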
In this embodiment, the basic extraction model is trained in advance, so that text can be extracted accurately by the trained model, improving both the efficiency and the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of labeling the real extracted text and the corpus text to obtain a labeled text includes:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
In this embodiment, when a real extracted text is obtained, it is segmented into words, and the position of each segmented word within the real extracted text is obtained; positions comprise a starting position, a middle position, and an ending position. The segmented words of each real extracted text are labeled according to these starting, middle, and ending positions, and the labeled words form the first sub-text. Meanwhile, the preset label of the corpus text is obtained and the corpus text is labeled as the second sub-text according to that label; the first sub-text and the second sub-text combine into the labeled text.
Taking a medical drug instruction text as an example, the corpus text is: "This product is a white tablet, a light blue or light green tablet with added colorant, or a film-coated tablet. It can be used to treat duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. For duodenal ulcer or pathological hypersecretion states, take 0.2-0.4 g at a time, 4 times a day after meals and before sleep, or 0.8 g once before sleep." The real extracted text corresponding to this corpus text consists of the extracted drug indications: duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. In "duodenal ulcer" (十二指肠溃疡), the first character "十" ("twelve") occupies the starting position and the last character "疡" ("ulcer") the ending position, so "十" is labeled "B-indication starting position", "二" is labeled "M-indication middle position", ..., and "疡" is labeled "E-indication ending position"; the corpus text outside the extracted spans is labeled "O-other".
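A sketch of this position labeling: the first character of each real extracted span receives a B (begin) tag, inner characters M (middle), the last character E (end), and all remaining text O (other). Character-level tagging and the short tag strings are assumptions consistent with the example above.

```python
def label_corpus(corpus, span):
    """Tag each character of `corpus` with B/M/E for `span`, O elsewhere."""
    tags = ["O-other"] * len(corpus)
    start = corpus.find(span)
    if start >= 0:
        end = start + len(span) - 1
        tags[start] = "B-indication"             # starting position
        for i in range(start + 1, end):
            tags[i] = "M-indication"             # middle positions
        tags[end] = "E-indication"               # ending position
    return list(zip(corpus, tags))

# e.g. label_corpus("used to treat duodenal ulcer", "duodenal ulcer")
```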
In this embodiment, the real extracted text and the corpus text are labeled, so that when text is extracted by the preset extraction model it can be extracted accurately according to the labeled text, improving the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of inputting the labeled text into the basic extraction model and calculating the loss function includes:
inputting the labeled text into the basic extraction model to obtain a predicted extracted text of the labeled text;
and calculating the loss function of the basic extraction model according to the predicted extracted text and the real extracted text.
In this embodiment, when the labeled text is obtained, it is input into the basic extraction model. The first time the labeled text is input, the parameters of the basic extraction model are the initial preset parameters. The basic extraction model performs text extraction on the labeled text, i.e., it predicts the extracted text corresponding to the labeled text, producing a predicted extracted text. A loss function is then calculated from the predicted extracted text and the real extracted text corresponding to the labeled text; the loss function measures the difference between the real result and the predicted result, which determines how much the model must be adjusted each time. The parameters of the basic extraction model are adjusted according to the loss function, and after each adjustment the labeled text is input again and the loss recalculated, until the loss calculated by the adjusted basic extraction model converges; the adjusted basic extraction model is then determined to be the preset extraction model.
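As a sketch, the loss for one labeled text could be computed as below, assuming per-token tag scores from the model; the text does not fix a particular loss function (with a CRF discrimination layer, the negative log-likelihood of the real tag sequence is the usual choice, and cross-entropy stands in for it here).

```python
import torch
import torch.nn.functional as F

def extraction_loss(predicted_scores, real_tags):
    """predicted_scores: (seq_len, n_tags) per-token tag scores;
    real_tags: (seq_len,) tag ids of the real extracted text."""
    return F.cross_entropy(predicted_scores, real_tags)
```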
This embodiment adjusts the model parameters through the loss function, which improves both the training efficiency of the model and its prediction accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text extraction apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 3, the text extraction apparatus 300 according to the present embodiment includes: an acquisition module 301, a first calculation module 302, a second calculation module 303, and a confirmation module 304. Wherein:
an obtaining module 301, configured to obtain a to-be-processed text and a preset extraction model, input the to-be-processed text to the preset extraction model, and encode the to-be-processed text according to a coding layer of the preset extraction model to obtain a target coding vector;
in some optional implementations of this embodiment, the obtaining module 301 includes:
the encoding unit is used for acquiring label information of the text to be processed, encoding the label information according to the encoding layer to obtain a first vector, and encoding words of the text to be processed according to the encoding layer to obtain a second vector;
and the splicing unit is used for splicing the first vector and the second vector to obtain the target coding vector.
In some optional implementations of this embodiment, the encoding unit includes:
the acquiring subunit is used for acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
the coding subunit is used for coding the pinyin text, the radical text, and the back-translation text separately according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and the calculation subunit is used for performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
In this embodiment, the text to be processed is the text from which information must be recognized and extracted. For drug-indication extraction, for example, the indications corresponding to a drug must be extracted from the drug instruction sheet, and that instruction sheet is the text to be processed. The preset extraction model is an extraction model trained in advance; it comprises a coding layer, a first network layer, a second network layer, and a discrimination layer. When the text to be processed is obtained, it is encoded by the coding layer of the preset extraction model to obtain the corresponding target coding vector. The target coding vector can be obtained with the encoder structure of BERT (a pre-trained language model): specifically, position coding, classification coding, and embedding coding are performed on the text to be processed to obtain a first, a second, and a third coding result, and the three coding results are spliced to obtain the target coding vector.
A first calculating module 302, configured to input the target coding vector to a first network layer of the preset extraction model, and calculate to obtain a first network vector corresponding to the text to be processed;
in some optional implementations of this embodiment, the first calculating module 302 includes:
the first calculation unit is used, where the first network layer comprises a forward long short-term memory network and a backward long short-term memory network, for inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed and calculating a forward hidden vector;
the second calculation unit is used for inputting the target coding vector into the backward long-short term memory network according to the reverse order of the text to be processed, and calculating to obtain a backward hidden vector;
and the third calculating unit is used for splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
In this embodiment, the first network layer adopts a bidirectional long short-term memory (BiLSTM) network composed of a forward long short-term memory network and a backward long short-term memory network. The target coding vector is fed into the forward network in the forward order of the text to be processed to obtain a forward hidden vector, and into the backward network in the reverse order of the text to be processed to obtain a backward hidden vector; the forward and backward hidden vectors are spliced to obtain the first network vector. The first network vector contains all forward and backward information, so the semantic features of the text context can be represented accurately.
A second calculating module 303, configured to input the first network vector to a second network layer of the preset extraction model, calculate to obtain a target feature matrix, and perform self-attention calculation on the target feature matrix to obtain a target feature vector;
in this embodiment, when a first network vector is obtained, the first network vector is input to a second network layer, and feature extraction is performed on the first network vector according to the second network layer, so as to obtain a target extraction matrix. Specifically, the second network layer may perform feature extraction on the first network vector by using a neural network such as a convolutional neural network or a gated convolutional network. Taking a gated convolutional network (GCNN) as an example, when a first network vector is obtained, performing parallel convolution on the first network vector according to the gated convolutional network to obtain a first convolution result and a second convolution result; selecting one convolution result, inputting the selected convolution result into a preset activation function (sigmoid function), and calculating to obtain a gating calculation result; and multiplying the gating calculation result with another group of convolution results which are not calculated by the activation function, and calculating to obtain a target characteristic matrix. And when the target characteristic matrix is obtained, performing self-attention calculation on the target characteristic matrix to obtain a target characteristic vector. When the indications of the medicine are extracted from the medicine specification, the target feature vector is the feature vector corresponding to the medicine specification.
The confirming module 304 is configured to input the target feature vector into the discrimination layer of the preset extraction model, calculate the optimal labeling sequence corresponding to the text to be processed, obtain the entity information corresponding to the optimal labeling sequence, and determine the entity information as the target extracted text of the text to be processed.
In this embodiment, when the target feature vector is obtained, it is input into the discrimination layer of the preset extraction model, which may adopt a discriminative model such as a conditional random field (CRF) or a hidden Markov model (HMM). Taking the CRF, an undirected graphical model, as an example: the target feature vector is scored by the CRF, and for each word the label with the highest score is selected; these labels combine into the optimal labeling sequence. For example, for w0 the label "B-drug" with a score of 1.5 is selected, 1.5 being w0's highest score and "B-drug" therefore its highest-scoring label. When the optimal labeling sequence is obtained, the entity information corresponding to it is collected and taken as the target extracted text of the text to be processed.
It should be emphasized that, to further ensure the privacy and security of the target extracted text, it may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some optional implementations of this embodiment, the text extraction apparatus 300 further includes:
the labeling module is used for acquiring a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
the construction module is used for constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and the adjusting module is used for adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
In some optional implementations of this embodiment, the labeling module includes:
the word segmentation unit is used for segmenting the real extracted text to obtain segmented words and acquiring the position of each segmented word in the real extracted text;
the first labeling unit is used for labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and the second labeling unit is used for acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text into the labeled text.
In some optional implementations of this embodiment, the construction module includes:
the prediction unit is used for inputting the labeled text into the basic extraction model to obtain a predicted extraction text of the labeled text;
and the fourth calculation unit is used for calculating a loss function of the basic extraction model according to the predicted extraction text and the real extraction text.
In this embodiment, before the preset extraction model is obtained, a basic extraction model needs to be constructed and trained to obtain the preset extraction model. Specifically, a plurality of groups of corpus texts and the real extracted texts corresponding to the corpus texts are collected, the corpus texts and the real extracted texts are labeled, and the labeled corpus texts and real extracted texts are both used as labeled texts. A basic extraction model is constructed; the basic extraction model and the preset extraction model have the same model structure, that is, each comprises a coding layer, a first network layer, a second network layer and a discrimination layer, but their parameters differ. The labeled text is input into the basic extraction model, and the loss function of the basic extraction model is obtained by calculation; the parameters of the basic extraction model are adjusted according to the loss function until the loss function calculated after a parameter adjustment converges, at which point the training of the basic extraction model is determined to be complete, and the trained basic extraction model is the preset extraction model.
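A schematic training loop consistent with this description might look as follows; the optimizer, learning rate, convergence tolerance, and the assumption that the model's forward pass returns the loss directly (as CRF-topped sequence labelers commonly do) are all illustrative choices, not details fixed by the patent.

```python
import torch

def train_basic_extraction_model(model, labeled_batches, lr=1e-3,
                                 tol=1e-4, max_epochs=50):
    """Sketch of the training procedure described above.

    `model` is assumed to map a batch of labeled text to a scalar loss;
    the optimizer and the convergence criterion are illustrative.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in labeled_batches:
            optimizer.zero_grad()
            loss = model(batch)      # loss of predicted vs. real extracted text
            loss.backward()          # gradients used to adjust the parameters
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:
            break                    # loss has converged; training is complete
        prev_loss = epoch_loss
    return model  # the trained basic model serves as the preset extraction model
```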
The text extraction device provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to one another via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components need to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device, or the like.
The memory 61 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) equipped on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing the operating system installed in the computer device 6 and various application software, such as the computer readable instructions of the text extraction method. Furthermore, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the computer readable instructions stored in the memory 61 or to process data, for example to execute the computer readable instructions of the text extraction method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer device provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the text extraction method described above.
The computer-readable storage medium provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely illustrative of some, but not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
Claims (10)
1. A method for extracting text, comprising the steps of:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector to a discrimination layer of the preset extraction model, calculating to obtain an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extraction text of the text to be processed.
2. The method according to claim 1, wherein the step of encoding the text to be processed according to the encoding layer of the preset extraction model comprises:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
3. The method according to claim 2, wherein the step of obtaining the tag information of the text to be processed and encoding the tag information according to the encoding layer to obtain the first vector comprises:
acquiring a pinyin text, a radical text and a back-translation text of the text to be processed, and taking the pinyin text, the radical text and the back-translation text as the label information;
coding the pinyin text, the radical text and the back-translation text according to the coding layer to obtain pinyin codes, radical codes and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes and the back-translation codes to obtain the first vector.
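For illustration only (this sketch is not part of the claims), a minimal scaled dot-product self-attention over the three tag-information encodings might read as follows; using the encodings directly as query, key, and value, the random placeholder vectors, and the final mean-pooling into a single first vector are all simplifying assumptions.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a stack of encodings.

    X: (n, d) matrix whose rows stand in for the pinyin, radical, and
    back-translation encodings. Learned query/key/value projections are
    omitted for brevity; the rows are used directly.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ X                                # attention-weighted mix

rng = np.random.default_rng(0)
pinyin_code, radical_code, backtrans_code = rng.normal(size=(3, 8))
attended = self_attention(np.stack([pinyin_code, radical_code, backtrans_code]))
first_vector = attended.mean(axis=0)   # pool the three rows into one vector
print(first_vector.shape)              # (8,)
```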
4. The method according to claim 1, wherein the step of inputting the target coding vector to a first network layer of the preset extraction model and calculating a first network vector corresponding to the text to be processed comprises:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; the target coding vector is input into the forward long short-term memory network according to the positive order of the text to be processed, and a forward hidden vector is obtained by calculation;
inputting the target coding vector into the backward long short-term memory network according to the reverse order of the text to be processed, and calculating to obtain a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
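Again purely as an illustration (not part of the claims), PyTorch's bidirectional LSTM performs exactly this forward-order/reverse-order pair of passes and splices the two hidden vectors; the dimensions below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative dimensions for the target coding vectors.
coding_dim, hidden_dim, seq_len = 16, 32, 5
target_coding_vector = torch.randn(1, seq_len, coding_dim)  # (batch, seq, dim)

# bidirectional=True runs a forward LSTM over the positive order and a
# backward LSTM over the reverse order, then concatenates (splices) the
# forward and backward hidden vectors along the feature dimension.
bilstm = nn.LSTM(coding_dim, hidden_dim, batch_first=True, bidirectional=True)
first_network_vector, _ = bilstm(target_coding_vector)
print(first_network_vector.shape)  # torch.Size([1, 5, 64]) = 2 * hidden_dim
```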
5. The method for extracting text according to claim 1, further comprising, before the step of obtaining the text to be processed and the preset extraction model:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the training of the basic extraction model is completed when the loss function converges, and taking the trained basic extraction model as the preset extraction model.
6. The method according to claim 5, wherein the step of labeling the real extracted text and the corpus text to obtain a labeled text comprises:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling each segmented word according to whether its position is a starting position, a middle position, or an ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
7. The method of claim 5, wherein the step of inputting the labeled text into the basic extraction model and calculating the loss function comprises:
inputting the labeled text to the basic extraction model to obtain a predicted extraction text of the labeled text;
and calculating a loss function of the basic extraction model according to the predicted extraction text and the real extraction text.
8. A text extraction apparatus, comprising:
the acquisition module is used for acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
the first calculation module is used for inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed;
the second calculation module is used for inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
and the confirming module is used for inputting the target feature vector to a discrimination layer of the preset extraction model, calculating to obtain an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extraction text of the text to be processed.
9. A computer device, comprising a memory and a processor, the memory having computer readable instructions stored therein, wherein the processor, when executing the computer readable instructions, implements the steps of the text extraction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the text extraction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111015728.8A CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111015728.8A CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113657104A (en) | 2021-11-16 |
Family
ID=78482608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111015728.8A Pending CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657104A (en) |
2021-08-31 CN CN202111015728.8A patent/CN113657104A/en active Pending
Patent Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1414453A (en) * | 2002-04-06 | 2003-04-30 | 龚学胜 | Chinese language phonetic transcription, single spelling input unified scheme and intelligent transition translation |
JP2010262117A (en) * | 2009-05-01 | 2010-11-18 | Canon Inc | Information processing device, information processing method, program, and storage medium |
CN108304911A (en) * | 2018-01-09 | 2018-07-20 | 中国科学院自动化研究所 | Knowledge Extraction Method and system based on Memory Neural Networks and equipment |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020107878A1 (en) * | 2018-11-30 | 2020-06-04 | 平安科技(深圳)有限公司 | Method and apparatus for generating text summary, computer device and storage medium |
CN110263323A (en) * | 2019-05-08 | 2019-09-20 | 清华大学 | Keyword abstraction method and system based on the long Memory Neural Networks in short-term of fence type |
CN110532381A (en) * | 2019-07-15 | 2019-12-03 | 中国平安人寿保险股份有限公司 | A kind of text vector acquisition methods, device, computer equipment and storage medium |
US10916242B1 (en) * | 2019-08-07 | 2021-02-09 | Nanjing Silicon Intelligence Technology Co., Ltd. | Intent recognition method based on deep learning network |
WO2021027125A1 (en) * | 2019-08-12 | 2021-02-18 | 平安科技(深圳)有限公司 | Sequence labeling method and apparatus, computer device and storage medium |
CN110598213A (en) * | 2019-09-06 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and storage medium |
WO2021056709A1 (en) * | 2019-09-24 | 2021-04-01 | 平安科技(深圳)有限公司 | Method and apparatus for recognizing similar questions, computer device and storage medium |
CN111222317A (en) * | 2019-10-16 | 2020-06-02 | 平安科技(深圳)有限公司 | Sequence labeling method, system and computer equipment |
WO2021072852A1 (en) * | 2019-10-16 | 2021-04-22 | 平安科技(深圳)有限公司 | Sequence labeling method and system, and computer device |
CN111160017A (en) * | 2019-12-12 | 2020-05-15 | 北京文思海辉金信软件有限公司 | Keyword extraction method, phonetics scoring method and phonetics recommendation method |
CN111144110A (en) * | 2019-12-27 | 2020-05-12 | 科大讯飞股份有限公司 | Pinyin marking method, device, server and storage medium |
CN111814466A (en) * | 2020-06-24 | 2020-10-23 | 平安科技(深圳)有限公司 | Information extraction method based on machine reading understanding and related equipment thereof |
CN111783471A (en) * | 2020-06-29 | 2020-10-16 | 中国平安财产保险股份有限公司 | Semantic recognition method, device, equipment and storage medium of natural language |
WO2021151292A1 (en) * | 2020-08-28 | 2021-08-05 | 平安科技(深圳)有限公司 | Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium |
CN112085091A (en) * | 2020-09-07 | 2020-12-15 | 中国平安财产保险股份有限公司 | Artificial intelligence-based short text matching method, device, equipment and storage medium |
CN112069319A (en) * | 2020-09-10 | 2020-12-11 | 杭州中奥科技有限公司 | Text extraction method and device, computer equipment and readable storage medium |
CN112417886A (en) * | 2020-11-20 | 2021-02-26 | 平安普惠企业管理有限公司 | Intention entity information extraction method and device, computer equipment and storage medium |
CN112487807A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolution neural network |
CN112699213A (en) * | 2020-12-23 | 2021-04-23 | 平安普惠企业管理有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
CN112906395A (en) * | 2021-03-26 | 2021-06-04 | 平安科技(深圳)有限公司 | Drug relationship extraction method, device, equipment and storage medium |
CN112949320A (en) * | 2021-03-30 | 2021-06-11 | 平安科技(深圳)有限公司 | Sequence labeling method, device, equipment and medium based on conditional random field |
CN113255320A (en) * | 2021-05-13 | 2021-08-13 | 北京熙紫智数科技有限公司 | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism |
CN113268953A (en) * | 2021-07-15 | 2021-08-17 | 中国平安人寿保险股份有限公司 | Text key word extraction method and device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
沈华东,等: "AM-BRNN: 一种基于深度学习的文本摘要自动抽取模型", 小型微型计算机系统, vol. 39, no. 6, 31 December 2018 (2018-12-31), pages 1184 - 1189 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114328939A (en) * | 2022-03-17 | 2022-04-12 | 天津思睿信息技术有限公司 | Natural language processing model construction method based on big data |
CN114328939B (en) * | 2022-03-17 | 2022-05-27 | 天津思睿信息技术有限公司 | Natural language processing model construction method based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112685565B (en) | Text classification method based on multi-mode information fusion and related equipment thereof | |
CN112863683A (en) | Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium | |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language | |
CN113947095B (en) | Multilingual text translation method, multilingual text translation device, computer equipment and storage medium | |
CN112860919B (en) | Data labeling method, device, equipment and storage medium based on generation model | |
CN111767375A (en) | Semantic recall method and device, computer equipment and storage medium | |
CN112632278A (en) | Labeling method, device, equipment and storage medium based on multi-label classification | |
CN112188311B (en) | Method and apparatus for determining video material of news | |
CN112287069A (en) | Information retrieval method and device based on voice semantics and computer equipment | |
CN112836521A (en) | Question-answer matching method and device, computer equipment and storage medium | |
CN112598039B (en) | Method for obtaining positive samples in NLP (non-linear liquid) classification field and related equipment | |
CN112949320B (en) | Sequence labeling method, device, equipment and medium based on conditional random field | |
CN112232052B (en) | Text splicing method, text splicing device, computer equipment and storage medium | |
CN112199954B (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
CN112699213A (en) | Speech intention recognition method and device, computer equipment and storage medium | |
CN111353311A (en) | Named entity identification method and device, computer equipment and storage medium | |
CN113657105A (en) | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement | |
CN111368551A (en) | Method and device for determining event subject | |
CN115438149A (en) | End-to-end model training method and device, computer equipment and storage medium | |
CN114780701B (en) | Automatic question-answer matching method, device, computer equipment and storage medium | |
CN115757731A (en) | Dialogue question rewriting method, device, computer equipment and storage medium | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN118193668A (en) | Text entity relation extraction method and device | |
CN113657104A (en) | Text extraction method and device, computer equipment and storage medium | |
CN112417886A (en) | Intention entity information extraction method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220607 Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd. Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001 Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd. |