CN113657104A - Text extraction method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN113657104A (application CN202111015728.8A)
- Authority
- CN
- China
- Prior art keywords: text, vector, processed, extraction model, coding
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/279: Electric digital data processing; handling natural language data; natural language analysis; recognition of textual entities
- G06F40/30: Electric digital data processing; handling natural language data; semantic analysis
- G06N3/044: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; recurrent networks, e.g. Hopfield networks
- G06N3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
The embodiments of this application belong to the field of artificial intelligence and are applied to the medical field. They relate to a text extraction method comprising: obtaining a text to be processed and a preset extraction model, inputting the text to be processed into the preset extraction model, and encoding the text to be processed with the model's coding layer to obtain a target coding vector; inputting the target coding vector into a first network layer and calculating a first network vector; inputting the first network vector into a second network layer, calculating a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector; and inputting the target feature vector into a discrimination layer, calculating an optimal labeling sequence, and obtaining the entity information corresponding to the optimal labeling sequence as the target extracted text. The application also provides a text extraction device, a computer device, and a storage medium, and further relates to blockchain technology: the target extracted text can be stored on a blockchain. The method and device achieve accurate extraction of text.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text extraction method and apparatus, a computer device, and a storage medium.
Background
Consider a piece of medical text such as a drug instruction sheet, from which the indications corresponding to the drug must be extracted. For example: "This product is a white tablet, a light blue or light green tablet with added colorant, or a film-coated tablet. It can be used to treat duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. For duodenal ulcer or pathological hypersecretion states, take 0.2-0.4 g at a time, 4 times a day after meals and before sleep, or 0.8 g once before sleep." From this text, 5 indications should be extracted: duodenal ulcer; gastric ulcer; reflux esophagitis; stress ulcer; Zollinger-Ellison syndrome.
At present, however, drug indications are identified mainly by collecting indication names into an indication dictionary, reading the drug instruction sheet, matching it against the indication dictionary under a maximum-matching rule, and outputting the matched indications. With this method only indications already present in the dictionary can be identified; new indications cannot be recognized, so the generalization ability is too poor and the recognition ability is low.
Disclosure of Invention
The embodiments of this application aim to provide a text extraction method and device, a computer device, and a storage medium, so as to solve the technical problem that text extraction is not accurate enough.
In order to solve the above technical problem, an embodiment of the present application provides a text extraction method, which adopts the following technical solutions:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
Further, the step of encoding the text to be processed according to the coding layer of the preset extraction model includes:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
Further, the step of obtaining the label information of the text to be processed and encoding the label information according to the coding layer to obtain a first vector includes:
acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
coding the pinyin text, the radical text, and the back-translation text according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
Further, the step of inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed includes:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed, and calculating a forward hidden vector;
inputting the target coding vector into the backward long short-term memory network in the reverse order of the text to be processed, and calculating a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
Further, before the step of obtaining the text to be processed and the preset extraction model, the method further comprises:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
Further, the step of labeling the real extracted text and the corpus text to obtain a labeled text includes:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
Further, the step of inputting the labeled text into the basic extraction model and calculating the loss function includes:
inputting the labeled text into the basic extraction model to obtain a predicted extracted text of the labeled text;
and calculating the loss function of the basic extraction model according to the predicted extracted text and the real extracted text.
In order to solve the above technical problem, an embodiment of the present application further provides a text extraction device, which adopts the following technical solutions:
the acquisition module is used for acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
the first calculation module is used for inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed;
the second calculation module is used for inputting the first network vector into a second network layer of the preset extraction model, calculating a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
and the confirming module is used for inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device comprising a memory and a processor, where the processor, when executing computer readable instructions stored in the memory, implements the following steps:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium storing computer readable instructions which, when executed by a processor, implement the following steps:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector into a discrimination layer of the preset extraction model, calculating an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
According to the above scheme, a text to be processed and a preset extraction model are obtained, the text to be processed is input into the preset extraction model, and the coding layer of the model encodes it into a target coding vector, so the text to be processed can be expressed accurately by the target coding vector. The target coding vector is then input into the first network layer of the preset extraction model, which calculates the first network vector corresponding to the text to be processed; the first network vector extracts and expresses the contextual feature information of the text. Next, the first network vector is input into the second network layer of the preset extraction model, which calculates the target feature matrix, and self-attention over the target feature matrix yields the target feature vector. Finally, the target feature vector is input into the discrimination layer of the preset extraction model, the optimal labeling sequence corresponding to the text to be processed is calculated, and the entity information corresponding to that sequence is determined as the target extracted text. This improves the generalization ability of the model: texts that do not appear in a dictionary can be identified and extracted, the improved model extracts text according to contextual semantic information, the recall rate and accuracy of the model are improved, and the dictionary no longer needs continuous maintenance, which saves resources.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text extraction method according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a text extraction device according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Reference numerals: the text extraction device 300, the acquisition module 301, the first calculation module 302, the second calculation module 303 and the confirmation module 304.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the text extraction method provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the text extraction apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a method of text extraction according to the present application is shown. The text extraction method comprises the following steps:
step S201, acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
In this embodiment, the text to be processed is the text from which information must be recognized and extracted. For drug-indication extraction, for example, the indications corresponding to a drug must be extracted from the drug instruction sheet, and that instruction sheet is the text to be processed. The preset extraction model is an extraction model trained in advance; it comprises a coding layer, a first network layer, a second network layer, and a discrimination layer. When the text to be processed is obtained, it is encoded by the coding layer of the preset extraction model to obtain the corresponding target coding vector. The target coding vector can be obtained with the encoder structure of BERT (a pre-trained language model): specifically, position coding, classification coding, and embedding coding are performed on the text to be processed to obtain a first, a second, and a third coding result, and the three coding results are spliced to obtain the target coding vector.
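A minimal sketch of this encoding step in PyTorch may help. The class name, vocabulary size, and dimensions below are illustrative assumptions; the description only fixes that position coding, classification coding, and embedding coding are produced and then spliced into the target coding vector.

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    """Hypothetical coding layer: three encodings of the input, spliced."""
    def __init__(self, vocab_size=21128, dim=256, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)  # embedding coding
        self.pos_emb = nn.Embedding(max_len, dim)       # position coding
        self.seg_emb = nn.Embedding(2, dim)             # classification coding

    def forward(self, token_ids, seg_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        pos_ids = pos_ids.unsqueeze(0).expand_as(token_ids)
        # Splice the first, second, and third coding results.
        return torch.cat([self.token_emb(token_ids),
                          self.pos_emb(pos_ids),
                          self.seg_emb(seg_ids)], dim=-1)
```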
Step S202, inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
In this embodiment, the first network layer adopts a bidirectional long short-term memory (BiLSTM) network composed of a forward long short-term memory network and a backward long short-term memory network. The target coding vector is fed into the forward network in the forward order of the text to be processed to obtain a forward hidden vector, and into the backward network in the reverse order of the text to be processed to obtain a backward hidden vector; the forward and backward hidden vectors are spliced to obtain the first network vector. The first network vector contains all forward and backward information, so the semantic features of the text context can be represented accurately.
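As a sketch (assuming PyTorch, with illustrative sizes), the whole first network layer reduces to one bidirectional LSTM call: nn.LSTM with bidirectional=True runs the forward and backward passes and concatenates the two hidden vectors at each position, matching the splicing step described above.

```python
import torch
import torch.nn as nn

coding_dim, hidden_dim = 768, 256                  # illustrative sizes
bilstm = nn.LSTM(coding_dim, hidden_dim, batch_first=True, bidirectional=True)

target_coding = torch.randn(1, 50, coding_dim)     # (batch, seq_len, dim)
first_network_vector, _ = bilstm(target_coding)    # (1, 50, 2 * hidden_dim)
```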
Step S203, inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
In this embodiment, when the first network vector is obtained, it is input into the second network layer, which performs feature extraction on it to obtain the target feature matrix. Specifically, the second network layer may use a neural network such as a convolutional neural network or a gated convolutional network. Taking a gated convolutional network (GCNN) as an example: the first network vector is passed through two parallel convolutions to obtain a first convolution result and a second convolution result; one convolution result is fed into a preset activation function (a sigmoid) to obtain a gating result; the gating result is then multiplied with the other convolution result, which did not pass through the activation function, to obtain the target feature matrix. Once the target feature matrix is obtained, self-attention is calculated over it to obtain the target feature vector. When drug indications are extracted from a drug instruction sheet, the target feature vector is the feature vector corresponding to that instruction sheet.
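The following sketch shows the gated convolution and the self-attention step under stated assumptions: the kernel size and dimensions are illustrative, and only the gating formula (a sigmoid branch multiplied element-wise with an ungated branch) and scaled dot-product self-attention follow the description above.

```python
import math
import torch
import torch.nn as nn

class GatedConvSelfAttention(nn.Module):
    def __init__(self, dim=512, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv_a = nn.Conv1d(dim, dim, kernel, padding=pad)  # first convolution
        self.conv_b = nn.Conv1d(dim, dim, kernel, padding=pad)  # second convolution

    def forward(self, x):                      # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                  # Conv1d expects (batch, dim, seq_len)
        gate = torch.sigmoid(self.conv_a(h))   # gating result via the activation
        feat = (gate * self.conv_b(h)).transpose(1, 2)  # target feature matrix
        # Self-attention over the target feature matrix.
        scores = feat @ feat.transpose(1, 2) / math.sqrt(feat.size(-1))
        return torch.softmax(scores, dim=-1) @ feat     # target feature vector
```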
Step S204, inputting the target feature vector into the discrimination layer of the preset extraction model, calculating the optimal labeling sequence corresponding to the text to be processed, acquiring the entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extracted text of the text to be processed.
In this embodiment, when the target feature vector is obtained, it is input into the discrimination layer of the preset extraction model, which may adopt a discriminative model such as a conditional random field (CRF) or a hidden Markov model (HMM). Taking the CRF, an undirected graphical model, as an example: the target feature vector is scored by the CRF, and for each word the label with the highest score is selected; these labels combine into the optimal labeling sequence. For example, for w0 the label "B-drug" with a score of 1.5 is selected, 1.5 being w0's highest score and "B-drug" therefore its highest-scoring label. When the optimal labeling sequence is obtained, the entity information corresponding to it is collected and taken as the target extracted text of the text to be processed.
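A compact Viterbi decode illustrates how such an optimal labeling sequence can be computed, assuming the discrimination layer is a CRF with per-word emission scores (the target feature vector projected to tag scores) and a learned transition matrix; both input shapes are assumptions the text does not fix.

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) scores; transitions: (n_tags, n_tags)."""
    seq_len, n_tags = emissions.shape
    score = emissions[0]              # best score ending in each tag so far
    history = []
    for t in range(1, seq_len):
        # score[i] + transitions[i][j] + emissions[t][j], maximized over i
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        history.append(best_prev)
    path = [int(score.argmax())]
    for best_prev in reversed(history):          # backtrack the best path
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))                  # optimal labeling sequence
```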
It should be emphasized that, to further ensure the privacy and security of the target extracted text, it may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The method and device provided here improve the generalization ability of the model: texts that do not appear in a dictionary can also be recognized and extracted, and the improved model can recognize and extract text according to contextual semantic information, improving the recall rate and accuracy of the model; the dictionary no longer needs continuous maintenance, which saves resources.
In some optional implementation manners of this embodiment, the step of encoding the text to be processed according to the coding layer of the preset extraction model includes:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
In this embodiment, when the text to be processed is encoded by the coding layer, its label information is obtained. Label information is an alternative representation of the text to be processed, such as its pinyin or its radicals. The label information is encoded by the coding layer to obtain the first vector, and the words of the text to be processed are encoded by the coding layer to obtain the second vector. The first and second vectors are then spliced into the target coding vector.
In this embodiment, encoding both the label information and the words of the text to be processed expresses the text accurately, so the calculated target coding vector reflects the text's features more precisely.
In some optional implementation manners of this embodiment, the step of obtaining the label information of the text to be processed and encoding the label information according to the coding layer to obtain the first vector includes:
acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
coding the pinyin text, the radical text, and the back-translation text according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
In this embodiment, to encode and represent the text to be processed more accurately, the coding layer obtains its pinyin text, radical text, and back-translation text, which together form the label information of the text to be processed. The pinyin text, radical text, and back-translation text are encoded separately by the coding layer into the corresponding pinyin codes, radical codes, and back-translation codes, and self-attention is then calculated over these codes to obtain the first vector. Specifically, the pinyin codes, radical codes, and back-translation codes are spliced into one vector; self-attention over the spliced vector calculates the similarities between the vectors and expresses them as a probability distribution; finally, the probability distribution is multiplied with the coding matrices to obtain the first vector.
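A sketch of this fusion follows, assuming each code is a (seq_len, dim) matrix produced by the coding layer; the softmax similarity and the final multiplication follow the description above, while the exact shapes and scaling are assumptions.

```python
import math
import torch

def fuse_label_codes(pinyin_code, radical_code, backtrans_code):
    # Splice the three codes into one vector per position.
    spliced = torch.cat([pinyin_code, radical_code, backtrans_code], dim=-1)
    # Pairwise similarities, expressed as a probability distribution.
    scores = spliced @ spliced.transpose(0, 1) / math.sqrt(spliced.size(-1))
    probs = torch.softmax(scores, dim=-1)
    return probs @ spliced                       # the first vector
```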
In this embodiment, encoding the pinyin text, the radical text, and the back-translation text of the text to be processed expresses the text fully, further improving the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of inputting the target encoding vector to the first network layer of the preset extraction model and calculating to obtain the first network vector corresponding to the text to be processed includes:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed, and calculating a forward hidden vector;
inputting the target coding vector into the backward long short-term memory network in the reverse order of the text to be processed, and calculating a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
In this embodiment, the first network layer includes a forward long short-term memory network and a backward long short-term memory network, each containing three gate structures: a forget gate, a memory gate, and an output gate. The target coding vector is fed, in the forward order of the text to be processed, into the forget gate of the forward network, whose output is calculated from the input target coding vector and the hidden state of the previous moment; the memory gate combines the temporary cell state with the cell state of the previous moment to obtain the cell state of the current moment, which is passed to the output gate to calculate the forward hidden vector. Likewise, the target coding vector is fed into the backward network in the reverse order of the text to be processed, and the backward hidden vector is calculated in the same way as in the forward network. When the forward and backward hidden vectors are obtained, the vectors at their corresponding positions are spliced to obtain the first network vector.
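Written out explicitly, one time step of the gate arithmetic described above might look as follows; the stacked weight layout is an assumption, and in practice nn.LSTM encapsulates all of this.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked weights of the four gates
    (an assumed layout): W (d_in, 4h), U (h, 4h), b (4h,)."""
    gates = x @ W + h_prev @ U + b
    f, i, o, g = gates.chunk(4, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)  # forget, memory, output gates
    c_tmp = torch.tanh(g)              # temporary cell state
    c = f * c_prev + i * c_tmp         # cell state at the current moment
    h = o * torch.tanh(c)              # hidden vector passed to the next step
    return h, c
```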
In this embodiment, the forward and backward long short-term memory networks perform feature calculation on the target coding vector, so the semantic features of the text context are represented accurately and the extracted text is more precise.
In some optional implementation manners of this embodiment, before the step of obtaining the text to be processed and the preset extraction model, the method further includes:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
In this embodiment, before the preset extraction model is obtained, a basic extraction model must be constructed and trained to produce the preset extraction model. Specifically, several groups of corpus texts and the real extracted texts corresponding to them are collected and labeled, and the labeled corpus texts and real extracted texts together serve as the labeled text. A basic extraction model is constructed with the same structure as the preset extraction model, i.e., both comprise a coding layer, a first network layer, a second network layer, and a discrimination layer, and they differ only in their parameters. The labeled text is input into the basic extraction model and its loss function is calculated; the parameters of the basic extraction model are adjusted according to the loss function until the loss calculated after some adjustment converges. The basic extraction model is then considered trained, and the trained model is the preset extraction model.
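An illustrative training loop under these rules might look as follows; the optimizer choice and the convergence test (loss change below a tolerance) are assumptions, since the text only states that training stops once the loss function converges.

```python
import torch

def train(model, batches, loss_fn, epochs=50, tol=1e-4):
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    prev = float("inf")
    for _ in range(epochs):
        total = 0.0
        for labeled_text, real_extraction in batches:
            optim.zero_grad()
            loss = loss_fn(model(labeled_text), real_extraction)
            loss.backward()              # adjust parameters via the loss
            optim.step()
            total += float(loss)
        if abs(prev - total) < tol:      # loss has converged
            break
        prev = total
    return model                         # the preset extraction model
```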
In this embodiment, the basic extraction model is trained in advance, so that text can be extracted accurately by the trained model, improving both the efficiency and the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of labeling the real extracted text and the corpus text to obtain a labeled text includes:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
In this embodiment, when a real extracted text is obtained, it is segmented into words, and the position of each segmented word within the real extracted text is obtained; positions comprise a starting position, a middle position, and an ending position. The segmented words of each real extracted text are labeled according to these starting, middle, and ending positions, and the labeled words form the first sub-text. Meanwhile, the preset label of the corpus text is obtained and the corpus text is labeled as the second sub-text according to that label; the first sub-text and the second sub-text combine into the labeled text.
Taking a medical drug instruction text as an example, the corpus text is: "This product is a white tablet, a light blue or light green tablet with added colorant, or a film-coated tablet. It can be used to treat duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. For duodenal ulcer or pathological hypersecretion states, take 0.2-0.4 g at a time, 4 times a day after meals and before sleep, or 0.8 g once before sleep." The real extracted text corresponding to this corpus text consists of the extracted drug indications: duodenal ulcer, gastric ulcer, reflux esophagitis, stress ulcer, and Zollinger-Ellison syndrome. In "duodenal ulcer" (十二指肠溃疡), the first character "十" ("twelve") occupies the starting position and the last character "疡" ("ulcer") the ending position, so "十" is labeled "B-indication starting position", "二" is labeled "M-indication middle position", ..., and "疡" is labeled "E-indication ending position"; the corpus text outside the extracted spans is labeled "O-other".
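A sketch of this position labeling: the first character of each real extracted span receives a B (begin) tag, inner characters M (middle), the last character E (end), and all remaining text O (other). Character-level tagging and the short tag strings are assumptions consistent with the example above.

```python
def label_corpus(corpus, span):
    """Tag each character of `corpus` with B/M/E for `span`, O elsewhere."""
    tags = ["O-other"] * len(corpus)
    start = corpus.find(span)
    if start >= 0:
        end = start + len(span) - 1
        tags[start] = "B-indication"             # starting position
        for i in range(start + 1, end):
            tags[i] = "M-indication"             # middle positions
        tags[end] = "E-indication"               # ending position
    return list(zip(corpus, tags))

# e.g. label_corpus("used to treat duodenal ulcer", "duodenal ulcer")
```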
In this embodiment, the real extracted text and the corpus text are labeled, so that when text is extracted by the preset extraction model it can be extracted accurately according to the labeled text, improving the accuracy of text extraction.
In some optional implementation manners of this embodiment, the step of inputting the labeled text into the basic extraction model and calculating the loss function includes:
inputting the labeled text into the basic extraction model to obtain a predicted extracted text of the labeled text;
and calculating the loss function of the basic extraction model according to the predicted extracted text and the real extracted text.
In this embodiment, when the labeled text is obtained, it is input into the basic extraction model. The first time the labeled text is input, the parameters of the basic extraction model are the initial preset parameters. The basic extraction model performs text extraction on the labeled text, i.e., it predicts the extracted text corresponding to the labeled text, producing a predicted extracted text. A loss function is then calculated from the predicted extracted text and the real extracted text corresponding to the labeled text; the loss function measures the difference between the real result and the predicted result, which determines how much the model must be adjusted each time. The parameters of the basic extraction model are adjusted according to the loss function, and after each adjustment the labeled text is input again and the loss recalculated, until the loss calculated by the adjusted basic extraction model converges; the adjusted basic extraction model is then determined to be the preset extraction model.
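As a sketch, the loss for one labeled text could be computed as below, assuming per-token tag scores from the model; the text does not fix a particular loss function (with a CRF discrimination layer, the negative log-likelihood of the real tag sequence is the usual choice, and cross-entropy stands in for it here).

```python
import torch
import torch.nn.functional as F

def extraction_loss(predicted_scores, real_tags):
    """predicted_scores: (seq_len, n_tags) per-token tag scores;
    real_tags: (seq_len,) tag ids of the real extracted text."""
    return F.cross_entropy(predicted_scores, real_tags)
```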
This embodiment adjusts the model parameters through the loss function, which improves both the training efficiency of the model and its prediction accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text extraction apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 3, the text extraction apparatus 300 according to the present embodiment includes: an acquisition module 301, a first calculation module 302, a second calculation module 303, and a confirmation module 304. Wherein:
an obtaining module 301, configured to obtain a to-be-processed text and a preset extraction model, input the to-be-processed text to the preset extraction model, and encode the to-be-processed text according to a coding layer of the preset extraction model to obtain a target coding vector;
in some optional implementations of this embodiment, the obtaining module 301 includes:
the encoding unit is used for acquiring label information of the text to be processed, encoding the label information according to the encoding layer to obtain a first vector, and encoding words of the text to be processed according to the encoding layer to obtain a second vector;
and the splicing unit is used for splicing the first vector and the second vector to obtain the target coding vector.
In some optional implementations of this embodiment, the encoding unit includes:
the acquiring subunit is used for acquiring the pinyin text, the radical text, and the back-translation text of the text to be processed, and taking the pinyin text, the radical text, and the back-translation text as the label information;
the coding subunit is used for coding the pinyin text, the radical text, and the back-translation text separately according to the coding layer to obtain pinyin codes, radical codes, and back-translation codes;
and the calculation subunit is used for performing self-attention calculation on the pinyin codes, the radical codes, and the back-translation codes to obtain the first vector.
In this embodiment, the text to be processed is the text from which information must be recognized and extracted. For drug-indication extraction, for example, the indications corresponding to a drug must be extracted from the drug instruction sheet, and that instruction sheet is the text to be processed. The preset extraction model is an extraction model trained in advance; it comprises a coding layer, a first network layer, a second network layer, and a discrimination layer. When the text to be processed is obtained, it is encoded by the coding layer of the preset extraction model to obtain the corresponding target coding vector. The target coding vector can be obtained with the encoder structure of BERT (a pre-trained language model): specifically, position coding, classification coding, and embedding coding are performed on the text to be processed to obtain a first, a second, and a third coding result, and the three coding results are spliced to obtain the target coding vector.
A first calculating module 302, configured to input the target coding vector to a first network layer of the preset extraction model, and calculate to obtain a first network vector corresponding to the text to be processed;
in some optional implementations of this embodiment, the first calculating module 302 includes:
the first calculation unit is used, where the first network layer comprises a forward long short-term memory network and a backward long short-term memory network, for inputting the target coding vector into the forward long short-term memory network in the forward order of the text to be processed and calculating a forward hidden vector;
the second calculation unit is used for inputting the target coding vector into the backward long-short term memory network according to the reverse order of the text to be processed, and calculating to obtain a backward hidden vector;
and the third calculating unit is used for splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
In this embodiment, the first network layer adopts a bidirectional long short-term memory (BiLSTM) network composed of a forward long short-term memory network and a backward long short-term memory network. The target coding vector is fed into the forward network in the forward order of the text to be processed to obtain a forward hidden vector, and into the backward network in the reverse order of the text to be processed to obtain a backward hidden vector; the forward and backward hidden vectors are spliced to obtain the first network vector. The first network vector contains all forward and backward information, so the semantic features of the text context can be represented accurately.
A second calculating module 303, configured to input the first network vector to a second network layer of the preset extraction model, calculate to obtain a target feature matrix, and perform self-attention calculation on the target feature matrix to obtain a target feature vector;
in this embodiment, when a first network vector is obtained, the first network vector is input to a second network layer, and feature extraction is performed on the first network vector according to the second network layer, so as to obtain a target extraction matrix. Specifically, the second network layer may perform feature extraction on the first network vector by using a neural network such as a convolutional neural network or a gated convolutional network. Taking a gated convolutional network (GCNN) as an example, when a first network vector is obtained, performing parallel convolution on the first network vector according to the gated convolutional network to obtain a first convolution result and a second convolution result; selecting one convolution result, inputting the selected convolution result into a preset activation function (sigmoid function), and calculating to obtain a gating calculation result; and multiplying the gating calculation result with another group of convolution results which are not calculated by the activation function, and calculating to obtain a target characteristic matrix. And when the target characteristic matrix is obtained, performing self-attention calculation on the target characteristic matrix to obtain a target characteristic vector. When the indications of the medicine are extracted from the medicine specification, the target feature vector is the feature vector corresponding to the medicine specification.
The confirming module 304 is configured to input the target feature vector into the discrimination layer of the preset extraction model, calculate the optimal labeling sequence corresponding to the text to be processed, obtain the entity information corresponding to the optimal labeling sequence, and determine the entity information as the target extracted text of the text to be processed.
In this embodiment, when the target feature vector is obtained, it is input into the discrimination layer of the preset extraction model, which may adopt a discriminative model such as a conditional random field (CRF) or a hidden Markov model (HMM). Taking the CRF, an undirected graphical model, as an example: the target feature vector is scored by the CRF, and for each word the label with the highest score is selected; these labels combine into the optimal labeling sequence. For example, for w0 the label "B-drug" with a score of 1.5 is selected, 1.5 being w0's highest score and "B-drug" therefore its highest-scoring label. When the optimal labeling sequence is obtained, the entity information corresponding to it is collected and taken as the target extracted text of the text to be processed.
It should be emphasized that, to further ensure the privacy and security of the target extracted text, it may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some optional implementations of this embodiment, the text extraction apparatus 300 further includes:
the labeling module is used for acquiring a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
the construction module is used for constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and the adjusting module is used for adjusting parameters of the basic extraction model according to the loss function, determining that the basic extraction model is trained when the loss function is converged, and taking the trained basic extraction model as the preset extraction model.
In some optional implementations of this embodiment, the labeling module includes:
the word segmentation unit is used for segmenting the real extracted text to obtain segmented words and acquiring the position of each segmented word in the real extracted text;
the first labeling unit is used for labeling the segmented words according to whether each position is a starting, middle, or ending position, to obtain a first sub-text;
and the second labeling unit is used for acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text into the labeled text.
In some optional implementations of this embodiment, the construction module includes:
the prediction unit is used for inputting the labeled text into the basic extraction model to obtain a predicted extraction text of the labeled text;
and the fourth calculation unit is used for calculating a loss function of the basic extraction model according to the predicted extraction text and the real extraction text.
In this embodiment, before the preset extraction model is obtained, a basic extraction model needs to be constructed and trained to obtain the preset extraction model. Specifically, a plurality of groups of corpus texts and the real extracted texts corresponding to the corpus texts are collected, the corpus texts and the real extracted texts are labeled, and the labeled corpus texts and real extracted texts are both used as labeled texts. A basic extraction model is constructed; the basic extraction model and the preset extraction model have the same model structure, that is, each comprises a coding layer, a first network layer, a second network layer and a discrimination layer, but their parameters differ. The labeled text is input into the basic extraction model, and the loss function of the basic extraction model is obtained by calculation; the parameters of the basic extraction model are adjusted according to the loss function until the loss function calculated after a parameter adjustment converges, at which point the training of the basic extraction model is determined to be complete, and the trained basic extraction model is the preset extraction model.
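A schematic training loop consistent with this description might look as follows; the optimizer, learning rate, convergence tolerance, and the assumption that the model's forward pass returns the loss directly (as CRF-topped sequence labelers commonly do) are all illustrative choices, not details fixed by the patent.

```python
import torch

def train_basic_extraction_model(model, labeled_batches, lr=1e-3,
                                 tol=1e-4, max_epochs=50):
    """Sketch of the training procedure described above.

    `model` is assumed to map a batch of labeled text to a scalar loss;
    the optimizer and the convergence criterion are illustrative.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in labeled_batches:
            optimizer.zero_grad()
            loss = model(batch)      # loss of predicted vs. real extracted text
            loss.backward()          # gradients used to adjust the parameters
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:
            break                    # loss has converged; training is complete
        prev_loss = epoch_loss
    return model  # the trained basic model serves as the preset extraction model
```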
The text extraction device provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to one another via a system bus. It is noted that only a computer device 6 having components 61-63 is shown, but it should be understood that not all of the shown components need to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device, or the like.
The memory 61 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) equipped on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device thereof. In this embodiment, the memory 61 is generally used for storing the operating system installed in the computer device 6 and various application software, such as the computer readable instructions of the text extraction method. Furthermore, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute the computer readable instructions stored in the memory 61 or to process data, for example to execute the computer readable instructions of the text extraction method.
The network interface 63 may comprise a wireless network interface or a wired network interface, and the network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer device provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the text extraction method described above.
The computer-readable storage medium provided by this embodiment improves the generalization capability of the model: it can recognize and extract texts that do not appear in the dictionary database, and the improved model can recognize and extract texts according to context semantic information, so that the recall rate and accuracy of the model are improved; moreover, the dictionary database does not need to be continuously maintained, which saves resources.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely illustrative of some, but not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
Claims (10)
1. A method for extracting text, comprising the steps of:
acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
inputting the target coding vector to a first network layer of the preset extraction model, and calculating to obtain a first network vector corresponding to the text to be processed;
inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
inputting the target feature vector to a discrimination layer of the preset extraction model, calculating to obtain an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extraction text of the text to be processed.
2. The method according to claim 1, wherein the step of encoding the text to be processed according to the encoding layer of the preset extraction model comprises:
acquiring label information of the text to be processed, coding the label information according to the coding layer to obtain a first vector, and coding words of the text to be processed according to the coding layer to obtain a second vector;
and splicing the first vector and the second vector to obtain the target coding vector.
3. The method according to claim 2, wherein the step of obtaining the tag information of the text to be processed and encoding the tag information according to the encoding layer to obtain the first vector comprises:
acquiring a pinyin text, a radical text and a back-translation text of the text to be processed, and taking the pinyin text, the radical text and the back-translation text as the label information;
coding the pinyin text, the radical text and the back-translation text according to the coding layer to obtain pinyin codes, radical codes and back-translation codes;
and performing self-attention calculation on the pinyin codes, the radical codes and the back-translation codes to obtain the first vector.
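For illustration only (this sketch is not part of the claims), a minimal scaled dot-product self-attention over the three tag-information encodings might read as follows; using the encodings directly as query, key, and value, the random placeholder vectors, and the final mean-pooling into a single first vector are all simplifying assumptions.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a stack of encodings.

    X: (n, d) matrix whose rows stand in for the pinyin, radical, and
    back-translation encodings. Learned query/key/value projections are
    omitted for brevity; the rows are used directly.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ X                                # attention-weighted mix

rng = np.random.default_rng(0)
pinyin_code, radical_code, backtrans_code = rng.normal(size=(3, 8))
attended = self_attention(np.stack([pinyin_code, radical_code, backtrans_code]))
first_vector = attended.mean(axis=0)   # pool the three rows into one vector
print(first_vector.shape)              # (8,)
```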
4. The method according to claim 1, wherein the step of inputting the target coding vector to a first network layer of the preset extraction model and calculating a first network vector corresponding to the text to be processed comprises:
the first network layer comprises a forward long short-term memory network and a backward long short-term memory network; the target coding vector is input into the forward long short-term memory network according to the positive order of the text to be processed, and a forward hidden vector is obtained by calculation;
inputting the target coding vector into the backward long short-term memory network according to the reverse order of the text to be processed, and calculating to obtain a backward hidden vector;
and splicing the forward hidden vector and the backward hidden vector to obtain the first network vector.
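Again purely as an illustration (not part of the claims), PyTorch's bidirectional LSTM performs exactly this forward-order/reverse-order pair of passes and splices the two hidden vectors; the dimensions below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative dimensions for the target coding vectors.
coding_dim, hidden_dim, seq_len = 16, 32, 5
target_coding_vector = torch.randn(1, seq_len, coding_dim)  # (batch, seq, dim)

# bidirectional=True runs a forward LSTM over the positive order and a
# backward LSTM over the reverse order, then concatenates (splices) the
# forward and backward hidden vectors along the feature dimension.
bilstm = nn.LSTM(coding_dim, hidden_dim, batch_first=True, bidirectional=True)
first_network_vector, _ = bilstm(target_coding_vector)
print(first_network_vector.shape)  # torch.Size([1, 5, 64]) = 2 * hidden_dim
```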
5. The method for extracting text according to claim 1, further comprising, before the step of obtaining the text to be processed and the preset extraction model:
collecting a plurality of groups of corpus texts and real extracted texts corresponding to the corpus texts, and labeling the real extracted texts and the corpus texts to obtain labeled texts;
constructing a basic extraction model, inputting the labeled text into the basic extraction model, and calculating to obtain a loss function;
and adjusting parameters of the basic extraction model according to the loss function, determining that the training of the basic extraction model is completed when the loss function converges, and taking the trained basic extraction model as the preset extraction model.
6. The method according to claim 5, wherein the step of labeling the real extracted text and the corpus text to obtain a labeled text comprises:
performing word segmentation on the real extracted text to obtain segmented words, and acquiring the position of each segmented word in the real extracted text;
labeling each segmented word according to whether its position is a starting position, a middle position, or an ending position, to obtain a first sub-text;
and acquiring a preset label of the corpus text, labeling the corpus text as a second sub-text according to the preset label, and combining the first sub-text and the second sub-text as the labeled text.
7. The method of claim 5, wherein the step of inputting the labeled text into the basic extraction model and calculating the loss function comprises:
inputting the labeled text to the basic extraction model to obtain a predicted extraction text of the labeled text;
and calculating a loss function of the basic extraction model according to the predicted extraction text and the real extraction text.
8. A text extraction apparatus, comprising:
the acquisition module is used for acquiring a text to be processed and a preset extraction model, inputting the text to be processed to the preset extraction model, and coding the text to be processed according to a coding layer of the preset extraction model to obtain a target coding vector;
the first calculation module is used for inputting the target coding vector to a first network layer of the preset extraction model and calculating to obtain a first network vector corresponding to the text to be processed;
the second calculation module is used for inputting the first network vector to a second network layer of the preset extraction model, calculating to obtain a target feature matrix, and performing self-attention calculation on the target feature matrix to obtain a target feature vector;
and the confirming module is used for inputting the target feature vector to a discrimination layer of the preset extraction model, calculating to obtain an optimal labeling sequence corresponding to the text to be processed, acquiring entity information corresponding to the optimal labeling sequence, and determining the entity information as the target extraction text of the text to be processed.
9. A computer device, comprising a memory and a processor, the memory having computer readable instructions stored therein, wherein the processor, when executing the computer readable instructions, implements the steps of the text extraction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the text extraction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111015728.8A CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111015728.8A CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113657104A (en) | 2021-11-16 |
Family
ID=78482608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111015728.8A Pending CN113657104A (en) | 2021-08-31 | 2021-08-31 | Text extraction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657104A (en) |
2021-08-31 CN CN202111015728.8A patent/CN113657104A/en active Pending
Patent Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1414453A (en) * | 2002-04-06 | 2003-04-30 | 龚学胜 | Chinese language phonetic transcription, single spelling input unified scheme and intelligent transition translation |
JP2010262117A (en) * | 2009-05-01 | 2010-11-18 | Canon Inc | Information processing device, information processing method, program, and storage medium |
CN108304911A (en) * | 2018-01-09 | 2018-07-20 | 中国科学院自动化研究所 | Knowledge Extraction Method and system based on Memory Neural Networks and equipment |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
WO2020107878A1 (en) * | 2018-11-30 | 2020-06-04 | 平安科技(深圳)有限公司 | Method and apparatus for generating text summary, computer device and storage medium |
CN110263323A (en) * | 2019-05-08 | 2019-09-20 | 清华大学 | Keyword abstraction method and system based on the long Memory Neural Networks in short-term of fence type |
CN110532381A (en) * | 2019-07-15 | 2019-12-03 | 中国平安人寿保险股份有限公司 | A kind of text vector acquisition methods, device, computer equipment and storage medium |
US10916242B1 (en) * | 2019-08-07 | 2021-02-09 | Nanjing Silicon Intelligence Technology Co., Ltd. | Intent recognition method based on deep learning network |
WO2021027125A1 (en) * | 2019-08-12 | 2021-02-18 | 平安科技(深圳)有限公司 | Sequence labeling method and apparatus, computer device and storage medium |
CN110598213A (en) * | 2019-09-06 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and storage medium |
WO2021056709A1 (en) * | 2019-09-24 | 2021-04-01 | 平安科技(深圳)有限公司 | Method and apparatus for recognizing similar questions, computer device and storage medium |
CN111222317A (en) * | 2019-10-16 | 2020-06-02 | 平安科技(深圳)有限公司 | Sequence labeling method, system and computer equipment |
WO2021072852A1 (en) * | 2019-10-16 | 2021-04-22 | 平安科技(深圳)有限公司 | Sequence labeling method and system, and computer device |
CN111160017A (en) * | 2019-12-12 | 2020-05-15 | 北京文思海辉金信软件有限公司 | Keyword extraction method, phonetics scoring method and phonetics recommendation method |
CN111144110A (en) * | 2019-12-27 | 2020-05-12 | 科大讯飞股份有限公司 | Pinyin marking method, device, server and storage medium |
CN111814466A (en) * | 2020-06-24 | 2020-10-23 | 平安科技(深圳)有限公司 | Information extraction method based on machine reading understanding and related equipment thereof |
CN111783471A (en) * | 2020-06-29 | 2020-10-16 | 中国平安财产保险股份有限公司 | Semantic recognition method, device, equipment and storage medium of natural language |
WO2021151292A1 (en) * | 2020-08-28 | 2021-08-05 | 平安科技(深圳)有限公司 | Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium |
CN112085091A (en) * | 2020-09-07 | 2020-12-15 | 中国平安财产保险股份有限公司 | Artificial intelligence-based short text matching method, device, equipment and storage medium |
CN112069319A (en) * | 2020-09-10 | 2020-12-11 | 杭州中奥科技有限公司 | Text extraction method and device, computer equipment and readable storage medium |
CN112417886A (en) * | 2020-11-20 | 2021-02-26 | 平安普惠企业管理有限公司 | Intention entity information extraction method and device, computer equipment and storage medium |
CN112487807A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolution neural network |
CN112699213A (en) * | 2020-12-23 | 2021-04-23 | 平安普惠企业管理有限公司 | Speech intention recognition method and device, computer equipment and storage medium |
CN112906395A (en) * | 2021-03-26 | 2021-06-04 | 平安科技(深圳)有限公司 | Drug relationship extraction method, device, equipment and storage medium |
CN112949320A (en) * | 2021-03-30 | 2021-06-11 | 平安科技(深圳)有限公司 | Sequence labeling method, device, equipment and medium based on conditional random field |
CN113255320A (en) * | 2021-05-13 | 2021-08-13 | 北京熙紫智数科技有限公司 | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism |
CN113268953A (en) * | 2021-07-15 | 2021-08-17 | 中国平安人寿保险股份有限公司 | Text key word extraction method and device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
沈华东,等: "AM-BRNN: 一种基于深度学习的文本摘要自动抽取模型", 小型微型计算机系统, vol. 39, no. 6, 31 December 2018 (2018-12-31), pages 1184 - 1189 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114328939A (en) * | 2022-03-17 | 2022-04-12 | 天津思睿信息技术有限公司 | Natural language processing model construction method based on big data |
CN114328939B (en) * | 2022-03-17 | 2022-05-27 | 天津思睿信息技术有限公司 | Natural language processing model construction method based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112685565B (en) | Text classification method based on multi-mode information fusion and related equipment thereof | |
CN112863683A (en) | Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium | |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language | |
CN113947095B (en) | Multilingual text translation method, multilingual text translation device, computer equipment and storage medium | |
CN112860919B (en) | Data labeling method, device, equipment and storage medium based on generation model | |
CN111767375A (en) | Semantic recall method and device, computer equipment and storage medium | |
CN112632278A (en) | Labeling method, device, equipment and storage medium based on multi-label classification | |
CN112188311B (en) | Method and apparatus for determining video material of news | |
CN112287069A (en) | Information retrieval method and device based on voice semantics and computer equipment | |
CN112836521A (en) | Question-answer matching method and device, computer equipment and storage medium | |
CN112598039B (en) | Method for obtaining positive samples in NLP (non-linear liquid) classification field and related equipment | |
CN112949320B (en) | Sequence labeling method, device, equipment and medium based on conditional random field | |
CN112232052B (en) | Text splicing method, text splicing device, computer equipment and storage medium | |
CN112199954B (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
CN112699213A (en) | Speech intention recognition method and device, computer equipment and storage medium | |
CN111353311A (en) | Named entity identification method and device, computer equipment and storage medium | |
CN113657105A (en) | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement | |
CN111368551A (en) | Method and device for determining event subject | |
CN115438149A (en) | End-to-end model training method and device, computer equipment and storage medium | |
CN114780701B (en) | Automatic question-answer matching method, device, computer equipment and storage medium | |
CN115757731A (en) | Dialogue question rewriting method, device, computer equipment and storage medium | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN118193668A (en) | Text entity relation extraction method and device | |
CN113657104A (en) | Text extraction method and device, computer equipment and storage medium | |
CN112417886A (en) | Intention entity information extraction method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220607 Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd. Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001 Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd. |