CN113408296B - Text information extraction method, device and equipment - Google Patents

Text information extraction method, device and equipment

Info

Publication number
CN113408296B
CN113408296B CN202110707811.5A
Authority
CN
China
Prior art keywords
text
processed
model
sequence
sequence labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110707811.5A
Other languages
Chinese (zh)
Other versions
CN113408296A (en)
Inventor
刘禄
廖锐
刘志伟
王海永
杨雪
张春龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN202110707811.5A priority Critical patent/CN113408296B/en
Publication of CN113408296A publication Critical patent/CN113408296A/en
Application granted granted Critical
Publication of CN113408296B publication Critical patent/CN113408296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/253: Grammatical analysis; Style critique
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the present application disclose a text information extraction method, device, and equipment. Text features and part-of-speech features of a text to be processed are extracted and fused to obtain a text fusion feature, which is input into the first-level sequence labeling model to label the information items to be extracted corresponding to that level. The resulting labeling result is fused with the text fusion feature to obtain an updated text fusion feature. By successively replacing the current-level sequence labeling model, the sequence labeling models of all levels label the text in turn, yielding a labeling result from each level. The labeling results output by the sequence labeling models of all levels are then analyzed to obtain the information extraction contents of the information items to be extracted at different levels included in the text to be processed. On the basis of automatic extraction, the method obtains more accurate text information from the text to be processed.

Description

Text information extraction method, device and equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, and a device for extracting text information.
Background
Text contains a large amount of text information. When extracting text information, part of the text may have an irregular or incomplete structure and lack a preset structural model, making its text information difficult to extract directly. For example, in the medical field, medical record text is written freely by doctors.
Currently, such text usually requires additional text processing before its information can be extracted. However, the extraction process is complicated, and the accuracy of the obtained text information is low. How to extract text information efficiently and accurately is therefore a problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, an apparatus, and a device for extracting text information, which can label a text to be processed through a multi-level sequence labeling model, and obtain more accurate text information by using a labeling result, so as to implement efficient and accurate text information extraction.
In order to solve the above problems, the technical solution provided in the embodiments of the present application is as follows:
a text information extraction method, the method comprising:
extracting text features and part-of-speech features of a text to be processed with a preset length;
fusing the text features and the part-of-speech features of the text to be processed to obtain a text fusion feature of the text to be processed;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting the text fusion characteristics of the text to be processed into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the text to be processed, which is output by the sequence labeling model of the current level;
judging whether a sequence labeling model of the next level exists or not;
if a sequence labeling model of the next level exists, fusing a labeling result of the text to be processed output by the sequence labeling model of the current level with text fusion characteristics of the text to be processed, and obtaining the text fusion characteristics of the text to be processed again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion feature of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, labeling results of the text to be processed, which are output by the sequence labeling models of all levels, are obtained;
analyzing the labeling results of the text to be processed output by the sequence labeling models of all levels, and obtaining information extraction contents of information items to be extracted at different levels included in the text to be processed.
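The iterative labeling loop above can be sketched as follows. This is a hedged illustration only: the feature extractors and per-level sequence labeling models are stand-ins (random projections and random tag outputs), not the patent's actual trained models, and the dimensions `M` and `N` are arbitrary.

```python
import numpy as np

M, N = 8, 16  # preset text length m, feature width n (illustrative values)
rng = np.random.default_rng(0)

def label_with_model(level: int, fused: np.ndarray) -> np.ndarray:
    """Stand-in for the level's sequence labeling model: returns an m×1 tag vector."""
    return rng.integers(0, 3, size=(fused.shape[0], 1))

def fuse(features: np.ndarray, extra_m1: np.ndarray) -> np.ndarray:
    """Map an m×1 vector into m×n (hypothetical projection) and add it to the m×n features."""
    projection = rng.normal(size=(1, features.shape[1]))
    return features + extra_m1 @ projection

text_features = rng.normal(size=(M, N))            # stand-in text features
pos_features_m1 = rng.integers(0, 5, size=(M, 1))  # stand-in part-of-speech ids, m×1

fused = fuse(text_features, pos_features_m1)       # initial text fusion feature
results = []
for level in range(3):                             # e.g. three model levels
    labels = label_with_model(level, fused)        # m×1 labeling result of this level
    results.append(labels)
    fused = fuse(fused, labels)                    # feed the result into the next level

print(len(results), results[0].shape)
```

Each level thus sees the previous level's labeling result folded back into the fused features, which is the mechanism the steps above describe.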
In one possible implementation manner, before extracting the text feature and the part-of-speech feature of the text to be processed with the preset length, the method further includes:
filtering redundant information from and desensitizing sensitive information in an original text to obtain a first target text;
if the length of the first target text is greater than a preset length, segmenting the first target text into a plurality of second target texts each with a length less than or equal to the preset length, and padding each second target text to the preset length to generate texts to be processed;
if the length of the first target text is less than the preset length, padding the first target text to the preset length to generate a text to be processed;
and if the length of the first target text is equal to the preset length, determining the first target text as a text to be processed.
In one possible implementation manner, after obtaining the information extraction content of the information items to be extracted of different levels included in the text to be processed, the method further includes:
acquiring text features of a target information extraction content and text features of a target term text, wherein the target information extraction content is any item of the information extraction contents, and the target term text is any item of predetermined term texts;
matching the text characteristics of the target information extraction content with the text characteristics of the target term text;
and if the text characteristics of the target information extraction content are matched with the text characteristics of the target term text, replacing the target information extraction content with the target term text.
In one possible implementation, the method further includes:
initializing sequence labeling models of all levels;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting the text fusion characteristics of the training text into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the training text output by the sequence labeling model of the current level;
obtaining a loss value of the sequence labeling model of the current level according to the standard labeling result, in the training text, of the information item to be extracted corresponding to the sequence labeling model of the current level and the labeling result of the training text output by the sequence labeling model of the current level;
judging whether a sequence labeling model of the next level exists or not;
if a sequence labeling model of the next level exists, fusing a labeling result of the training text output by the sequence labeling model of the current level with text fusion characteristics of the training text, and obtaining the text fusion characteristics of the training text again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion feature of the training text into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, obtaining the loss value of the sequence labeling model of each level;
weighting and summing the loss values of the sequence labeling models of all levels to obtain a comprehensive loss value, and adjusting the sequence labeling models of all levels according to the comprehensive loss value;
and re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until a preset stopping condition is reached, obtaining the trained sequence labeling models of all levels.
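The joint objective in the training procedure above, a weighted sum of the per-level losses, can be sketched as below. The per-level loss values and weights are illustrative placeholders, not values from the patent.

```python
import numpy as np

def combined_loss(level_losses, weights=None) -> float:
    """Weighted sum of the loss values of the sequence labeling models of all levels."""
    losses = np.asarray(level_losses, dtype=float)
    if weights is None:
        weights = np.ones_like(losses)      # equal weighting by default
    weights = np.asarray(weights, dtype=float)
    return float(weights @ losses)          # comprehensive loss value

loss = combined_loss([0.8, 0.5, 0.3], weights=[1.0, 0.5, 0.5])
print(loss)  # 0.8*1.0 + 0.5*0.5 + 0.3*0.5 = 1.2
```

All levels are then adjusted against this single comprehensive loss value, so the gradient of every level's model reflects the quality of the whole cascade rather than of one level in isolation.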
In one possible implementation manner, the number of layers of the sequence labeling model and the information items to be extracted corresponding to the sequence labeling models of all the layers are predetermined according to the layers of the information items to be extracted.
In one possible implementation manner, the extracting text features and part-of-speech features of the text to be processed with the preset length includes:
inputting a text to be processed with a preset length into an ERNIE model to obtain text features of the text to be processed; the text features of the text to be processed represent the grammar, semantics, and positions of the characters in the text to be processed; the text feature of the text to be processed is an m×n-dimensional text feature vector, where m is the preset length and n is a positive integer;
inputting the text to be processed into a part-of-speech recognition model to obtain part-of-speech features of the text to be processed, wherein the part-of-speech feature of the text to be processed is an m×1-dimensional part-of-speech feature vector.
In one possible implementation manner, the fusing the text feature and the part-of-speech feature of the text to be processed to obtain the text fusion feature of the text to be processed includes:
mapping the m×1-dimensional part-of-speech feature vector into an m×n-dimensional part-of-speech feature vector;
and fusing the m×n-dimensional part-of-speech feature vector with the m×n-dimensional text feature vector to obtain the text fusion feature of the text to be processed, which is an m×n-dimensional text fusion feature vector.
In one possible implementation manner, the labeling result of the text to be processed output by the sequence labeling model of the current level is an m×1-dimensional labeling result vector;
fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again, wherein the method comprises the following steps:
mapping the m×1-dimensional labeling result vector into an m×n-dimensional labeling result vector;
and fusing the m×n-dimensional labeling result vector with the m×n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
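The two implementations above share one operation: an m×1 vector (part-of-speech feature or labeling result) is mapped into m×n and fused with the m×n features. A minimal sketch, assuming an embedding-table mapping and additive fusion (the patent does not fix either choice):

```python
import numpy as np

m, n = 4, 6                                # illustrative dimensions
rng = np.random.default_rng(42)

embedding = rng.normal(size=(10, n))       # hypothetical id -> n-dim mapping table
ids_m1 = np.array([[1], [3], [0], [7]])    # m×1 vector of category ids
mapped = embedding[ids_m1[:, 0]]           # mapped into m×n

features = rng.normal(size=(m, n))         # existing m×n text fusion feature vector
fused = features + mapped                  # updated m×n text fusion feature
print(fused.shape)
```

The same shapes apply whether the m×1 input is the part-of-speech feature vector (first fusion) or a labeling result vector (re-fusion before the next level).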
A text information extraction apparatus, the apparatus comprising:
the extraction unit is used for extracting text features and part-of-speech features of the text to be processed with preset length;
the first fusion unit is used for fusing the text characteristics and the part-of-speech characteristics of the text to be processed to obtain text fusion characteristics of the text to be processed;
the first determining unit is used for determining the sequence labeling model of the first level as the sequence labeling model of the current level;
the first labeling unit is used for inputting the text fusion characteristics of the text to be processed into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the text to be processed, which is output by the sequence labeling model of the current level;
the first judging unit is used for judging whether a sequence labeling model of the next level exists or not;
the second fusion unit is used for fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion characteristics of the text to be processed if the sequence labeling model of the next level exists, and obtaining the text fusion characteristics of the text to be processed again;
the second determining unit is used for determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion feature of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
the first acquisition unit is used for acquiring the labeling result of the text to be processed output by the sequence labeling model of each level if the sequence labeling model of the next level does not exist;
and the analysis unit is used for analyzing the labeling results of the texts to be processed, which are output by the sequence labeling models of all the layers, and obtaining information extraction contents of information items to be extracted of different layers included in the texts to be processed.
A text information extraction apparatus, comprising: a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the above text information extraction method when executing the computer program.
A computer readable storage medium having instructions stored therein which, when executed on a terminal device, cause the terminal device to perform the text information extraction method described above.
From the above, the embodiments of the present application have the following beneficial effects:
according to the text information extraction method, device and equipment, the text characteristics and the part-of-speech characteristics of the text to be processed are extracted, so that the characteristics of the text to be processed are extracted, and more comprehensive characteristic information of the text to be processed can be obtained. The part-of-speech features are helpful for more accurately determining the information extraction content, and the accuracy of the obtained information extraction content can be improved. And inputting the text fusion characteristics obtained by fusing the text characteristics and the part-of-speech characteristics into a sequence labeling model of the first level, so that the information items to be extracted corresponding to the current level can be labeled. And fusing the obtained labeling result with the text fusion feature to obtain updated text fusion feature. The sequence labeling models of all levels can be labeled sequentially by replacing the sequence labeling model of the current level, and labeling results of the sequence labeling models of all levels are obtained. The labeling results of the sequence labeling models of all levels are fused with the text fusion features to be used as the input of the sequence labeling model of the next level, so that the sequence labeling model can be labeled based on the labeling results of the sequence labeling model of the previous level, and the accuracy of the labeling results of the sequence labeling model is improved. According to the labeling results of the sequence labeling models of all the layers, the information extraction contents of the information items to be extracted of different layers in the text to be processed can be obtained, multi-layer extraction of the information extraction contents can be realized, and double extraction of the relationship between the information extraction contents and the information extraction contents is considered. 
In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents having various meanings can be achieved. Therefore, more accurate text information of the text to be processed is obtained on the basis of automatically extracting the text information.
Drawings
Fig. 1 is a schematic frame diagram of an exemplary application scenario provided in an embodiment of the present application;
fig. 2 is a flowchart of a text information extraction method provided in an embodiment of the present application;
fig. 3 is a flowchart of another text information extraction method provided in an embodiment of the present application;
fig. 4 is a flowchart of another text information extraction method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of matching text features of extracted content with text features of target term text according to an embodiment of the present application;
fig. 6 is a flowchart of another text information extraction method provided in an embodiment of the present application;
fig. 7 is a flowchart for extracting text features and part-of-speech features of a text to be processed with a preset length according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text information extraction device according to an embodiment of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures and detailed description are described in further detail below.
In order to facilitate understanding and explanation of the technical solutions provided by the embodiments of the present application, the background art of the present application will be described first.
Research on conventional text processing shows that the text generated daily contains a large amount of text information. Extracting this information from the text enables its subsequent processing and utilization. For example, extracting text information from medical record text written by doctors yields information related to diseases and medicines, which can then be analyzed so that the medical information can be organized and utilized. However, some text, such as unstructured text, complicates extraction. For such text, structured text could be generated directly by changing the way the text is produced, but this makes text generation inconvenient and is difficult to apply widely. Alternatively, the text can be processed to extract its information and realize structured processing. However, current text information extraction methods have complex implementation processes and low accuracy, and can hardly meet the requirements of text information extraction.
Based on the above, the embodiments of the present application provide a method, an apparatus, and a device for extracting text information. By extracting both the text features and the part-of-speech features of the text to be processed, more comprehensive feature information of the text to be processed can be obtained. The part-of-speech features help determine the information extraction contents more accurately and improve their accuracy. The text fusion feature obtained by fusing the text features and the part-of-speech features is input into the first-level sequence labeling model, so that the information items to be extracted corresponding to the current level can be labeled. The obtained labeling result is fused with the text fusion feature to obtain an updated text fusion feature. By successively replacing the current-level sequence labeling model, the sequence labeling models of all levels can label in turn, producing a labeling result from each level. Because each level's labeling result is fused with the text fusion feature and used as the input of the next level's sequence labeling model, each model can label based on the labeling result of the previous level, improving the accuracy of its labeling result. From the labeling results of the sequence labeling models of all levels, the information extraction contents of the information items to be extracted at different levels in the text to be processed can be obtained, realizing multi-level extraction that takes into account both the information extraction contents and the relationships between them.
In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents having various meanings can be achieved. Therefore, more accurate text information of the text to be processed is obtained on the basis of automatically extracting the text information.
In order to facilitate understanding of a text information extraction method provided in the embodiments of the present application, the following description is made with reference to a scenario example shown in fig. 1. Referring to fig. 1, the diagram is a schematic frame diagram of an exemplary application scenario provided in an embodiment of the present application.
In practical application, a text requiring text information extraction is first taken as the text to be processed; the text features and part-of-speech features of the text to be processed with a preset length are extracted and fused to obtain the text fusion feature of the text to be processed. The text to be processed is then labeled by a multi-level sequence labeling model, for example one with three levels. First, the first-level sequence labeling model is determined as the current-level sequence labeling model, the text fusion feature is input into it, and the information items to be extracted corresponding to it are labeled, obtaining the labeling result of the text to be processed output by the current-level, that is, the first-level, sequence labeling model. The obtained labeling result and the text fusion feature are fused to obtain the re-fused text fusion feature of the text to be processed. The second-level sequence labeling model is then determined as the current-level sequence labeling model, and the re-fused text fusion feature is input into it to obtain the corresponding labeling result. The labeling result output by the second-level sequence labeling model is again fused with the text fusion feature.
The third-level sequence labeling model is then determined as the current-level sequence labeling model, and the fused text fusion feature is input into it to obtain the labeling result of the text to be processed labeled and output by the third-level sequence labeling model. After all three levels of sequence labeling models have labeled the text, their labeling results are obtained and analyzed to obtain the information extraction contents of the information items to be extracted at different levels included in the text to be processed. In this way, information items to be extracted at different levels can be extracted from the text to be processed, and the accuracy of the text information can be improved on the basis of automatic extraction.
Those skilled in the art will appreciate that the frame diagram shown in fig. 1 is but one example in which embodiments of the present application may be implemented. The scope of applicability of the embodiments of the application is not limited in any way by the framework.
Based on the above description, a text information extraction method provided in the present application will be described in detail with reference to the accompanying drawings.
Referring to fig. 2, the flowchart of a text information extraction method provided in the embodiment of the present application is shown in fig. 2, where the method may include S201-S209:
s201: extracting text features and part-of-speech features of a text to be processed with a preset length.
The text to be processed is the text which needs text information extraction. The text to be processed may be unstructured text, for example, in the medical field, the text to be processed may be medical text to be processed, such as medical record text or diagnosis text written by a doctor.
In order to facilitate feature extraction of the text to be processed, the length of the text to be processed may be set to a preset length. The preset length may specifically represent the number of characters included in the text to be processed. The preset length can be set according to the need of processing the text to be processed. In one possible implementation, the preset length may be specifically set to 512 characters.
And extracting text features and part-of-speech features of the text to be processed with a preset length. The text features refer to features of each character in the text to be processed in terms of text structure, such as text position, grammar, semantics and the like. The part-of-speech feature refers to the feature of each character in the text to be processed in terms of lexical properties.
By extracting the characteristics of the text to be processed in terms of text structure and vocabulary characteristics, the characteristics of the complete text to be processed can be obtained, and further, text information of the text to be processed can be extracted more accurately.
In one possible implementation manner, an embodiment of the present application provides a specific implementation manner of extracting text features and part-of-speech features of a text to be processed with a preset length, which is specifically referred to below.
S202: and fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed.
Based on the extracted text features and part-of-speech features of the text to be processed, fusing the text features and the part-of-speech features to obtain text fusion features of the text to be processed, wherein the text fusion features comprise two aspects of features.
Specifically, the text feature corresponding to the text to be processed may be denoted as α, the corresponding part-of-speech feature may be denoted as β, and the text fusion feature may be denoted as α+β.
In one possible implementation manner, the embodiment of the present application provides a specific implementation manner of fusing text features and part-of-speech features of a text to be processed to obtain text fusion features of the text to be processed, which is specifically referred to below.
S203: and determining the sequence labeling model of the first level as the sequence labeling model of the current level.
The sequence labeling model labels the information items to be extracted corresponding to it based on the input text fusion feature, and generates a labeling result of the text to be processed corresponding to the sequence labeling model. The sequence labeling model may specifically include a CRF (conditional random field) layer. Based on the input text fusion feature, the CRF layer labels each character in the text to be processed with a corresponding tag, and the tag information of the text to be processed is then parsed in combination with a tag rule to obtain the labeling result. Specifically, the tag rule may be the BIO rule.
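The parsing of per-character tags under the BIO rule mentioned above can be sketched as follows. The tag names (`SYM`, `DUR`) and the example sentence are illustrative, not taken from the patent.

```python
def decode_bio(chars, tags):
    """Turn per-character BIO tags (as a CRF layer might emit) into labeled spans."""
    spans, current, label = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):           # beginning of a new entity
            if current:
                spans.append(("".join(current), label))
            current, label = [ch], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(ch)             # continuation of the current entity
        else:                              # "O" or inconsistent tag ends the entity
            if current:
                spans.append(("".join(current), label))
            current, label = [], None
    if current:
        spans.append(("".join(current), label))
    return spans

chars = list("fever for 3 days")
tags = ["B-SYM", "I-SYM", "I-SYM", "I-SYM", "I-SYM",
        "O", "O", "O", "O", "O",
        "B-DUR", "I-DUR", "I-DUR", "I-DUR", "I-DUR", "I-DUR"]
print(decode_bio(chars, tags))
```

Each labeled span corresponds to one extracted content of the information item that the level's sequence labeling model is responsible for.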
It should be noted that, in the embodiment of the present application, the sequence labeling model has multiple layers, and the information items to be extracted corresponding to the sequence labeling model of each layer are different. The hierarchy of the sequence annotation model and the corresponding information items to be extracted can be set according to the extraction requirement of the text information.
In one possible implementation manner, the number of levels of the sequence labeling model and the information items to be extracted corresponding to each level are predetermined according to the hierarchy of the information items to be extracted. The hierarchy of information items to be extracted refers to information items describing different levels and categories: information items of the same level generally appear in a parallel relationship in text, while information items of different levels are more likely to appear in an inclusive relationship. For example, among three information items to be extracted, "disease diagnosis", "lesion site" and "lesion size", the item "disease diagnosis" is at a different level from "lesion site" and "lesion size", while "lesion site" and "lesion size" are at the same level. Specifically, in a medical text the information items to be extracted may be classified into levels according to their scope, from large to small; for example, they may be divided into four levels: diseases, treatment methods corresponding to the diseases, treatment apparatuses or medicines, and types of specific apparatuses or medicines. As another example, the levels of the information items to be extracted may also correspond to different types, divided according to the different meanings of the information items; for a text with multiple meanings, the different meanings can be set as different levels of the information items to be extracted.
The hierarchy of the information items to be extracted can be set according to the extraction requirement of the text information. Specifically, for example, if 5 layers of information items to be extracted are required to be extracted from the text to be processed, the number of layers of the corresponding sequence labeling model is set to 5, and the information items to be extracted corresponding to the sequence labeling model with 5 layers are respectively the 5 corresponding layers of information items to be extracted.
Based on the sequence labeling models of multiple layers, the to-be-processed text can be labeled with the to-be-extracted information items of multiple layers, so that multi-layer text information extraction is realized.
The labeling process of the sequence labeling models of all levels is a serial processing mode, and features are required to be sequentially input into the sequence labeling models of all levels for labeling. And determining the sequence labeling model of the first level as the sequence labeling model of the current level.
S204: inputting the text fusion characteristics of the text to be processed into a sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the text to be processed output by the sequence labeling model of the current level.
And inputting the text fusion characteristics of the text to be processed, which are obtained by fusing the text characteristics and the part-of-speech characteristics of the text to be processed, into a sequence annotation model of the current level. And marking the information items to be extracted corresponding to the sequence marking model of the current level by using the sequence marking model of the current level. If the text fusion feature of the text to be processed has the information item to be extracted corresponding to the sequence labeling model of the current level, the sequence labeling model of the current level labels the corresponding information item to be extracted, and then a labeling result of the text to be processed, which is output by the sequence labeling model of the current level, is obtained.
S205: and judging whether a sequence labeling model of the next level exists or not.
After the sequence labeling model of the current level finishes labeling and the corresponding labeling result is obtained, whether a sequence labeling model of the next level exists is judged. Because the embodiment of the present application uses multiple levels of sequence labeling models, if the sequence labeling model of the current level is the sequence labeling model of the first level, a sequence labeling model of the next level necessarily exists, and S206 and the subsequent steps are executed. If the sequence labeling model of the current level is the sequence labeling model of the second or a later level, a sequence labeling model of the next level may or may not exist. If a sequence labeling model of the next level exists, S206 and the subsequent steps are executed; if not, S208 and the subsequent steps are executed.
S206: if the sequence labeling model of the next level exists, fusing a labeling result of the text to be processed output by the sequence labeling model of the current level with text fusion characteristics of the text to be processed, and obtaining the text fusion characteristics of the text to be processed again.
When the sequence labeling model of the next level exists, labeling the text to be processed by using the sequence labeling model of the next level is needed.
In order to improve the accuracy of the sequence labeling model, considering that the information items to be extracted corresponding to the sequence labeling models of different levels have correlation, fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion characteristics of the text to be processed to obtain the text fusion characteristics of the text to be processed after re-fusion.
Specifically, let x_n denote the text fusion feature of the re-fused text to be processed. In one implementation, x_n = α + β + γ_n, where n denotes the level of the current sequence labeling model, γ_n denotes the labeling result of the sequence labeling model of the n-th level, and x_n denotes the text fusion feature obtained by fusing that labeling result with the text fusion feature of the text to be processed. In another implementation, x_n = concat(α + β, γ_n), where concat(α + β, γ_n) splices α + β with γ_n: for example, if α and β are m×n arrays and γ_n is an m×1 array, then α + β is an m×n array and x_n is an m×(n+1) array. In one possible implementation manner, the embodiment of the present application provides a specific implementation of fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed, so as to re-obtain the text fusion feature of the text to be processed, which is specifically referred to below.
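The two fusion variants can be sketched with NumPy; the shapes follow the example above (α and β as m×n arrays, γ_n as m×1), and all numeric values here are placeholders:

```python
import numpy as np

m, n = 4, 3                          # m characters, n feature dimensions (placeholders)
alpha = np.ones((m, n))              # text feature α
beta = np.ones((m, n)) * 0.5         # part-of-speech feature β
gamma_n = np.ones((m, 1)) * 0.25     # labeling result γ_n of the current level

# Additive fusion x_n = α + β + γ_n (γ_n is broadcast across the n columns)
x_add = alpha + beta + gamma_n                           # shape (m, n)

# Splicing fusion x_n = concat(α + β, γ_n)
x_cat = np.concatenate([alpha + beta, gamma_n], axis=1)  # shape (m, n + 1)
```

With these placeholder values, every entry of `x_add` is 1.75, and the last column of `x_cat` holds the 0.25 labeling values.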
S207: and determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the text fusion characteristics of the text to be processed and inputting the text fusion characteristics into the sequence labeling model of the current level and the subsequent steps.
Changing the hierarchy of the sequence labeling model corresponding to the sequence labeling model of the current hierarchy, and determining the sequence labeling model of the next hierarchy as the sequence labeling model of the current hierarchy. And after the sequence labeling model of the current level is redetermined, re-executing S204 and subsequent steps to realize labeling of the text to be processed by using the sequence labeling model of the current level and corresponding updating of the text fusion characteristics of the text to be processed.
S208: and if the sequence labeling model of the next level does not exist, labeling results of the text to be processed output by the sequence labeling models of all levels are obtained.
If the sequence labeling model of the current level is the sequence labeling model of the last level, the sequence labeling model of the next level does not exist, and labeling of the sequence labeling model is finished. And obtaining the labeling results of the texts to be processed output by the sequence labeling models of all the layers, and extracting text information by using the labeling results of the texts to be processed output by the sequence labeling models of all the layers.
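The serial pass over the levels (S203 to S208) can be sketched as a loop; the model and fusion callables below are hypothetical stand-ins for the trained sequence labeling models and the re-fusion step:

```python
def label_all_levels(base_fusion_feature, level_models, refuse):
    """Run the sequence labeling models of each level in series.

    level_models: labeling models ordered from the first level onward;
    refuse:       fuses a level's labeling result back into the base
                  text fusion feature (as in x_n = α + β + γ_n).
    """
    results = []
    feature = base_fusion_feature             # S203: start at the first level
    for i, model in enumerate(level_models):
        result = model(feature)               # S204: label the current level
        results.append(result)
        if i + 1 < len(level_models):         # S205: does a next level exist?
            feature = refuse(base_fusion_feature, result)  # S206: re-fuse
    return results                            # S208: results of every level
```

With toy numeric "models", `label_all_levels(1, [lambda f: f + 1, lambda f: f * 2], lambda base, r: base + r)` returns the per-level results `[2, 6]`.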
S209: analyzing the labeling results of the to-be-processed text output by the sequence labeling models of all the layers to obtain information extraction contents of to-be-extracted information items of different layers included in the to-be-processed text.
The labeling results of the texts to be processed output by the sequence labeling models of all the layers comprise the relevant content of the information to be extracted from the texts to be processed. Analyzing the labeling result of the text to be processed output by the obtained sequence labeling model of each level, further obtaining information extraction content corresponding to the information items to be extracted of different levels in the text to be processed, wherein the obtained information extraction content is text information included in the text to be processed, and further realizing the structuring of the text to be processed.
Based on the above related content of S201-S209, it can be seen that extracting both the text features and the part-of-speech features of the text to be processed yields more comprehensive feature information about the text. The text fusion features obtained by fusing the text features and the part-of-speech features are input into the sequence labeling model of the first level, so that the information items to be extracted corresponding to that level can be labeled. The obtained labeling result is then fused with the text fusion features to obtain updated text fusion features. By advancing the current level, the sequence labeling models of all levels can be applied in turn, and the labeling results of each are obtained. Because the labeling result of each level is fused with the text fusion features and used as the input of the next level, each sequence labeling model can label based on the labeling results of the previous level, which improves the accuracy of its labeling results. Finally, the information extraction contents in the text to be processed are determined from the labeling results of the sequence labeling models of all levels, so that multi-level information extraction is realized: both the information extraction contents and the relations between them are captured, contents with various meanings can be extracted, and the accuracy of text information extraction is improved.
Therefore, more accurate text information of the text to be processed is obtained on the basis of automatically extracting the text information.
It can be appreciated that, in order to facilitate more accurate feature extraction, the original text from which the text information is extracted needs to be preprocessed to obtain text meeting the requirements of subsequent processing.
Correspondingly, an embodiment of the present application provides a text information extraction method, and referring to fig. 3, the fig. is a flowchart of another text information extraction method provided in the embodiment of the present application. Before extracting the text features and the part-of-speech features of the text to be processed with the preset length, the method further comprises the following four steps.
S301: and filtering redundant information and desensitizing sensitive information on the original text to obtain a first target text.
The original text is text that has not been preprocessed and from which text information needs to be extracted. Redundant information may be present in the original text. Redundant information may refer to text with repeated meaning, and to symbols and words without specific meaning. Text with repeated meaning may refer to repeated content occurring in the original text due to writing errors; for example, when filling in a template, the written text may repeat the template content, producing redundant information in the final generated text. Symbols and words without specific meaning are useless symbols and words that carry no semantics, e.g., stop words.
Redundant information can interfere with text information extraction and filtering of redundant information in the original text is required. The embodiment of the application is not limited to a filtering manner of the redundant information, and in a possible implementation manner, the redundant information can be preset in a preset dictionary, and then the redundant information in the original text is removed based on the preset dictionary.
The original text also has sensitive information, and the sensitive information refers to information which exists in the original text and is inconvenient to disclose. For example, if the original text is a medical record text, privacy information such as patient name, patient address, etc. in the medical record text is sensitive information. Sensitive information in the original text is not relevant to the extraction of the text information.
And (5) desensitizing sensitive information in the original text. And obtaining the first target text subjected to redundant information filtering and sensitive information desensitization.
In addition, some special symbols in the original text can be replaced. Specifically, common vocabulary and symbols can be replaced based on a preset dictionary, so that the replaced text meets the requirement of feature extraction.
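A minimal sketch of S301, assuming a hypothetical preset dictionary of redundant tokens and a hypothetical regex rule for one kind of sensitive field (the patent does not fix concrete formats for either):

```python
import re

REDUNDANT = ["(see above)", "N/A"]                       # hypothetical preset dictionary
SENSITIVE = [(re.compile(r"Name:\s*\S+"), "Name: ***")]  # hypothetical desensitization rule

def preprocess(original_text):
    """Filter redundant information and desensitize sensitive information."""
    text = original_text
    for token in REDUNDANT:                   # remove redundant tokens
        text = text.replace(token, "")
    for pattern, mask in SENSITIVE:           # mask sensitive fields
        text = pattern.sub(mask, text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

For example, `preprocess("Name: John N/A has fever (see above)")` yields the first target text `"Name: *** has fever"`.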
S302: if the length of the first target text is greater than the preset length, the first target text is segmented into a plurality of second target texts with the length smaller than or equal to the preset length, the lengths of the second target texts are complemented to the preset length, and the text to be processed is generated.
After the first target text is obtained, the length of the first target text is required to be processed, and the text to be processed with the preset length is obtained.
When the length of the first target text is greater than the preset length, the length of the first target text needs to be reduced, and the first target text is cut into a plurality of second target texts with lengths smaller than or equal to the preset length. Specifically, the first target text may be cut at specific symbols; for example, it may be cut at periods, line feed symbols, and the like. The segmentation of the first target text is performed on the premise that the content of the first target text is not affected, and the specific segmentation method can be determined according to how the text in the first target text is originally separated.
And for the second target text with the length being the preset length, the second target text can be directly used as the text to be processed. And for the second target text with the length smaller than the preset length, complementing the length of the second target text to the preset length, and generating a text to be processed. In one possible implementation, the second target text that is smaller than the preset length may be complemented by a space-occupying symbol with the preset length.
S303: and if the length of the first target text is smaller than the preset length, the length of the first target text is complemented to the preset length, and the text to be processed is generated.
When the length of the first target text is smaller than the preset length, the length of the first target text needs to be supplemented. In one possible implementation, the first target text smaller than the preset length may be complemented with a placeholder symbol to generate the text to be processed.
S304: and if the length of the first target text is equal to the preset length, determining the first target text as the text to be processed.
When the length of the first target text is equal to the preset length, the length of the first target text is satisfied as the length of the text to be processed, and the first target text is directly determined as the text to be processed.
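Steps S302 to S304 can be sketched as follows; the placeholder symbol and the cut symbols are hypothetical choices, and a real splitter would also need a fallback for a single sentence longer than the preset length:

```python
def to_fixed_length(text, preset_len, pad="_"):
    """Cut text at sentence-ending symbols, then pad each piece to preset_len."""
    if len(text) <= preset_len:
        pieces = [text]                       # S303/S304: no cutting needed
    else:
        pieces, buf = [], ""                  # S302: cut at specific symbols
        for ch in text:
            buf += ch
            if ch in ".。\n":
                pieces.append(buf)
                buf = ""
        if buf:
            pieces.append(buf)
    # complement pieces shorter than preset_len with a placeholder symbol
    return [p + pad * (preset_len - len(p)) for p in pieces]
```

For example, `to_fixed_length("ab.cd", 4)` cuts at the period and pads, yielding `["ab._", "cd__"]`, while a text already at the preset length is returned unchanged.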
In the embodiment of the application, the original text is preprocessed, and the first target text is adjusted to be the text to be processed with the preset length, so that the text to be processed obtained after processing meets the requirement of subsequent text information extraction, and the feature extraction and labeling of the text to be processed are convenient to follow.
In one possible scenario, there may be an irregular term in the text to be processed. The information extraction content included in the obtained text to be processed may be an irregular text, so that the information extraction content is inconvenient to further process and use.
In view of the foregoing, in one possible implementation manner, the embodiment of the application provides a text information extraction method. Referring to fig. 4, a flowchart of another text information extraction method according to an embodiment of the present application is shown. After obtaining the information extraction content of the information items to be extracted of different levels included in the text to be processed, the method further comprises the following three steps:
s401: the text feature of target information extraction content and the text feature of target term text are acquired, wherein the target information extraction content is any item of information extraction content, and the target term text is any item of predetermined term text.
The term text may be predetermined for the normalization of the information extraction content. The term text is the standard text used as a replacement. The specific type of term text may be determined based on the type of information extraction content. For example, if the information extraction content is medical text information, the corresponding term text may be standard medical text; for instance, ICD (International Classification of Diseases) version 10 and common disease terms may be used as the term text.
And selecting one information extraction content from the information extraction contents as target information extraction content. Any term text is selected from the predetermined term text as the target term text. Text features of the target information extraction content and text features of the target term text are extracted. The text features of the object information extraction content and the text features of the object term text can be features characterizing semantic and grammatical aspects.
The embodiments of the present application are not limited to a specific implementation manner of extracting the text features of the target information extraction content and the text features of the target term text. In one possible implementation, after determining the target information extraction content and the target term text, the text feature extraction may be performed on the target information extraction content and the target term text using an ERNIE model. In another possible implementation, text features may be extracted using the ERNIE model for target information extraction content. For the term text, text features may be extracted from the term text in advance using an ERNIE model and stored in a database in correspondence with the term text. And directly acquiring corresponding text characteristics after determining the target term text.
S402: and matching the text characteristics of the target information extraction content with the text characteristics of the target term text.
The extracted text features of the target information extraction content and of the target term text can reflect the difference between the two. The text features of the target information extraction content are therefore matched against the text features of the target term text.
In one possible implementation manner, reference is made to fig. 5, which is a schematic diagram provided in an embodiment of the present application of matching the text features of the target information extraction content with the text features of the target term text.
The text features of the target information extraction content and the text features of the target term text may be processed first using PCA (Principal Component Analysis ) techniques.
PCA is a statistical method. Through an orthogonal transformation, a group of possibly correlated variables is converted into a group of linearly uncorrelated variables; the variables obtained after conversion are called principal components. PCA both converts high-dimensional features into low-dimensional ones and makes the reduced features linearly uncorrelated. PCA is applied to reduce the dimensionality of the text features of the target information extraction content and of the target term text, obtaining low-dimensional text features for each. Next, a softmax-based semantic matching algorithm performs binary classification on the low-dimensional text features of the target information extraction content and of each low-dimensional target term text, obtaining a classification result. The similarity between the target information extraction content and the target term text may be determined based on the classification result.
By extracting and matching text features of the target information extraction content and text features of the target term text by using a PCA technology and a softmax semantic matching algorithm, similarity between the target information extraction content and the target term text can be evaluated at a semantic level, and flexibility of feature matching and accuracy of feature matching are effectively improved.
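A numeric sketch of the matching idea, assuming made-up feature vectors; PCA is implemented via SVD, and a softmax over negative distances stands in for the trained two-class matcher, which the patent does not specify in detail:

```python
import numpy as np

def pca(features, k):
    """Project feature rows onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Row 0: feature of the target information extraction content;
# rows 1-2: features of two candidate term texts (all values made up).
feats = np.array([[1.0, 0.9, 0.1],
                  [1.1, 1.0, 0.0],
                  [0.0, 0.1, 1.2]])
low = pca(feats, 2)                                    # dimension reduction
scores = softmax(-np.linalg.norm(low[1:] - low[0], axis=1))
best = int(scores.argmax())                            # best-matching term text
```

Here the first candidate term text is closest to the extraction content in the reduced space, so `best` is 0 and the content would be replaced by that term text.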
Specifically, in order to reduce the matching range, the term text may be layered in advance based on the level of the information item to be extracted, so that the text features of the low-dimensional target information extraction content and the text features of the low-dimensional target term text in the same level are subjected to two classification, thereby further improving the classification efficiency and accuracy.
S403: and if the text characteristics of the target information extraction content are matched with the text characteristics of the target term text, replacing the target information extraction content with the target term text.
If there are text features of the target information extraction contents and text features of the target term text that match each other, it is explained that the target information extraction contents are similar to the target term text, and the target information extraction contents need to be replaced. And replacing the target information extraction content with target term text.
Based on the above related contents of S401-S403, it can be seen that by extracting and matching the text features of the target information extraction content and of the target term text, it can be determined whether a replaceable target term text exists for the target information extraction content. Replacing the target information extraction content with the matched target term text realizes the normalization of the information extraction content and facilitates subsequent information processing that directly uses the normalized content.
In a possible implementation manner, the embodiment of the present application further provides a text information extraction method, and referring to fig. 6, which is a flowchart of another text information extraction method provided in the embodiment of the present application, and includes S601-S610 in addition to S201-S209 described above:
s601: initializing sequence labeling models of all levels.
Based on the requirement of text information extraction, a sequence annotation model is initialized. Specifically, according to the predetermined information items to be extracted of each level, the sequence annotation model of each level is initialized correspondingly.
S602: and determining the sequence labeling model of the first level as the sequence labeling model of the current level.
When the sequence labeling model is trained, a serial processing mode is adopted, and the sequence labeling model of the first level is determined to be the sequence labeling model of the current level.
S603: inputting the text fusion characteristics of the training text into a sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the training text output by the sequence labeling model of the current level.
The training text is a text for training the sequence annotation model, and comprises standard annotation results of the information items to be extracted corresponding to the sequence annotation model of each level. The text fusion characteristics of the training text are utilized to train the sequence annotation model of each level.
Inputting the text fusion characteristics of the training text into a sequence labeling model of the current level, and labeling the information items to be extracted corresponding to the sequence labeling model of the current level on the training text by utilizing the sequence labeling model of the current level to obtain a labeling result of the training text output by the sequence labeling model of the current level.
S604: and obtaining a loss value of the sequence annotation model of the current level according to the standard annotation result of the information item to be extracted corresponding to the sequence annotation model of the current level in the training text and the annotation result of the training text output by the sequence annotation model of the current level.
And comparing the output labeling result of the training text with the standard labeling result of the information item to be extracted corresponding to the sequence labeling model of the current level in the training text, so that the labeling accuracy of the sequence labeling model can be determined. And calculating a loss value of the sequence labeling model of the current level aiming at the current training by using a standard labeling result of the information item to be extracted corresponding to the sequence labeling model of the current level in the training text and a labeling result of the training text output by the sequence labeling model of the current level. Based on the obtained loss value, the sequence labeling model of the current level can be adjusted, and training of the sequence labeling model is achieved.
S605: and judging whether a sequence labeling model of the next level exists or not.
After the sequence labeling model of the current level finishes labeling and the corresponding labeling result and loss value are obtained, whether a sequence labeling model of the next level exists is judged. Because the embodiment of the present application uses multiple levels of sequence labeling models, if the sequence labeling model of the current level is the sequence labeling model of the first level, a sequence labeling model of the next level necessarily exists, and S606 and the subsequent steps are executed. If the sequence labeling model of the current level is the sequence labeling model of the second or a later level, a sequence labeling model of the next level may or may not exist. If a sequence labeling model of the next level exists, S606 and the subsequent steps are executed; if not, S608 and the subsequent steps are executed.
S606: if the sequence labeling model of the next level exists, fusing a labeling result of the training text output by the sequence labeling model of the current level with text fusion characteristics of the training text, and obtaining the text fusion characteristics of the training text again.
And when the sequence labeling model of the next level exists, fusing the labeling result of the training text output by the obtained sequence labeling model of the current level with the training text fusion characteristic to obtain the updated text fusion characteristic of the training text.
Specifically, for example, let y_n denote the text fusion feature of the re-fused training text; then y_n = μ + ω_n, where n denotes the level of the current sequence labeling model, μ denotes the initial text fusion feature of the training text, ω_n denotes the labeling result of the sequence labeling model of the n-th level, and y_n denotes the text fusion feature of the training text obtained by fusing that labeling result with the text fusion feature of the training text.
S607: and determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the text fusion characteristics of the training text to be input into the sequence labeling model of the current level and the subsequent steps.
Changing the hierarchy of the sequence labeling model corresponding to the sequence labeling model of the current hierarchy, and determining the sequence labeling model of the next hierarchy as the sequence labeling model of the current hierarchy. After the sequence labeling model of the current level is redetermined, the step S603 and the subsequent steps are re-executed, so that labeling of the training text by using the sequence labeling model of the current level and corresponding updating of the text fusion characteristics of the training text are realized.
S608: and if the sequence labeling model of the next level does not exist, obtaining the loss value of the sequence labeling model of each level.
If the sequence labeling model of the current level is the sequence labeling model of the last level, the sequence labeling model of the next level does not exist, the labeling of the sequence labeling model is ended, and the labeling process of the training is ended. And obtaining the loss value of the sequence annotation model of each level.
S609: and weighting and adding the loss values of the sequence labeling models of all the layers to obtain comprehensive loss values, and adjusting the sequence labeling models of all the layers according to the comprehensive loss values.
In the embodiment of the application, the correlation among the sequence labeling models of all the levels is considered, so that the sequence labeling models of all the levels can be trained jointly.
The loss values of the sequence labeling models of all the levels are added in a weighted manner to obtain a comprehensive loss value. Specifically, the comprehensive loss value can be obtained by multiplying the loss value of the sequence labeling model of each level by the corresponding weight parameter and then summing the products.
Based on the obtained comprehensive loss value, the sequence labeling model of each level can be adjusted, and the training of the sequence labeling model is realized.
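The weighted addition of S609 can be sketched as follows; the weight values used here are hypothetical, since the patent does not specify how the weight parameters are chosen.

```python
def comprehensive_loss(level_losses, weights):
    """Weighted sum of the per-level sequence labeling losses.

    level_losses : loss value of each level's sequence labeling model
    weights      : weight parameter of each level (illustrative values)
    """
    assert len(level_losses) == len(weights)
    return sum(w * loss for w, loss in zip(weights, level_losses))

# Three levels with hypothetical losses and weights.
total = comprehensive_loss([0.8, 0.5, 0.3], [0.5, 0.3, 0.2])
# total = 0.8*0.5 + 0.5*0.3 + 0.3*0.2 = 0.61
```

The comprehensive loss is then used to adjust all levels together, which is what ties the models of the different levels to one another during training.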
S610: re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until the preset stop condition is reached, so as to obtain the sequence labeling models of each level generated by training.
In order to ensure the accuracy of the sequence labeling models, the sequence labeling model of each level needs to be trained multiple times. S602 and the subsequent steps are re-executed until the preset stop condition is reached, at which point training of the sequence labeling models of all the levels stops, and the sequence labeling models of each level generated by training are obtained. The preset stop condition may specifically be that the loss value meets a preset condition, or that training has reached a preset number of iterations, and may be set according to the training requirements of the sequence labeling models.
Based on the above-mentioned related content of S601-S610, it can be seen that by using training text to jointly train the sequence labeling models of all levels, sequence labeling models with more accurate labeling results can be obtained. In addition, the text fusion features of the training text input into the sequence labeling model of each level include the labeling results of the sequence labeling models of other levels, so that the sequence labeling models of all levels obtained through training have stronger correlation, which improves the accuracy of their labeling results.
In one possible implementation, text features of the text to be processed may be extracted using the ERNIE model, and part-of-speech features of the text to be processed may be extracted using the part-of-speech recognition model.
Correspondingly, the embodiment of the present application provides a specific implementation manner of extracting text features and part-of-speech features of a text to be processed with a preset length, which is shown in fig. 7, and the diagram is a flowchart for extracting text features and part-of-speech features of a text to be processed with a preset length, where the flowchart includes S701-S702:
S701: inputting the text to be processed into an ERNIE model to obtain the text features of the text to be processed; the text features of the text to be processed represent the grammar, semantics and positions of the characters in the text to be processed; the text feature of the text to be processed is a text feature vector of m×n dimensions, where m is the preset length and n is a positive integer.
The ERNIE model is a deep feature extractor based on a self-attention mechanism. The model is pretrained on a large amount of unlabeled data, so that it can understand the position, grammar, semantics and the like of characters in the general field. In the embodiment of the application, before the ERNIE model is used to extract text features of the text to be processed, the ERNIE model can be further trained on labeled text belonging to the same field as the text to be processed, so that the ERNIE model acquires the capability of extracting features from text in that specific field.
The text to be processed is input into the ERNIE model to obtain the corresponding text features. The text features of the text to be processed may characterize the grammar, semantics, and positions of the characters in the text to be processed. The text features of the text to be processed may be represented as α_{m×n}, where m is the preset length and n is a positive integer, so the text feature of the text to be processed is a text feature vector of m×n dimensions.
Specifically, the maximum text length that the ERNIE model can process is 512, so m may correspondingly be 512. The text features of the text to be processed may be represented as α_{512×768} = [α_1, ……, α_i, ……, α_512], where α_i represents the text feature corresponding to the ith character in the text to be processed, i is a positive integer less than or equal to 512, and 768 is a preset dimension parameter for text feature extraction, representing 768 features of different angles, which may be determined by the parameters of the ERNIE model.
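A minimal sketch of producing the 512×768 feature matrix α is shown below. A real implementation would call a pretrained ERNIE model; here the encoder is a dummy stand-in, and the `extract_text_features` helper and padding scheme are illustrative assumptions, not part of the patent.

```python
import numpy as np

MAX_LEN, HIDDEN = 512, 768  # the preset length m and the 768-dimensional feature size

def extract_text_features(text, encoder):
    """Produce the m x n text feature matrix alpha for one text.

    `encoder` stands in for a pretrained ERNIE model: any callable
    mapping a list of characters to per-character feature vectors.
    """
    chars = list(text)[:MAX_LEN]
    alpha = encoder(chars)                        # (len(chars), HIDDEN)
    # Pad with zero rows so every text yields a 512 x 768 matrix.
    pad = np.zeros((MAX_LEN - alpha.shape[0], HIDDEN))
    return np.vstack([alpha, pad])

# Dummy encoder for illustration only: random features per character.
rng = np.random.default_rng(0)
dummy_encoder = lambda chars: rng.standard_normal((len(chars), HIDDEN))
alpha = extract_text_features("左臂骨折", dummy_encoder)
```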
S702: inputting the text to be processed into a part-of-speech recognition model to obtain the part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are a part-of-speech feature vector of m×1 dimensions.
The part-of-speech recognition model may be an open-source tool with a part-of-speech recognition function. Specifically, the part-of-speech recognition model may be LTP (Language Technology Platform), HanLP (Han Language Processing, a Chinese language processing package), or the like.
Based on the part-of-speech recognition model, part-of-speech characteristics of the text to be processed may be determined. In particular, the part-of-speech feature may be m×1 dimensions. It should be noted that the part-of-speech feature of each character is related to the vocabulary in which the character is located, and the part-of-speech features of the characters belonging to the same vocabulary are identical.
In one possible implementation, part-of-speech recognition results of the respective characters in the text to be processed may be determined first by means of a part-of-speech recognition model. And determining the part-of-speech characteristics of the corresponding text to be processed based on the part-of-speech recognition results of the characters. For example, the part-of-speech recognition result of each character can be converted into the part-of-speech feature corresponding to each character through a predefined part-of-speech coding dictionary, and then the part-of-speech feature of the text to be processed is obtained.
The part-of-speech coding dictionary may be as shown in Table 1:

Part of speech    Encoding    Example
Noun              1           "left arm"
Verb              2           "fracture"
Preposition       3           "due to"
Punctuation       4           "。"
……                ……          ……

TABLE 1
Wherein, the codes corresponding to the parts of speech can be integers greater than 0.
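The dictionary lookup described above can be sketched as follows. The `POS_CODES` mapping mirrors Table 1, and the word segmentation is assumed to come from the part-of-speech tool; the names and the small tag set here are illustrative, since a real system would cover the tool's full tag set.

```python
# Hypothetical part-of-speech coding dictionary mirroring Table 1.
POS_CODES = {"noun": 1, "verb": 2, "preposition": 3, "punctuation": 4}

def pos_features(words_with_tags):
    """Expand word-level tags to the per-character m x 1 feature.

    Characters belonging to the same word share that word's code,
    as stated above for the part-of-speech features.
    """
    feature = []
    for word, tag in words_with_tags:
        feature.extend([POS_CODES[tag]] * len(word))
    return feature

# "左臂" (noun) + "骨折" (verb) + "。" (punctuation)
beta = pos_features([("左臂", "noun"), ("骨折", "verb"), ("。", "punctuation")])
# beta = [1, 1, 2, 2, 4] — one code per character
```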
The corresponding part-of-speech features may be represented as β_{512×1} = [β_1, ……, β_i, ……, β_512], where β_i represents the part-of-speech feature corresponding to the ith character in the text to be processed, i is a positive integer less than or equal to 512, and 1 is the dimension of the extracted part-of-speech features.
In the embodiment of the application, the text feature and the part-of-speech feature are extracted from the text to be processed by using the ERNIE model and the part-of-speech recognition model respectively, so that more accurate text fusion features can be obtained, and more accurate text information can be obtained conveniently.
Further, the embodiment of the present application provides a specific implementation manner of fusing text features and part-of-speech features of a text to be processed to obtain text fusion features of the text to be processed, which specifically includes:
mapping the part-of-speech feature vector in m x 1 dimension into part-of-speech feature vector in m x n dimension;
and fusing the part-of-speech feature vector in the m-n dimension with the text feature vector in the m-n dimension to obtain a text fusion feature of the text to be processed, wherein the text fusion feature of the text to be processed is the text fusion feature vector in the m-n dimension.
Because the dimensions of the part-of-speech feature vector are different from those of the text feature vector, the dimensions of the part-of-speech feature vector and the text feature vector need to be unified before the part-of-speech feature vector and the text feature vector are fused.
The part-of-speech feature vector of m×1 dimensions is mapped into a part-of-speech feature vector of m×n dimensions. For example, β_{512×1} = [β_1, ……, β_i, ……, β_512] is mapped to β_{512×768} = [β_1, ……, β_i, ……, β_512].
In one possible implementation, the mapping of feature vectors may be performed through the full connection layer. The activation function of the fully connected layer may be a Relu function.
After unifying the dimensions, fusing the part-of-speech feature vectors in m x n dimensions with the text feature vectors in m x n dimensions to obtain text fusion feature vectors of the text to be processed in m x n dimensions.
Taking the above part-of-speech feature vector and text feature vector as an example, the fused text fusion feature vector can be expressed as α_{512×768} + β_{512×768} = [α_1+β_1, ……, α_i+β_i, ……, α_512+β_512].
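The dimension mapping and fusion just described can be sketched as follows: a fully connected layer with a ReLU activation (mentioned above) projects the m×1 part-of-speech vector to m×n, which is then added element-wise to the text features. The weight `W` and bias `b` are illustrative stand-ins for learned parameters.

```python
import numpy as np

def project_and_fuse(alpha, beta, W, b):
    """Map the m x 1 part-of-speech vector to m x n and fuse it with alpha.

    W (1 x n) and b (n,) play the role of the fully connected layer's
    parameters; the ReLU matches the activation described above.
    """
    projected = np.maximum(beta.reshape(-1, 1) @ W + b, 0.0)  # ReLU activation
    return alpha + projected                                  # element-wise fusion

m, n = 4, 3                                     # toy sizes instead of 512 x 768
alpha = np.ones((m, n))                         # text feature matrix
beta = np.array([1.0, 1.0, 2.0, 4.0])           # per-character POS codes
W = np.full((1, n), 0.1)
b = np.zeros(n)
fused = project_and_fuse(alpha, beta, W, b)     # row i is 1 + 0.1 * beta[i]
```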
Further, the labeling result of the text to be processed output by the sequence labeling model of the current level may be an m×1-dimensional labeling result vector.
Aiming at such a situation, the embodiment of the application provides a specific implementation mode for fusing a labeling result of a text to be processed, which is output by a sequence labeling model of a current level, with a text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again, which specifically comprises the following steps:
Mapping the m-1-dimensional labeling result vector into an m-n-dimensional labeling result vector;
and fusing the m-n-dimensional labeling result vector with the m-n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
Similarly, the dimension of the labeling result vector is different from that of the text fusion feature vector, and the dimension is unified first. And mapping the m-1-dimensional labeling result vector into an m-n-dimensional labeling result vector.
In one possible implementation, the mapping of feature vectors may be performed through the full connection layer. The activation function of the fully connected layer may be a Relu function.
Specifically, γ_n represents the labeling result of the sequence labeling model of the nth level, where n represents the level number corresponding to the sequence labeling model of the current level, and γ_n = [γ_n^1, ……, γ_n^i, ……, γ_n^512], wherein γ_n^i represents the labeling result corresponding to the ith character in the text to be processed as labeled by the sequence labeling model of the current level, and i is a positive integer less than or equal to 512. The 512×1-dimensional γ_n = [γ_n^1, ……, γ_n^i, ……, γ_n^512] is mapped to a 512×768-dimensional labeling result vector.
After unifying the dimensions, fusing the m x n-dimensional labeling result vector with the m x n-dimensional text fusion feature vector to obtain the m x n-dimensional text fusion feature vector of the text to be processed.
Taking the above labeling result vector and text fusion feature vector as an example, the fused text fusion feature vector (x_n)_{512×768} may be expressed as:

(x_n)_{512×768} = [α_1+β_1+γ_n^1, ……, α_i+β_i+γ_n^i, ……, α_512+β_512+γ_n^512]

wherein γ_n represents the labeling result vector of the sequence labeling model of the nth level, n represents the level number corresponding to the sequence labeling model of the current level, i represents the ith character in the text to be processed, and i is a positive integer less than or equal to 512.
In the embodiment of the application, the dimension unification and fusion are carried out on the labeling result vector and the text fusion feature vector, so that the text fusion feature vector can be updated, the sequence labeling model can be labeled conveniently by using the updated text fusion feature vector, and a more accurate labeling result can be obtained.
Based on the text information extraction method provided by the above method embodiment, the present application further provides a text information extraction device, and the text information extraction device will be described with reference to the accompanying drawings.
Referring to fig. 8, the structure of a text information extraction device according to an embodiment of the present application is shown. As shown in fig. 8, the text information extracting apparatus includes:
an extracting unit 801, configured to extract text features and part-of-speech features of a text to be processed with a preset length;
a first fusion unit 802, configured to fuse the text feature and the part-of-speech feature of the text to be processed, so as to obtain a text fusion feature of the text to be processed;
A first determining unit 803, configured to determine the sequence annotation model of the first level as the sequence annotation model of the current level;
a first labeling unit 804, configured to input a text fusion feature of the text to be processed into the sequence labeling model of the current level, and label an information item to be extracted corresponding to the sequence labeling model of the current level, so as to obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
a first judging unit 805 configured to judge whether a sequence annotation model of a next level exists;
a second fusion unit 806, configured to fuse, if a sequence labeling model of a next level exists, a labeling result of the text to be processed output by the sequence labeling model of the current level with a text fusion feature of the text to be processed, so as to obtain a text fusion feature of the text to be processed again;
a second determining unit 807, configured to determine the sequence labeling model of the next level as the sequence labeling model of the current level, and re-execute the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
a first obtaining unit 808, configured to obtain, if there is no sequence labeling model of a next level, labeling results of the text to be processed output by the sequence labeling models of each level;
And the analyzing unit 809 is configured to analyze the labeling result of the to-be-processed text output by the sequence labeling model of each level, and obtain information extraction contents of to-be-extracted information items of different levels included in the to-be-processed text.
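The labeling loop implemented by units 803-810 above can be sketched end to end. The toy models and the `project` callable are stand-ins for the trained sequence labeling models and the fully connected mapping described earlier; all names here are illustrative.

```python
import numpy as np

def cascade_label(fusion_feature, level_models, project):
    """Run the hierarchical labeling loop of the extraction device.

    level_models : sequence labeling models ordered from the first level;
                   each maps an m x n feature matrix to an m-element label vector
    project      : maps a label vector back to m x n (the fully connected mapping)
    Returns the labeling result of every level, for the analyzing unit to parse.
    """
    results = []
    for model in level_models:
        labels = model(fusion_feature)            # label this level's items
        results.append(labels)
        # Fuse the result back into the features for the next level.
        fusion_feature = fusion_feature + project(labels)
    return results

# Toy stand-ins: two "models" that threshold the per-character feature sums.
m, n = 4, 3
feat = np.ones((m, n))
models = [lambda f: (f.sum(axis=1) > 0).astype(float)] * 2
project = lambda labels: np.repeat(labels.reshape(-1, 1), n, axis=1)
all_results = cascade_label(feat, models, project)  # one result per level
```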
In one possible implementation, the apparatus further includes:
the processing unit is used for filtering redundant information and desensitizing sensitive information on the original text to obtain a first target text;
the segmentation unit is used for segmenting the first target text into a plurality of second target texts with the length smaller than or equal to the preset length if the length of the first target text is larger than the preset length, and complementing the length of the second target text to the preset length to generate a text to be processed;
the filling unit is used for filling the length of the first target text to the preset length if the length of the first target text is smaller than the preset length, and generating a text to be processed;
and the third determining unit is used for determining the first target text as the text to be processed if the length of the first target text is equal to the preset length.
In one possible implementation, the apparatus further includes:
A second obtaining unit, configured to obtain a text feature of a target information extraction content and a text feature of a target term text, where the target information extraction content is any item of the information extraction content, and the target term text is any item of a predetermined term text;
the matching unit is used for matching the text characteristics of the target information extraction content with the text characteristics of the target term text;
and the replacing unit is used for replacing the target information extraction content with the target term text if the text characteristics of the target information extraction content are matched with the text characteristics of the target term text.
In one possible implementation, the apparatus further includes:
the initialization unit is used for initializing sequence annotation models of all levels;
a fourth determining unit, configured to determine the sequence annotation model of the first level as the sequence annotation model of the current level;
the second labeling unit is used for inputting the text fusion characteristics of the training text into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining the labeling result of the training text output by the sequence labeling model of the current level;
The first execution unit is used for obtaining a loss value of the sequence annotation model of the current level according to a standard annotation result of an information item to be extracted corresponding to the sequence annotation model of the current level in the training text and an annotation result of the training text output by the sequence annotation model of the current level;
the second judging unit is used for judging whether a sequence labeling model of the next level exists or not;
a third fusion unit, configured to fuse, if a sequence labeling model of a next level exists, a labeling result of the training text output by the sequence labeling model of the current level with the text fusion features of the training text, to obtain the text fusion features of the training text again;
the second execution unit is used for determining the sequence annotation model of the next level as the sequence annotation model of the current level, and re-executing the text fusion characteristics of the training text to be input into the sequence annotation model of the current level and the subsequent steps;
the third acquisition unit is used for acquiring the loss value of the sequence annotation model of each level if the sequence annotation model of the next level does not exist;
the adjusting unit is used for weighting and adding the loss values of the sequence labeling models of all the layers to obtain comprehensive loss values, and adjusting the sequence labeling models of all the layers according to the comprehensive loss values;
and the third execution unit, configured to re-execute the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until the preset stop condition is reached, so as to obtain the sequence labeling models of each level generated by training.
In one possible implementation manner, the number of layers of the sequence labeling model and the information items to be extracted corresponding to the sequence labeling models of all the layers are predetermined according to the layers of the information items to be extracted.
In one possible implementation, the extracting unit 801 includes:
the first input subunit is used for inputting a text to be processed with a preset length into an ERNIE model to obtain text characteristics of the text to be processed; the text characteristics of the text to be processed represent grammar, semantics and positions of characters in the text to be processed; the text feature of the text to be processed is a text feature vector in m x n dimensions, wherein m is the preset length, and n is a positive integer;
and the second input subunit is used for inputting the text to be processed into a part-of-speech recognition model to obtain part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are part-of-speech feature vectors with m 1 dimensions.
In one possible implementation, the first fusing unit 802 includes:
a mapping subunit, configured to map the part-of-speech feature vector in m×1 dimension to a part-of-speech feature vector in m×n dimension;
and the fusion subunit is used for fusing the part-of-speech feature vector in the m x n dimension with the text feature vector in the m x n dimension to obtain the text fusion feature of the text to be processed, wherein the text fusion feature of the text to be processed is the text fusion feature vector in the m x n dimension.
In one possible implementation manner, the labeling result of the text to be processed output by the sequence labeling model of the current level is a labeling result vector with m×1 dimensions;
the second fusion unit 806 is specifically configured to map the m×1-dimensional labeling result vector to an m×n-dimensional labeling result vector; and fusing the m-n-dimensional labeling result vector with the m-n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
In addition, the embodiment of the application also provides a text information extraction device, which comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text information extraction method according to any one of the embodiments above.
In addition, an embodiment of the present application further provides a computer readable storage medium, where instructions are stored, where when the instructions are executed on a terminal device, the terminal device is caused to perform the text information extraction method according to any one of the embodiments above.
According to the text information extraction device and equipment provided above, the text features and part-of-speech features of the text to be processed are extracted, so that more comprehensive feature information of the text to be processed can be obtained. The part-of-speech features help determine the information extraction contents more accurately, improving the accuracy of the obtained information extraction contents. The text fusion features obtained by fusing the text features and the part-of-speech features are input into the sequence labeling model of the first level, so that the information items to be extracted corresponding to the current level can be labeled. The obtained labeling result is then fused with the text fusion features to obtain updated text fusion features. By replacing the sequence labeling model of the current level in turn, the sequence labeling models of all levels can perform labeling sequentially, yielding the labeling results of every level. Because the labeling result of each level is fused with the text fusion features and used as the input of the next level's sequence labeling model, each model can label on the basis of the previous level's results, which improves the accuracy of the labeling results. According to the labeling results of the sequence labeling models of all levels, the information extraction contents of the information items to be extracted at different levels in the text to be processed can be obtained, realizing multi-level extraction that covers both the information extraction contents themselves and the relationships between them.
In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents having various meanings can be achieved. Therefore, more accurate text information of the text to be processed is obtained on the basis of automatically extracting the text information.
It should be noted that, in the present description, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another. For the system or device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple, and for relevant points reference may be made to the description of the method section.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A text information extraction method, the method comprising:
extracting text features and part-of-speech features of a text to be processed with a preset length;
fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting the text fusion characteristics of the text to be processed into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the text to be processed, which is output by the sequence labeling model of the current level;
Judging whether a sequence labeling model of the next level exists or not;
if a sequence labeling model of the next level exists, fusing a labeling result of the text to be processed output by the sequence labeling model of the current level with text fusion characteristics of the text to be processed, and obtaining the text fusion characteristics of the text to be processed again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the text fusion characteristics of the text to be processed and inputting the text fusion characteristics into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, labeling results of the text to be processed, which are output by the sequence labeling models of all levels, are obtained;
analyzing the labeling results of the text to be processed, which are output by the sequence labeling models of all the layers, and obtaining information extraction contents of information items to be extracted of different layers, which are included in the text to be processed.
2. The method of claim 1, wherein prior to extracting text features and part-of-speech features of the predetermined length of text to be processed, the method further comprises:
filtering redundant information and desensitizing sensitive information on an original text to obtain a first target text;
If the length of the first target text is greater than a preset length, the first target text is segmented into a plurality of second target texts with the length smaller than or equal to the preset length, and the length of the second target text is complemented to the preset length to generate a text to be processed;
if the length of the first target text is smaller than the preset length, the length of the first target text is complemented to the preset length, and a text to be processed is generated;
and if the length of the first target text is equal to the preset length, determining the first target text as a text to be processed.
3. The method according to claim 1, wherein after obtaining the information extraction content of the information items to be extracted of different levels included in the text to be processed, the method further comprises:
acquiring text characteristics of target information extraction content and text characteristics of target term text, wherein the target information extraction content is any item of the information extraction content, and the target term text is any item of predetermined term text;
matching the text characteristics of the target information extraction content with the text characteristics of the target term text;
And if the text characteristics of the target information extraction content are matched with the text characteristics of the target term text, replacing the target information extraction content with the target term text.
4. The method according to claim 1, wherein the method further comprises:
initializing a sequence labeling model for each level;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting the text fusion features of a training text into the sequence labeling model of the current level, labeling the information items to be extracted corresponding to the sequence labeling model of the current level, and obtaining a labeling result of the training text output by the sequence labeling model of the current level;
obtaining a loss value of the sequence labeling model of the current level according to a standard labeling result, in the training text, of the information item to be extracted corresponding to the sequence labeling model of the current level and the labeling result of the training text output by the sequence labeling model of the current level;
judging whether a sequence labeling model of a next level exists;
if a sequence labeling model of the next level exists, fusing the labeling result of the training text output by the sequence labeling model of the current level with the text fusion features of the training text to obtain the text fusion features of the training text again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the training text into the sequence labeling model of the current level and the subsequent steps;
if no sequence labeling model of a next level exists, obtaining the loss values of the sequence labeling models of all the levels;
weighting and summing the loss values of the sequence labeling models of all the levels to obtain a comprehensive loss value, and adjusting the sequence labeling models of all the levels according to the comprehensive loss value;
and re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until a preset stopping condition is reached, thereby obtaining the trained sequence labeling models of all the levels.
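The level-by-level pass of the training loop in claim 4 can be sketched as below. The per-level models, the fusion function, and the toy mismatch loss are all placeholders; the patent does not prescribe a model architecture or a loss function, only that the per-level loss values are weighted and added into a comprehensive loss.

```python
def comprehensive_loss(models, fuse, text_features, gold_labels, weights):
    """Run the level-by-level labeling pass of claim 4 on one training
    text and return the weighted sum of the per-level loss values."""
    features = text_features
    losses = []
    for level, model in enumerate(models):
        predicted = model(features)      # labeling result at this level
        gold = gold_labels[level]        # standard labeling result
        # Toy loss: fraction of positions labeled differently from the gold.
        losses.append(sum(p != g for p, g in zip(predicted, gold)) / len(gold))
        if level + 1 < len(models):      # a next-level model exists
            features = fuse(predicted, features)
    # Weighted addition of the per-level losses gives the comprehensive
    # loss used to adjust the sequence labeling models of all levels.
    return sum(w * l for w, l in zip(weights, losses))
```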
5. The method according to claim 1 or 4, wherein the number of levels of sequence labeling models and the information items to be extracted corresponding to the sequence labeling model of each level are predetermined according to the levels of the information items to be extracted.
6. The method according to claim 1, wherein extracting text features and part-of-speech features of the text to be processed of a preset length comprises:
inputting the text to be processed of the preset length into an ERNIE model to obtain the text features of the text to be processed, wherein the text features of the text to be processed represent the grammar, semantics and positions of characters in the text to be processed, and the text features of the text to be processed are an m x n-dimensional text feature vector, m being the preset length and n being a positive integer;
and inputting the text to be processed into a part-of-speech recognition model to obtain the part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are an m x 1-dimensional part-of-speech feature vector.
7. The method according to claim 6, wherein fusing the text features and the part-of-speech features of the text to be processed to obtain the text fusion features of the text to be processed comprises:
mapping the m x 1-dimensional part-of-speech feature vector into an m x n-dimensional part-of-speech feature vector;
and fusing the m x n-dimensional part-of-speech feature vector with the m x n-dimensional text feature vector to obtain the text fusion features of the text to be processed, wherein the text fusion features of the text to be processed are an m x n-dimensional text fusion feature vector.
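The shape bookkeeping of claims 6 and 7 can be illustrated as follows. The learned projection matrix and the element-wise additive fusion are assumptions made for this sketch; the claims only require some mapping from m x 1 to m x n and some fusion that yields an m x n fused feature vector.

```python
import numpy as np

def fuse_features(text_features, pos_features, projection):
    """Map the m x 1 part-of-speech vector to m x n and fuse it with the
    m x n text features (claims 6-7 sketch)."""
    m, n = text_features.shape
    assert pos_features.shape == (m, 1)
    mapped = pos_features @ projection   # (m, 1) @ (1, n) -> (m, n)
    fused = text_features + mapped       # m x n text fusion features
    assert fused.shape == (m, n)
    return fused
```

The same shape logic applies when an m x 1 labeling result vector is fused back into the features between levels.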
8. The method according to claim 7, wherein the labeling result of the text to be processed output by the sequence labeling model of the current level is an m x 1-dimensional labeling result vector;
fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion features of the text to be processed to obtain the text fusion features of the text to be processed again comprises:
mapping the m x 1-dimensional labeling result vector into an m x n-dimensional labeling result vector;
and fusing the m x n-dimensional labeling result vector with the m x n-dimensional text fusion feature vector to obtain the text fusion features of the text to be processed again.
9. A text information extraction apparatus, characterized in that the apparatus comprises:
an extraction unit, configured to extract text features and part-of-speech features of a text to be processed of a preset length;
a first fusion unit, configured to fuse the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
a first determining unit, configured to determine the sequence labeling model of the first level as the sequence labeling model of the current level;
a first labeling unit, configured to input the text fusion features of the text to be processed into the sequence labeling model of the current level, label the information items to be extracted corresponding to the sequence labeling model of the current level, and obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
a first judging unit, configured to judge whether a sequence labeling model of a next level exists;
a second fusion unit, configured to, if a sequence labeling model of the next level exists, fuse the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion features of the text to be processed to obtain the text fusion features of the text to be processed again;
a second determining unit, configured to determine the sequence labeling model of the next level as the sequence labeling model of the current level, and re-execute the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
a first acquisition unit, configured to, if no sequence labeling model of a next level exists, acquire the labeling results of the text to be processed output by the sequence labeling models of all the levels;
and an analysis unit, configured to analyze the labeling results of the text to be processed output by the sequence labeling models of all the levels to obtain information extraction content of the information items to be extracted at different levels included in the text to be processed.
10. A text information extraction device, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the text information extraction method according to any one of claims 1-8 when executing the computer program.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when run on a terminal device, cause the terminal device to perform the text information extraction method according to any one of claims 1-8.
CN202110707811.5A 2021-06-24 2021-06-24 Text information extraction method, device and equipment Active CN113408296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707811.5A CN113408296B (en) 2021-06-24 2021-06-24 Text information extraction method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707811.5A CN113408296B (en) 2021-06-24 2021-06-24 Text information extraction method, device and equipment

Publications (2)

Publication Number Publication Date
CN113408296A CN113408296A (en) 2021-09-17
CN113408296B true CN113408296B (en) 2024-02-13

Family

ID=77683146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707811.5A Active CN113408296B (en) 2021-06-24 2021-06-24 Text information extraction method, device and equipment

Country Status (1)

Country Link
CN (1) CN113408296B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401381B (en) * 2023-06-07 2023-08-04 神州医疗科技股份有限公司 Method and device for accelerating extraction of medical relations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020119075A1 (en) * 2018-12-10 2020-06-18 Ping An Technology (Shenzhen) Co., Ltd. General text information extraction method and apparatus, computer device and storage medium
CN111859968A (en) * 2020-06-15 2020-10-30 Shenzhen Aerospace Science and Technology Innovation Industry Co., Ltd. Text structuring method, text structuring device and terminal equipment
WO2021051871A1 (en) * 2019-09-18 2021-03-25 Ping An Technology (Shenzhen) Co., Ltd. Text extraction method, apparatus, and device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020119075A1 (en) * 2018-12-10 2020-06-18 Ping An Technology (Shenzhen) Co., Ltd. General text information extraction method and apparatus, computer device and storage medium
WO2021051871A1 (en) * 2019-09-18 2021-03-25 Ping An Technology (Shenzhen) Co., Ltd. Text extraction method, apparatus, and device, and storage medium
CN111859968A (en) * 2020-06-15 2020-10-30 Shenzhen Aerospace Science and Technology Innovation Industry Co., Ltd. Text structuring method, text structuring device and terminal equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Sheng; Li Wei; Zhang Jian. Resume information entity extraction method based on deep learning. Computer Engineering and Design. 2018, (12), 1-11. *

Also Published As

Publication number Publication date
CN113408296A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
WO2020211720A1 (en) Data processing method and pronoun resolution neural network training method
CN113011533A (en) Text classification method and device, computer equipment and storage medium
Lei et al. From natural language specifications to program input parsers
US20200311115A1 (en) Method and system for mapping text phrases to a taxonomy
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN109003677B (en) Structured analysis processing method for medical record data
WO2021046536A1 (en) Automated information extraction and enrichment in pathology report using natural language processing
US11170169B2 (en) System and method for language-independent contextual embedding
CN114036955B (en) Detection method for headword event argument of central word
CN111274829A (en) Sequence labeling method using cross-language information
Gildea et al. Human languages order information efficiently
CN114021573B (en) Natural language processing method, device, equipment and readable storage medium
JP2005181928A (en) System and method for machine learning, and computer program
CN113408296B (en) Text information extraction method, device and equipment
Wong et al. isentenizer-: Multilingual sentence boundary detection model
JP2005208782A (en) Natural language processing system, natural language processing method, and computer program
CN112749277A (en) Medical data processing method and device and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN114021572B (en) Natural language processing method, device, equipment and readable storage medium
KR102518895B1 (en) Method of bio information analysis and storage medium storing a program for performing the same
CN111506764B (en) Audio data screening method, computer device and storage medium
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium
Behera An Experiment with the CRF++ Parts of Speech (POS) Tagger for Odia.
CN113761875A (en) Event extraction method and device, electronic equipment and storage medium
AU2021106441A4 (en) Method, System and Device for Extracting Compound Words of Pathological location in Medical Texts Based on Word-Formation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant