CN113033204A - Information entity extraction method and device, electronic equipment and storage medium - Google Patents

Information entity extraction method and device, electronic equipment and storage medium

Info

Publication number
CN113033204A
CN113033204A
Authority
CN
China
Prior art keywords
text
information entity
original data
text block
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110313303.9A
Other languages
Chinese (zh)
Inventor
黄进然
林璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wondfo Biotech Co Ltd
Original Assignee
Guangzhou Wondfo Biotech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wondfo Biotech Co Ltd filed Critical Guangzhou Wondfo Biotech Co Ltd
Priority to CN202110313303.9A priority Critical patent/CN113033204A/en
Publication of CN113033204A publication Critical patent/CN113033204A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides an information entity extraction method and device, electronic equipment and a storage medium. An original data text is obtained and sequentially partitioned into at least one text block; the at least one text block is processed by a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities.

Description

Information entity extraction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data analysis, in particular to an information entity extraction method and device, electronic equipment and a storage medium.
Background
As an important branch of the field of natural language processing, the main function of information extraction is to extract specific factual information from natural language text, helping people automatically and quickly find the information they really need among large amounts of data and thereby meet the challenge brought by the information explosion. Information entity extraction is the most practical technique in information extraction; its main task is to identify and classify proper names and meaningful quantity phrases appearing in text.
At present, the mainstream method in industry for information entity extraction is sequence labeling: each word in a text may take one of several candidate category labels, where each label encodes the position of the word within each type of information entity. Each word in the text is automatically labeled (i.e., classified) in sequence, and the resulting labels are then integrated to finally obtain information entities composed of multiple words, together with their categories.
However, for longer texts the number of possible label sequences grows rapidly, so the sequence labeling method suffers from poor recognition effect and low recognition efficiency.
Disclosure of Invention
The embodiment of the application provides an information entity extraction method, an information entity extraction device, electronic equipment and a storage medium, and aims to solve the problems of low identification efficiency and accuracy in the prior art.
In a first aspect, an embodiment of the present application provides an information entity extraction method, including:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Optionally, the partitioning the original data text to obtain at least one text block that is ordered and semantically continuous includes:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
Optionally, the sequentially blocking the original data text to obtain at least one text block includes:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
Optionally, the sorting and semantic continuity processing the at least one short text to obtain the at least one text block includes:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
Optionally, the method further comprises:
special identifiers are added at the beginning and the end of each text block.
Optionally, the processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity included in the original data text includes:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
Optionally, the performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block includes:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
Optionally, before the processing the at least one text block according to the pre-constructed information entity extraction model to obtain the at least one information entity included in the original data text, the method further includes:
acquiring a sample data text;
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes;
and carrying out model training according to the training data set to obtain at least one information entity extraction model.
Optionally, the marking the sample data text according to the target information entity to obtain a training data set, including:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
In a second aspect, an embodiment of the present application provides an information entity extraction apparatus, including:
the acquisition module is used for acquiring an original data text;
the processing module is used for sequentially blocking the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the information entity extraction method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the information entity extraction method according to the first aspect.
According to the information entity extraction method and device, the electronic equipment and the storage medium, an original data text is obtained and sequentially partitioned into at least one text block; the at least one text block is processed by a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities. Because the strategies of sequentially partitioning the original data text and of merge-predict-restore are adopted in the extraction process, the complexity and workload of the model are reduced while the extraction efficiency and extraction accuracy of the information entities are improved.
Drawings
Fig. 1 is a schematic flowchart of an information entity extraction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a principle of calculating conditional probability by a CRF algorithm according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the principle of labeled sequence prediction based on the CRF algorithm and the Viterbi algorithm according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an information entity extraction apparatus according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
The main ideas of the technical scheme are as follows. Based on the technical problems in the prior art, the embodiment of the present application provides a technical scheme for information entity extraction that converts the problem of information entity extraction into a problem of sequence labeling prediction: first, the original data text is sequentially partitioned into blocks; then, an optimal label sequence is found for each text block by an information entity extraction model pre-constructed with the conditional random field (CRF) algorithm and the Viterbi algorithm; finally, the correct information entities are determined through rule-based post-processing. Splitting a long text into text blocks through sequential blocking and then performing information entity extraction on the blocks with the model effectively reduces the complexity of the model and also improves its prediction accuracy and operating efficiency. In addition, in the choice of model strategy, the embodiment of the application adopts a merge-predict-restore method: several similar attributes are merged into one coarse-grained attribute, all information entities of the coarse-grained attribute are extracted by the prediction model, and category inference then splits and restores the coarse-grained attribute into fine-grained attributes. A method of first extracting the prefix together with the attribute information entity and then removing the prefix by rule is also adopted, which improves the prediction precision of the model while reducing its complexity and workload.
The technical solutions of the present application will be described below by taking an example of extracting relevant attribute information from bidding data, and it should be understood that the technical solutions of the embodiments of the present application can also be used in other scenarios.
Example one
Fig. 1 is a schematic flow chart of an information entity extraction method provided in an embodiment of the present application, where the method of the present embodiment may be executed by an information entity extraction device provided in the embodiment of the present application, and the device may be implemented in a software and/or hardware manner and may be integrated in an electronic device such as a server and an intelligent terminal. As shown in fig. 1, the information entity extraction method of this embodiment includes:
s101, acquiring an original data text.
In this embodiment, the original data text refers to a text to be subjected to information entity extraction, and is a data basis for performing information entity extraction. In order to achieve the acquisition of the original data, in this embodiment, a data acquisition probe may be set on a related platform or a website in advance, and the original data text is obtained by collecting and sorting data returned by the data acquisition probe.
Optionally, on the basis of collection and sorting, in this embodiment, some simple pre-processing may be performed on the data returned by the data acquisition probe, including removing special symbols such as spaces, tab symbols, line feed symbols, etc., converting english symbols to chinese symbols, converting full-angle symbols to half-angle symbols, etc., so that the obtained original data text can meet the subsequent analysis and use requirements.
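The pre-processing described above can be sketched in Python; the exact symbol set and the restriction of full-width folding to letters and digits are illustrative assumptions, not the patent's specification.

```python
def to_half_width(ch: str) -> str:
    # Fold full-width (full-angle) digits and letters to half-width (half-angle);
    # full-width forms occupy U+FF10-FF19 (digits) and U+FF21-FF5A (letters).
    code = ord(ch)
    if 0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return chr(code - 0xFEE0)
    return ch

def preprocess(raw: str) -> str:
    # Remove special symbols such as spaces, tab symbols and line feed symbols.
    cleaned = "".join(ch for ch in raw if ch not in " \t\r\n\u3000")
    # Convert full-angle symbols to half-angle symbols.
    return "".join(to_half_width(ch) for ch in cleaned)
```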
S102, sequentially blocking the original data text to obtain at least one text block.
Because the CRF algorithm has poor training and prediction effects on long texts, in this embodiment, the obtained original data texts are sequentially partitioned, that is, the original data texts are sequentially and gradually partitioned into some shorter text blocks without affecting semantic continuity.
Optionally, in this step, the original data text may be segmented and/or sentence-divided according to the paragraph identifier or the sentence end identifier to obtain at least one short text, and then the at least one split short text is subjected to sequencing and semantic continuity processing to obtain at least one text block.
In one possible implementation, the ordered partitioning of the original data text may be implemented by the following specific steps:
(1) the bidding information content is divided into a plurality of texts at paragraph item symbols (such as "一、", "1.", "1.1", "(1)"); for convenience of distinction, the texts obtained by this division are called paragraph texts;
(2) for each paragraph text, it is judged in turn whether its character length exceeds a set threshold, such as 100 characters. A paragraph text longer than the threshold is further split using sentence-end identifiers (such as the period) as segmentation symbols; paragraph texts whose character length is less than or equal to the threshold are not split further. For convenience of distinction, the texts obtained in this step are called short texts;
(3) the short texts obtained in step (2) are first sorted according to the order in which they appear in the original data text; semantic judgment is then performed to determine whether the tail keyword of a target short text is part of a target information entity, and if so, that short text is merged with the next one. This yields ordered short texts whose semantic continuity is preserved; for convenience of distinction, the texts obtained in this step are called text blocks.
The target short text refers to the short text currently undergoing the semantic judgment, and can be any short text. The target information entity refers to an information entity to be extracted, which can be determined in advance according to requirements, such as item name, item number, unit name, address, contact person, contact information, purchase amount, purchase method, product name, quantity, amount, and the like.
Illustratively, if the tail keyword of a short text is "procurement budget:" and the next short text is "300,000 yuan", the two short texts need to be merged into "procurement budget: 300,000 yuan", thus preserving semantic continuity.
In order to facilitate subsequent processing, in this step, a sequence number may be further marked on the obtained at least one text block according to the sequence of the arrangement.
Optionally, since the CRF algorithm has a poor prediction effect when an information entity appears at the head or tail of a text, in order to improve the prediction accuracy of the CRF algorithm, in this embodiment a special text identifier may further be added at the beginning and end of each text block, so that information entities appear in the middle of the block (instead of at its head or tail). For example, an "@" is added before the beginning of the sentence and another "@" after the end of the sentence.
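A minimal Python sketch of steps (1) through (3) plus the sentinel markers; the paragraph-symbol pattern, the 100-character threshold, the "@" sentinel and the colon-based tail-keyword test are illustrative assumptions.

```python
import re

SENT_END = "。"              # sentence-end identifier used as the secondary split symbol
MAX_LEN = 100                # character-length threshold from step (2)
PREFIX_TAILS = ("：", ":")   # tail keywords treated as part of an unfinished entity (assumption)

def block_text(raw: str) -> list[str]:
    # Step (1): split into paragraph texts on paragraph item symbols such as "一、", "1.", "(1)".
    paragraphs = [p for p in re.split(r"(?:[一二三四五六七八九十]、|\d+(?:\.\d+)*[、.．]|\(\d+\))", raw) if p]
    # Step (2): further split over-long paragraphs at sentence-end identifiers.
    shorts = []
    for p in paragraphs:
        shorts += [s for s in re.split(f"(?<={SENT_END})", p) if s] if len(p) > MAX_LEN else [p]
    # Step (3): the shorts are already in original order; merge a short text into its
    # predecessor when the predecessor's tail keyword signals an unfinished entity.
    blocks: list[str] = []
    for s in shorts:
        if blocks and blocks[-1].endswith(PREFIX_TAILS):
            blocks[-1] += s
        else:
            blocks.append(s)
    # Add a special identifier at the beginning and end of each text block.
    return [f"@{b}@" for b in blocks]
```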
S103, processing at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text.
In this embodiment, in order to extract required data from an original data text, various information entity extraction models are constructed in advance according to a CRF algorithm and a Viterbi algorithm.
Since the original data text usually includes many attribute information, such as bidding information text, it usually includes item number, item name, bidding unit address, bidding unit contact, agency name, agency address, agency contact, winning unit name, winning unit address, winning unit contact, purchasing amount, purchasing method, purchasing name, specification, quantity, unit price, price quote, brand, manufacturer, etc. If the information entities of all attributes are extracted simultaneously through one model, the complexity of the model is increased, and the prediction precision is also reduced; if a specific model is established for each attribute individually, there are N models with N attributes, and each model extracts the attribute with the finest granularity, which increases the training workload of the model and may not have high prediction accuracy. For example, the attribute "contact address of a bidding unit" generally appears only in one place of the bidding information, but the "contact address" may appear in a plurality of places of the bidding information, which may cause much interference to the extraction of the attribute "contact address of a bidding unit".
To solve this problem, in this embodiment, similar attributes are merged when the information entity extraction models are constructed. For example, the attributes bidding-unit contact information, agency contact information and winning-unit contact information are merged into the coarse-grained attribute contact information, and a model is then established for each coarse-grained attribute. When information entity extraction is performed, all information entities corresponding to a coarse-grained attribute are extracted by its model, which not only reduces the complexity of the models but also preserves their prediction accuracy.
In a possible implementation manner, in this embodiment, a training data set is obtained by obtaining a sample data text, labeling the sample data text according to a target information entity, and performing model training according to the training data set to obtain at least one information entity extraction model.
The target information entity is obtained by combining information entities with the same type of attributes, namely the coarse-grained attributes, such as the contact information.
Since the content format of most original data texts is relatively fixed, the writing format of the prefix string of an attribute to be extracted is also relatively fixed; for example, the name of the bidding unit is generally preceded by "bidding unit:" or similar. If the prefix keyword is extracted together with the information entity as a single entity, the complexity of the model and the prediction error can be effectively reduced. Therefore, when marking the sample data text, the prefix keywords of the target information entities and the target information entities themselves can be marked together to obtain the training data set.
Illustratively, related prefix keywords are added to labels of training data of the entity extraction model of attribute information such as item names, item numbers, unit names, addresses, contacts, contact ways, purchase amounts, purchase ways, product names, quantities, amounts and the like, and the labels are labeled together to obtain a training data set corresponding to the model.
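As an illustration of marking the prefix keyword together with the attribute value, the following hypothetical helper produces one labeling sequence of the kind consumed by the sequence model (the label names follow the B/I/E/O convention described later; the span-location logic is an assumption, not the patent's annotation tooling):

```python
def bieo_labels(text: str, entity: str) -> list[str]:
    # Mark `entity` (prefix keyword + attribute value, as one span) inside
    # `text`; every other character gets the 'O' (outside) label.
    labels = ["O"] * len(text)
    start = text.find(entity)
    if start != -1 and len(entity) >= 2:
        end = start + len(entity) - 1
        labels[start] = "B"                 # head of the information entity
        for i in range(start + 1, end):
            labels[i] = "I"                 # inside of the information entity
        labels[end] = "E"                   # tail of the information entity
    return labels
```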
It can be understood that, in this embodiment, the number of the information entity extraction models trained by the above method is consistent with the number of the coarse-grained attributes, and each information entity extraction model is used for performing information entity extraction on one coarse-grained attribute.
In this embodiment, according to different functions, the information entity extraction model may be divided into a feature coding module, a sequence labeling prediction module, and an information entity extraction module, where the feature coding module is configured to perform feature coding on a text block to obtain a two-dimensional dictionary list corresponding to each text block, the sequence labeling prediction module is configured to perform sequence labeling prediction on the two-dimensional dictionary list to obtain a target labeling sequence (optimal labeling sequence) of each text block, and the information entity extraction module is configured to perform character extraction from the two-dimensional dictionary list according to the target labeling sequence to obtain an information entity included in each text block. The following will respectively explain the implementation principle of each module:
(1) Feature encoding module
Since the CRF algorithm uses context information in sequence labeling prediction, it is necessary to perform feature coding on each text block, convert the text block into a two-dimensional dictionary list, and determine a previous character and a next character in advance for each character in the text block.
In one possible implementation, in this embodiment, for each text block, denote the characters in the block by $s_i$ ($i = 1, 2, \ldots, n$, where $n$ is the length of the input text block); feature coding can then be performed by the following rules:
(a) if the current character is a middle character of the text block, i.e. $1 < i < n$, it is encoded by the rule: current character → [current character, previous character + current character, current character + next character], i.e. $s_i \to [s_i,\ s_{i-1}s_i,\ s_i s_{i+1}]$;
(b) if the current character is the first character of the text, i.e. $i = 1$, and its previous character is denoted <start>, it is encoded by the rule: current character → [current character, <start>, current character + next character], i.e. $s_1 \to [s_1,\ \langle start\rangle,\ s_1 s_2]$;
(c) if the current character is the last character of the text, i.e. $i = n$, and its next character is denoted <end>, it is encoded by the rule: current character → [current character, previous character + current character, <end>], i.e. $s_n \to [s_n,\ s_{n-1}s_n,\ \langle end\rangle]$.
Illustratively, as shown in Table 1, taking the text block "项目采购单位：人民医院" ("item purchasing unit: People's Hospital") as an example, after feature coding it is converted into the two-dimensional dictionary list [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]].
TABLE 1
Text block: 项目采购单位：人民医院
Feature coding: [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]]
it should be noted that, in this embodiment, punctuation marks in the text block, such as colon (:), also participate in feature coding.
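Rules (a) through (c) above can be sketched as a single Python function (a minimal sketch; the function name is illustrative):

```python
def encode_features(block: str) -> list[list[str]]:
    # Convert a text block into the two-dimensional dictionary list: each
    # character s_i becomes [s_i, s_{i-1}+s_i, s_i+s_{i+1}], with the <start>
    # and <end> markers standing in for the missing neighbours at the edges.
    n = len(block)
    encoded = []
    for i, ch in enumerate(block):
        prev_feat = "<start>" if i == 0 else block[i - 1] + ch
        next_feat = "<end>" if i == n - 1 else ch + block[i + 1]
        encoded.append([ch, prev_feat, next_feat])
    return encoded
```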
After feature encoding, each text block is converted into a corresponding two-dimensional dictionary list. Assuming the word sequence obtained by encoding each character $s_i$ is denoted $x_i$, the text block $S = [s_1, s_2, \ldots, s_n]$ yields after feature coding the two-dimensional dictionary list $X = [x_1, x_2, \ldots, x_n]$.
(2) Sequence labeling prediction module
Illustratively, in this embodiment, the BIEO (full name: begin, intermediate, end, other) labeling method is adopted to label each character in the two-dimensional dictionary list. Here B denotes the head of an information entity, I the inside of an information entity, E the tail of an information entity, and O any character that is not part of an information entity. The labeled two-dimensional dictionary list is thereby converted into a labeling sequence over the four letters B, I, E, O, e.g., [O, O, B, I, I, …, E]; for convenience of distinction, such a labeling sequence is denoted $Y = [y_1, y_2, \ldots, y_n]$ in this embodiment.
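Conversely, once a labeling sequence is available, the information entities can be read back out of the text block by collecting the characters between each B label and its matching E label; a minimal sketch:

```python
def decode_entities(block: str, labels: list[str]) -> list[str]:
    # Collect each span that starts at a 'B' label and ends at the next 'E'
    # label; an 'O' resets the span, so unmatched heads are discarded.
    entities, start = [], None
    for i, tag in enumerate(labels):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append(block[start:i + 1])
            start = None
        elif tag == "O":
            start = None
    return entities
```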
As shown in Table 2, again taking the text block "项目采购单位：人民医院" as an example and assuming the current model is the one that extracts purchasing units, the labeling sequence obtained after labeling the text block can be represented as [O, O, B, I, I, …, E].
TABLE 2
Text block: 项目采购单位：人民医院
Feature coding: [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]]
Labeling sequence: [O, O, B, I, I, …, E]
Since the context information is used as a feature, for a text block of length $n$ there are in general $4^n$ candidate labeling sequences (each character may take any one of the four labels B, I, E, O), so that directly enumerating and scoring all of them is computationally infeasible.
In a possible implementation manner, in this embodiment, a conditional probability that each word sequence in the two-dimensional dictionary list is labeled as a candidate label (i.e., one of the possible labels, such as any one of B, I, E, O) is calculated by using a CRF algorithm, and an optimal label is found by using a Viterbi algorithm according to the conditional probability to obtain an optimal label sequence, i.e., a target label sequence, corresponding to the two-dimensional dictionary list, so as to effectively reduce the calculation complexity. The implementation principle of the CRF algorithm and the Viterbi algorithm will be described below:
CRF algorithm
For each word sequence $x_i$ in the two-dimensional dictionary list, the CRF algorithm calculates the probability that $x_i$ is labeled $y_i$ through two kinds of feature functions, namely the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ and the state function $s_{k_2}(y_i, X, i)$. For example, Fig. 2 is a schematic diagram illustrating the principle of calculating the conditional probability with the CRF algorithm according to an embodiment of the present application. As shown in Fig. 2, the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ depends on the current position and the previous position, and represents the probability of transferring from the label $y_{i-1}$ of the previous word sequence $x_{i-1}$ to the label $y_i$ of the current word sequence $x_i$, i.e., the transition probability; the state function $s_{k_2}(y_i, X, i)$ depends only on the current position, and represents the probability that the word sequence $x_i$ is labeled $y_i$, i.e., the state probability.
The parameterized form of the CRF conditional probability is:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k=1}^{K} \omega_k f_k(y_{i-1}, y_i, x, i) \Big) \qquad (1)$$

where $P(y \mid x)$ denotes the conditional probability that $x$ is labeled $y$; $i$ indexes the word sequences ($i = 1, 2, \ldots, n$, with $n$ the length of the two-dimensional dictionary list); $k$ indexes the feature functions ($k = 1, 2, \ldots, K$, with $K$ the number of feature functions); $f_k(y_{i-1}, y_i, x, i)$ is the unified notation for the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ and the state function $s_{k_2}(y_i, X, i)$; $\omega_k$ is the weight of the feature function, the unified notation for the transfer-function and state-function weights; and $Z(x)$ is the normalization factor, which can be formulated as:

$$Z(x) = \sum_{y} \exp\Big( \sum_{i=1}^{n} \sum_{k=1}^{K} \omega_k f_k(y_{i-1}, y_i, x, i) \Big) \qquad (2)$$
the conditional probability that the current word sequence in the two-dimensional dictionary list is marked as a candidate label can be calculated through the formulas (1) and (2).
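Formulas (1) and (2) can be sketched directly in Python for a toy feature set; the weights below are hand-set illustrations, not learned parameters, and the normalization factor is computed by brute-force enumeration of the $4^n$ candidate sequences, precisely the cost that Viterbi decoding later avoids.

```python
import itertools
import math

LABELS = ["B", "I", "E", "O"]

def score(y, x, trans_w, state_w):
    # Inner sum of formula (1): transition weights play the role of the
    # transfer features, state weights the role of the state features.
    s = state_w.get((y[0], x[0]), 0.0)
    for i in range(1, len(x)):
        s += trans_w.get((y[i - 1], y[i]), 0.0) + state_w.get((y[i], x[i]), 0.0)
    return s

def conditional_probability(y, x, trans_w, state_w):
    # P(y|x) of formula (1); Z(x) of formula (2) sums over every candidate
    # labeling sequence, which is only feasible here because x is tiny.
    z = sum(math.exp(score(list(c), x, trans_w, state_w))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, trans_w, state_w)) / z
```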
Viterbi algorithm
In this embodiment, the Viterbi algorithm is used to solve for the optimum of the conditional probability of each word sequence, obtaining the optimal label of each word sequence and thereby the optimal labeling sequence $Y^* = [y_1^*, y_2^*, \ldots, y_n^*]$. The Viterbi algorithm is based on the assumption that the sub-paths of an optimal path must themselves be optimal. The idea of the algorithm is: starting from the root node, at every step compare, for each current node, the best path from the root to each predecessor node plus the step from that predecessor to the current node; the best path to each point is thus computed recursively until the terminal is reached.
Let δ_i(l) denote the maximum of the conditional probability that the i-th word sequence x_i in the two-dimensional dictionary list is labeled l (l may take the values 1, 2, …, m). According to the Viterbi algorithm, the maximum value δ_{i+1}(l) of the conditional probability that the (i+1)-th word sequence x_{i+1} is labeled l is expressed as:

δ_{i+1}(l) = max_{1≤j≤m} { δ_i(j) + Σ_k ω_k·f_k(j, l, x, i+1) }, l = 1, 2, …, m    (3)

Let ψ_{i+1}(l) denote the label value of the i-th word sequence that makes δ_{i+1}(l) reach its maximum; it is expressed as:

ψ_{i+1}(l) = arg max_{1≤j≤m} { δ_i(j) + Σ_k ω_k·f_k(j, l, x, i+1) }, l = 1, 2, …, m    (4)
Therefore, the prediction principle of the labeling sequence based on the CRF algorithm and the Viterbi algorithm is as follows: starting from the first word sequence in the two-dimensional dictionary list, the CRF algorithm calculates, according to formula (1), the conditional probability that the first word sequence is marked as each candidate label, and the calculated conditional probabilities are substituted into formula (4) to obtain the optimal label of the first word sequence; for each following word sequence, the optimal label is obtained according to the CRF algorithm and the Viterbi algorithm on the basis of the optimal label of the previous word sequence; finally, the optimal labels of the word sequences are combined to obtain the target labeling sequence.
Illustratively, take the text block "item procurement unit: … hospital" shown in fig. 3. In the first iteration, assuming that the conditional probabilities of the first character being labeled O (l = 1), B (l = 2), I (l = 3), and E (l = 4) are calculated by the CRF algorithm to be, for example, 0.75, 0.1, 0.05, …, the optimal label of the first character can be determined to be "O". In the second iteration, as known from the Viterbi algorithm, only the 4 conditional probabilities of O→O, O→B, O→I, and O→E need to be calculated to determine the optimal label of the second character. This proceeds sequentially until the optimal label of the last character "hospital" is determined, and the selected optimal labels are combined to obtain the target labeling sequence corresponding to the text block.
Exemplarily, for an input X = (x_1, x_2, …, x_n), the prediction process of the labeling sequence based on the CRF algorithm and the Viterbi algorithm in the model is as follows:
1) Initialization:

δ_1(l) = Σ_k ω_k·f_k(y_0 = start, y_1 = l, x, 1), l = 1, 2, …, m    (5)

ψ_1(l) = start, l = 1, 2, …, m    (6)

2) For i = 1, 2, …, n-1, recursively calculate δ_{i+1}(l) and ψ_{i+1}(l) in order by formulas (3) and (4);

3) When i = n, the calculation is terminated, and the optimum is obtained:

max_y P(y|x) = max_{1≤l≤m} δ_n(l)    (7)

y_n* = arg max_{1≤l≤m} δ_n(l)    (8)

4) Backtrack and calculate in sequence, for i = n-1, n-2, …, 1, the optimal labels:

y_i* = ψ_{i+1}(y_{i+1}*)    (9)

5) Obtain the target labeling sequence Y* = [y_1*, y_2*, …, y_n*].
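Steps 1) to 5) above can be sketched as a generic Viterbi decoder; the `score` callback stands in for the weighted feature sum Σ_k ω_k·f_k(·), and the toy scoring function at the end is purely illustrative:

```python
def viterbi_decode(n, m, score):
    """Decode the optimal labeling sequence following steps 1)-5).

    n: number of word sequences; labels are the integers 0..m-1.
    score(prev, l, i): the combined weighted feature score for giving
    position i the label l when the previous label is prev (None at i == 0).
    """
    # 1) Initialization, formulas (5) and (6).
    delta = [score(None, l, 0) for l in range(m)]
    psi = [[0] * m]
    # 2) Recursion for i = 1, ..., n-1, formulas (3) and (4).
    for i in range(1, n):
        new_delta, back = [], []
        for l in range(m):
            cands = [delta[j] + score(j, l, i) for j in range(m)]
            j_best = max(range(m), key=lambda j: cands[j])
            new_delta.append(cands[j_best])
            back.append(j_best)
        delta = new_delta
        psi.append(back)
    # 3) Termination, formulas (7) and (8): best label at the last position.
    y = [max(range(m), key=lambda l: delta[l])]
    # 4) Backtracking, formula (9).
    for i in range(n - 1, 0, -1):
        y.append(psi[i][y[-1]])
    # 5) The target labeling sequence Y*, front to back.
    return y[::-1]

# Purely illustrative scoring: prefer label 0 first, then alternate labels.
def toy_score(prev, l, i):
    if prev is None:
        return 1.0 if l == 0 else 0.0
    return 1.0 if l != prev else 0.0
```

Only m candidate transitions are scored per label at each position, which is exactly why the example above needs just 4 calculations per step instead of enumerating all sequences.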
It can be understood that, in this embodiment, during the model training phase, the CRF feature functions required by each model may be defined first, and then all the feature functions f_k(y_{i-1}, y_i, x, i) of each model and their weights ω_k are determined by training on data with known labeling sequences (i.e., the training data set).
(3) Information entity extraction module
The information entity extraction module is mainly used for extracting the characters labeled B, I, E from the two-dimensional dictionary list according to the target labeling sequence. Illustratively, assuming that the text block S = [s_1, s_2, s_3, …, s_n] corresponds to the target labeling sequence Y* = [y_1*, y_2*, y_3*, …, y_n*], the target information entity T can be extracted through the following workflow:
Sequentially judge each word sequence s_i in the two-dimensional dictionary list and its label y_i*:

For i = 1 to n (n is the length of the two-dimensional dictionary list) {
    If label y_i* is B, and the label y_{i+1}* of the next character is I, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is B, and the label y_{i+1}* of the next character is E, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is I, and the label y_{i-1}* of the previous character is B, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is E, and the label y_{i-1}* of the previous character is B or I, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else
        do not write the character s_i corresponding to label y_i* into T;
}
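The workflow above can be sketched as follows; the interior case additionally accepts a previous label of I (a slight generalization of the listing, so that entities longer than three characters are kept whole):

```python
def extract_entity(chars, tags):
    """Write the characters labeled B, I, E into the target entity T,
    checking the neighbouring label as in the extraction workflow."""
    t = []
    for i, (c, tag) in enumerate(zip(chars, tags)):
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        prev = tags[i - 1] if i > 0 else None
        if tag == "B" and nxt in ("I", "E"):
            t.append(c)                      # entity beginning
        elif tag == "I" and prev in ("B", "I"):
            t.append(c)                      # entity interior
        elif tag == "E" and prev in ("B", "I"):
            t.append(c)                      # entity end
    return "".join(t)
```

Characters labeled O, and B/I/E labels whose neighbours do not form a well-formed span, are simply skipped.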
It can be understood that, since the functions and principles of the feature extraction module, the sequence labeling prediction module, and the information entity extraction module are different, in the training process of the information entity extraction model in the embodiment of the present application, the feature extraction module, the sequence labeling prediction module, and the information entity extraction module in the information entity extraction model can be trained respectively.
And S104, performing category inference according to a preset rule, and determining the category of at least one information entity.
Since the information entity extraction model in S103 performs information entity extraction based on coarse-grained attributes, category inference needs to be performed to obtain fine-grained attributes, thereby ensuring the accuracy of information entity extraction. For example, the information entity extraction result of a bidding data text contains attributes such as the unit name, address, and contact information; further category inference is required in this step to determine whether they are the unit name, address, or contact information of the purchasing unit or of the agency. The specific rule for performing the category inference may be set according to the actual situation, and is not limited herein.
Taking the contact information as an example, assume that the text block set of the bidding data text (original data text) is S_0 = [S_1, S_2, S_3, …, S_n] and the corresponding information entity set of contact information is T_0 = [T_1, T_2, T_3, …, T_n]. The category of the information entities in the information entity set (purchasing unit or agency) can be determined by the following procedure:
Sequentially judge each text block S_i and the information entity T_i extracted from it, where h is the position at which the information entity T_i first appears in the text block S_i. The information entity set of the purchasing unit is C = [c_1, c_2, c_3, …, c_n], and the information entity set of the agency is D = [d_1, d_2, d_3, …, d_n].
For i = 1 to n (n is the number of text blocks split from the bidding information) {
    If the information entity T_i contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the information entity T_i contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the front part of the text block S_i corresponding to T_i (the first h-1 characters of S_i) contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the front part of S_i (the first h-1 characters of S_i) contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the previous text block S_{i-1} of S_i contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the previous text block S_{i-1} contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the text block S_{i-2}, two before S_i, contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if S_{i-2} contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else
        c_i = "", d_i = "";
}
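A Python sketch of this keyword-and-context procedure; the English keywords and the keyword tuples are placeholders for the configurable rules:

```python
AGENT_KEYS = ("agent",)                  # illustrative keyword lists; the
PURCHASER_KEYS = ("purchasing unit",)    # real rules are configurable

def classify(blocks, entities):
    """Assign each extracted entity T_i to the purchasing unit set C or
    the agency set D by scanning ever-wider contexts for keywords."""
    def has(text, keys):
        return any(k in text for k in keys)

    c, d = [], []
    for i, (block, ent) in enumerate(zip(blocks, entities)):
        h = block.find(ent)                  # first position of T_i in S_i
        prefix = block[:max(h, 0)]           # the characters before T_i
        prev1 = blocks[i - 1] if i >= 1 else ""
        prev2 = blocks[i - 2] if i >= 2 else ""
        # Contexts checked from most to least specific, as in the listing:
        # the entity itself, the prefix of S_i, then S_{i-1}, then S_{i-2}.
        for ctx in (ent, prefix, prev1, prev2):
            if has(ctx, AGENT_KEYS):
                c.append(""); d.append(ent)
                break
            if has(ctx, PURCHASER_KEYS):
                c.append(ent); d.append("")
                break
        else:
            c.append(""); d.append("")
    return c, d
```

The agency keyword is tested before the purchasing-unit keyword within each context, matching the order of the If/Else branches above.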
In addition, after information entity category inference, different information entities for a certain attribute may be extracted from a plurality of text blocks of the same original data text; for example, the contact information of the purchasing unit may appear both at the head and at the tail of the bidding information. For this situation, in this embodiment, one of the information entities may be selected according to a preset rule; for example, the information entity appearing first in the original data text is selected as the information entity of the attribute, so as to remove the duplicated information.
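This first-occurrence de-duplication rule can be sketched as follows (the (attribute, value) pair representation is an assumption made for the sketch):

```python
def dedupe_first(entities):
    """Keep, per attribute, only the entity appearing first in the text.

    entities: (attribute, value) pairs in order of appearance in the
    original data text.
    """
    seen, out = set(), []
    for attr, value in entities:
        if value and attr not in seen:
            seen.add(attr)           # later duplicates of attr are dropped
            out.append((attr, value))
    return out
```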
In addition, because original data texts from different sources have different writing formats, the formats of the extracted information entities also need to be unified. Therefore, in this embodiment, the information entities are also normalized according to the relevant rules; for example, amounts are uniformly converted into yuan values and the values of the purchasing mode are standardized, so that the output result is more normalized and convenient to use in decision analysis.
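As an illustration of such normalization, a minimal amount converter; the unit spellings, the regular expression, and the "wan yuan" = 10,000 yuan factor are assumptions for the sketch (real bidding texts use Chinese unit characters and many more formats):

```python
import re

# Illustrative rules only; real texts need far broader format coverage.
UNIT_FACTORS = {"yuan": 1, "wan yuan": 10_000}   # "wan yuan" = 10,000 yuan

def normalize_amount(raw):
    """Convert an amount string such as '12.5 wan yuan' to a yuan value."""
    m = re.match(r"\s*([\d.]+)\s*(wan yuan|yuan)\s*$", raw)
    if m is None:
        return None                               # unrecognized format
    return float(m.group(1)) * UNIT_FACTORS[m.group(2)]
```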
In this embodiment, the original data text is obtained; the original data text is sequentially partitioned to obtain at least one text block; the at least one text block is processed according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities.
Example two
Fig. 4 is a schematic structural diagram of an information entity extraction apparatus according to a second embodiment of the present application, and as shown in fig. 4, an information entity extraction apparatus 10 in the present embodiment includes:
an acquisition module 11 and a processing module 12.
The acquisition module 11 is used for acquiring an original data text;
the processing module 12 is configured to sequentially block the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Optionally, the processing module 12 is specifically configured to:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
Optionally, the processing module 12 is specifically configured to:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
Optionally, the processing module 12 is specifically configured to:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
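A rough sketch of this blocking pipeline (paragraph segmentation, sentence splitting for overlong paragraphs, and the tail-keyword merge); the sentence-end identifiers, length threshold, and tail keywords are illustrative assumptions:

```python
import re

SENTENCE_END = "。；;"      # assumed end-of-sentence identifiers
MAX_LEN = 200               # assumed character-length threshold

def split_blocks(text, entity_tail_keys=("contact", "address")):
    """Segment raw text into short texts (paragraphs, and sentences for
    overlong paragraphs), then merge a short text with the next one when
    its tail keyword suggests an entity continues across the boundary."""
    shorts = []
    for para in re.split(r"\n+", text):            # paragraph segmentation
        para = para.strip()
        if not para:
            continue
        if len(para) > MAX_LEN:                    # split long paragraphs
            shorts += [s for s in re.split(f"(?<=[{SENTENCE_END}])", para) if s]
        else:
            shorts.append(para)
    blocks, i = [], 0
    while i < len(shorts):                         # semantic-continuity merge
        cur = shorts[i]
        if i + 1 < len(shorts) and any(cur.endswith(k) for k in entity_tail_keys):
            cur += shorts[i + 1]                   # tail keyword: merge next
            i += 1
        blocks.append(cur)
        i += 1
    return blocks
```

The merge step keeps the short texts in their original order, so the resulting blocks remain sequential as required.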
Optionally, the processing module 12 is further configured to:
special identifiers are added at the beginning and the end of each text block.
Optionally, the processing module 12 is specifically configured to:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
Optionally, the processing module 12 is specifically configured to:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
Optionally, the obtaining module 11 is further configured to:
acquiring a sample data text;
the processing module 12 is further configured to:
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes; and carrying out model training according to the training data set to obtain at least one information entity extraction model.
Optionally, the processing module 12 is specifically configured to:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
The information entity extraction device provided by the embodiment can execute the information entity extraction method provided by the method embodiment, and has the corresponding functional modules and beneficial effects of the execution method. The implementation principle and technical effect of this embodiment are similar to those of the above method embodiments, and are not described in detail here.
EXAMPLE III
Fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application, and as shown in fig. 5, the electronic device 20 includes a memory 21, a processor 22, and a computer program stored in the memory and executable on the processor; the number of the processors 22 of the electronic device 20 may be one or more, and one processor 22 is taken as an example in fig. 5; the processor 22 and the memory 21 in the electronic device 20 may be connected by a bus or other means, and fig. 5 illustrates the connection by the bus as an example.
The memory 21 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the acquisition module 11 and the processing module 12 in the embodiment of the present application. The processor 22 executes various functional applications of the device/terminal/server and data processing by running software programs, instructions and modules stored in the memory 21, that is, implements the above-described information entity extraction method.
The memory 21 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 21 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 21 may further include memory located remotely from the processor 22, which may be connected to the device/terminal/server through a network. Examples of such a network include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
Example four
A fourth embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a computer processor, is configured to perform a method for information entity extraction, the method including:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Of course, the computer program of the computer-readable storage medium provided in this embodiment of the present application is not limited to the method operations described above, and may also perform related operations in the information entity extraction method provided in any embodiment of the present application.
From the above description of the embodiments, it is obvious to those skilled in the art that the present application can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is the better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the embodiment of the information entity extraction apparatus, each included unit and module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (12)

1. An information entity extraction method, comprising:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
2. The method of claim 1, wherein partitioning the original data text into at least one text block that is ordered and semantically continuous comprises:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
3. The method of claim 2, wherein sequentially blocking the original data text to obtain at least one text block comprises:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
4. The method of claim 2, wherein the sorting and semantic continuity processing of the at least one short text to obtain the at least one text block comprises:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
5. The method of claim 2, further comprising:
special identifiers are added at the beginning and the end of each text block.
6. The method of claim 1, wherein the processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text comprises:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
7. The method of claim 6, wherein the performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block comprises:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
8. The method according to any of claims 1-7, wherein before processing the at least one text block according to a pre-constructed information entity extraction model to obtain the at least one information entity contained in the original data text, the method further comprises:
acquiring a sample data text;
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes;
and carrying out model training according to the training data set to obtain at least one information entity extraction model.
9. The method of claim 8, wherein said tagging the sample data text according to a target information entity to obtain a training data set comprises:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
10. An information entity extraction apparatus, comprising:
the acquisition module is used for acquiring an original data text;
the processing module is used for sequentially blocking the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the information entity extraction method according to any one of claims 1 to 9 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the information entity extraction method according to any one of claims 1 to 9.
CN202110313303.9A 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium Pending CN113033204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313303.9A CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110313303.9A CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113033204A true CN113033204A (en) 2021-06-25

Family

ID=76473685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313303.9A Pending CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113033204A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663549A (en) * 2023-05-18 2023-08-29 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN117034942A (en) * 2023-10-07 2023-11-10 之江实验室 Named entity recognition method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008472A (en) * 2019-03-29 2019-07-12 北京明略软件系统有限公司 A kind of method, apparatus, equipment and computer readable storage medium that entity extracts
CN110276054A (en) * 2019-05-16 2019-09-24 湖南大学 A kind of insurance text structure implementation method
CN111444717A (en) * 2018-12-28 2020-07-24 天津幸福生命科技有限公司 Method and device for extracting medical entity information, storage medium and electronic equipment
CN112257421A (en) * 2020-12-21 2021-01-22 完美世界(北京)软件科技发展有限公司 Nested entity data identification method and device and electronic equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663549A (en) * 2023-05-18 2023-08-29 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN116663549B (en) * 2023-05-18 2024-03-19 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN117034942A (en) * 2023-10-07 2023-11-10 之江实验室 Named entity recognition method, device, equipment and readable storage medium
CN117034942B (en) * 2023-10-07 2024-01-09 之江实验室 Named entity recognition method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110516247B (en) Named entity recognition method based on neural network and computer storage medium
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
CN111046660B (en) Method and device for identifying text professional terms
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN113033183A (en) Network new word discovery method and system based on statistics and similarity
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN110751234A (en) OCR recognition error correction method, device and equipment
CN110674301A (en) Emotional tendency prediction method, device and system and storage medium
CN110705261B (en) Chinese text word segmentation method and system thereof
CN113157918B (en) Commodity name short text classification method and system based on attention mechanism
CN112560425A (en) Template generation method and device, electronic equipment and storage medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN116090450A (en) Text processing method and computing device
CN114840642A (en) Event extraction method, device, equipment and storage medium
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN113971403A (en) Entity identification method and system considering text semantic information
CN113240485A (en) Training method of text generation model, and text generation method and device
CN113641800B (en) Text duplicate checking method, device and equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination