CN113033204A - Information entity extraction method and device, electronic equipment and storage medium - Google Patents

Information entity extraction method and device, electronic equipment and storage medium

Info

Publication number
CN113033204A
CN113033204A
Authority
CN
China
Prior art keywords
text
information entity
original data
text block
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110313303.9A
Other languages
Chinese (zh)
Inventor
黄进然
林璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wondfo Biotech Co Ltd
Original Assignee
Guangzhou Wondfo Biotech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wondfo Biotech Co Ltd filed Critical Guangzhou Wondfo Biotech Co Ltd
Priority to CN202110313303.9A priority Critical patent/CN113033204A/en
Publication of CN113033204A publication Critical patent/CN113033204A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides an information entity extraction method and device, electronic equipment and a storage medium. An original data text is obtained and sequentially partitioned into at least one text block; the at least one text block is processed by a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities.

Description

Information entity extraction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data analysis, in particular to an information entity extraction method and device, electronic equipment and a storage medium.
Background
As an important branch of the field of natural language processing, the main function of information extraction is to extract specific factual information from natural language text, helping people automatically and quickly find the information they really need among large amounts of data and thereby meet the challenge brought by the information explosion. Information entity extraction is the most practical technique in information extraction; its main task is to identify and classify proper names and meaningful quantity phrases appearing in text.
At present, the mainstream method in industry for information entity extraction is sequence labeling: each word in a text may take one of several candidate category labels, where each label encodes the position of the word within each type of information entity. Each word in the text is automatically labeled (i.e., classified) in sequence, and the resulting labels are then integrated to finally obtain information entities composed of multiple words, together with their categories.
However, for longer texts the number of possible label sequences grows rapidly, so the sequence labeling method suffers from poor recognition effect and low recognition efficiency.
Disclosure of Invention
The embodiment of the application provides an information entity extraction method, an information entity extraction device, electronic equipment and a storage medium, and aims to solve the problems of low identification efficiency and accuracy in the prior art.
In a first aspect, an embodiment of the present application provides an information entity extraction method, including:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Optionally, the partitioning the original data text to obtain at least one text block that is ordered and semantically continuous includes:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
Optionally, the sequentially blocking the original data text to obtain at least one text block includes:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
Optionally, the sorting and semantic continuity processing the at least one short text to obtain the at least one text block includes:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
Optionally, the method further comprises:
special identifiers are added at the beginning and the end of each text block.
Optionally, the processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity included in the original data text includes:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
Optionally, the performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block includes:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
Optionally, before the processing the at least one text block according to the pre-constructed information entity extraction model to obtain the at least one information entity included in the original data text, the method further includes:
acquiring a sample data text;
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes;
and carrying out model training according to the training data set to obtain at least one information entity extraction model.
Optionally, the marking the sample data text according to the target information entity to obtain a training data set, including:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
In a second aspect, an embodiment of the present application provides an information entity extraction apparatus, including:
the acquisition module is used for acquiring an original data text;
the processing module is used for sequentially blocking the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the information entity extraction method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the information entity extraction method according to the first aspect.
According to the information entity extraction method and device, the electronic equipment and the storage medium, an original data text is obtained and sequentially partitioned into at least one text block; the at least one text block is processed by a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities. Because the strategies of sequentially partitioning the original data text and of merge-predict-restore are adopted in the extraction process, the complexity and workload of the model are reduced while the extraction efficiency and extraction accuracy of the information entities are improved.
Drawings
Fig. 1 is a schematic flowchart of an information entity extraction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a principle of calculating conditional probability by a CRF algorithm according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the principle of labeled sequence prediction based on the CRF algorithm and the Viterbi algorithm according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an information entity extraction apparatus according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
The main ideas of the technical scheme are as follows. Based on the technical problems in the prior art, the embodiment of the present application provides a technical scheme for information entity extraction that converts the problem of information entity extraction into a problem of sequence labeling prediction: first, the original data text is sequentially partitioned into blocks; then, an optimal label sequence is found for each text block by an information entity extraction model pre-constructed with the conditional random field (CRF) algorithm and the Viterbi algorithm; finally, the correct information entities are determined through rule-based post-processing. Splitting a long text into text blocks through sequential blocking and then performing information entity extraction on the blocks with the model effectively reduces the complexity of the model and also improves its prediction accuracy and operating efficiency. In addition, in the choice of model strategy, the embodiment of the application adopts a merge-predict-restore method: several similar attributes are merged into one coarse-grained attribute, all information entities of the coarse-grained attribute are extracted by the prediction model, and category inference then splits and restores the coarse-grained attribute into fine-grained attributes. A method of first extracting the prefix together with the attribute information entity and then removing the prefix by rule is also adopted, which improves the prediction precision of the model while reducing its complexity and workload.
The technical solutions of the present application will be described below by taking an example of extracting relevant attribute information from bidding data, and it should be understood that the technical solutions of the embodiments of the present application can also be used in other scenarios.
Example one
Fig. 1 is a schematic flow chart of an information entity extraction method provided in an embodiment of the present application, where the method of the present embodiment may be executed by an information entity extraction device provided in the embodiment of the present application, and the device may be implemented in a software and/or hardware manner and may be integrated in an electronic device such as a server and an intelligent terminal. As shown in fig. 1, the information entity extraction method of this embodiment includes:
s101, acquiring an original data text.
In this embodiment, the original data text refers to a text to be subjected to information entity extraction, and is a data basis for performing information entity extraction. In order to achieve the acquisition of the original data, in this embodiment, a data acquisition probe may be set on a related platform or a website in advance, and the original data text is obtained by collecting and sorting data returned by the data acquisition probe.
Optionally, on the basis of collection and sorting, in this embodiment, some simple pre-processing may be performed on the data returned by the data acquisition probe, including removing special symbols such as spaces, tab symbols, line feed symbols, etc., converting english symbols to chinese symbols, converting full-angle symbols to half-angle symbols, etc., so that the obtained original data text can meet the subsequent analysis and use requirements.
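The pre-processing described above can be sketched in Python; the exact symbol set and the restriction of full-width folding to letters and digits are illustrative assumptions, not the patent's specification.

```python
def to_half_width(ch: str) -> str:
    # Fold full-width (full-angle) digits and letters to half-width (half-angle);
    # full-width forms occupy U+FF10-FF19 (digits) and U+FF21-FF5A (letters).
    code = ord(ch)
    if 0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
        return chr(code - 0xFEE0)
    return ch

def preprocess(raw: str) -> str:
    # Remove special symbols such as spaces, tab symbols and line feed symbols.
    cleaned = "".join(ch for ch in raw if ch not in " \t\r\n\u3000")
    # Convert full-angle symbols to half-angle symbols.
    return "".join(to_half_width(ch) for ch in cleaned)
```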
S102, sequentially blocking the original data text to obtain at least one text block.
Because the CRF algorithm has poor training and prediction effects on long texts, in this embodiment, the obtained original data texts are sequentially partitioned, that is, the original data texts are sequentially and gradually partitioned into some shorter text blocks without affecting semantic continuity.
Optionally, in this step, the original data text may be segmented and/or sentence-divided according to the paragraph identifier or the sentence end identifier to obtain at least one short text, and then the at least one split short text is subjected to sequencing and semantic continuity processing to obtain at least one text block.
In one possible implementation, the ordered partitioning of the original data text may be implemented by the following specific steps:
(1) the bidding information content is divided into a plurality of texts at paragraph item symbols (such as "一、", "1.", "1.1", "(1)"); for convenience of distinction, the texts obtained by this division are called paragraph texts;
(2) for each paragraph text, it is judged in turn whether its character length exceeds a set threshold, such as 100 characters. A paragraph text longer than the threshold is further split using sentence-end identifiers (such as the period) as segmentation symbols; paragraph texts whose character length is less than or equal to the threshold are not split further. For convenience of distinction, the texts obtained in this step are called short texts;
(3) the short texts obtained in step (2) are first sorted according to the order in which they appear in the original data text; semantic judgment is then performed to determine whether the tail keyword of a target short text is part of a target information entity, and if so, that short text is merged with the next one. This yields ordered short texts whose semantic continuity is preserved; for convenience of distinction, the texts obtained in this step are called text blocks.
The target short text refers to the short text currently undergoing the semantic judgment, and can be any short text. The target information entity refers to an information entity to be extracted, which can be determined in advance according to requirements, such as item name, item number, unit name, address, contact person, contact information, purchase amount, purchase method, product name, quantity, amount, and the like.
Illustratively, if the tail keyword of a short text is "procurement budget:" and the next short text is "300,000 yuan", the two short texts need to be merged into "procurement budget: 300,000 yuan", thus preserving semantic continuity.
In order to facilitate subsequent processing, in this step, a sequence number may be further marked on the obtained at least one text block according to the sequence of the arrangement.
Optionally, since the CRF algorithm has a poor prediction effect when an information entity appears at the head or tail of a text, in order to improve the prediction accuracy of the CRF algorithm, in this embodiment a special text identifier may further be added at the beginning and end of each text block, so that information entities appear in the middle of the block (instead of at its head or tail). For example, an "@" is added before the beginning of the sentence and another "@" after the end of the sentence.
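A minimal Python sketch of steps (1) through (3) plus the sentinel markers; the paragraph-symbol pattern, the 100-character threshold, the "@" sentinel and the colon-based tail-keyword test are illustrative assumptions.

```python
import re

SENT_END = "。"              # sentence-end identifier used as the secondary split symbol
MAX_LEN = 100                # character-length threshold from step (2)
PREFIX_TAILS = ("：", ":")   # tail keywords treated as part of an unfinished entity (assumption)

def block_text(raw: str) -> list[str]:
    # Step (1): split into paragraph texts on paragraph item symbols such as "一、", "1.", "(1)".
    paragraphs = [p for p in re.split(r"(?:[一二三四五六七八九十]、|\d+(?:\.\d+)*[、.．]|\(\d+\))", raw) if p]
    # Step (2): further split over-long paragraphs at sentence-end identifiers.
    shorts = []
    for p in paragraphs:
        shorts += [s for s in re.split(f"(?<={SENT_END})", p) if s] if len(p) > MAX_LEN else [p]
    # Step (3): the shorts are already in original order; merge a short text into its
    # predecessor when the predecessor's tail keyword signals an unfinished entity.
    blocks: list[str] = []
    for s in shorts:
        if blocks and blocks[-1].endswith(PREFIX_TAILS):
            blocks[-1] += s
        else:
            blocks.append(s)
    # Add a special identifier at the beginning and end of each text block.
    return [f"@{b}@" for b in blocks]
```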
S103, processing at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text.
In this embodiment, in order to extract required data from an original data text, various information entity extraction models are constructed in advance according to a CRF algorithm and a Viterbi algorithm.
Since the original data text usually includes many attribute information, such as bidding information text, it usually includes item number, item name, bidding unit address, bidding unit contact, agency name, agency address, agency contact, winning unit name, winning unit address, winning unit contact, purchasing amount, purchasing method, purchasing name, specification, quantity, unit price, price quote, brand, manufacturer, etc. If the information entities of all attributes are extracted simultaneously through one model, the complexity of the model is increased, and the prediction precision is also reduced; if a specific model is established for each attribute individually, there are N models with N attributes, and each model extracts the attribute with the finest granularity, which increases the training workload of the model and may not have high prediction accuracy. For example, the attribute "contact address of a bidding unit" generally appears only in one place of the bidding information, but the "contact address" may appear in a plurality of places of the bidding information, which may cause much interference to the extraction of the attribute "contact address of a bidding unit".
To solve this problem, in this embodiment, similar attributes are merged when the information entity extraction models are constructed. For example, the attributes bidding-unit contact information, agency contact information and winning-unit contact information are merged into the coarse-grained attribute contact information, and a model is then established for each coarse-grained attribute. When information entity extraction is performed, all information entities corresponding to a coarse-grained attribute are extracted by its model, which not only reduces the complexity of the models but also preserves their prediction accuracy.
In a possible implementation manner, in this embodiment, a training data set is obtained by obtaining a sample data text, labeling the sample data text according to a target information entity, and performing model training according to the training data set to obtain at least one information entity extraction model.
The target information entity is obtained by combining information entities with the same type of attributes, namely the coarse-grained attributes, such as the contact information.
Since the content format of most original data texts is relatively fixed, the writing format of the prefix string of an attribute to be extracted is also relatively fixed; for example, the name of the bidding unit is generally preceded by "bidding unit:" or similar. If the prefix keyword is extracted together with the information entity as a single entity, the complexity of the model and the prediction error can be effectively reduced. Therefore, when marking the sample data text, the prefix keywords of the target information entities and the target information entities themselves can be marked together to obtain the training data set.
Illustratively, related prefix keywords are added to labels of training data of the entity extraction model of attribute information such as item names, item numbers, unit names, addresses, contacts, contact ways, purchase amounts, purchase ways, product names, quantities, amounts and the like, and the labels are labeled together to obtain a training data set corresponding to the model.
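As an illustration of marking the prefix keyword together with the attribute value, the following hypothetical helper produces one labeling sequence of the kind consumed by the sequence model (the label names follow the B/I/E/O convention described later; the span-location logic is an assumption, not the patent's annotation tooling):

```python
def bieo_labels(text: str, entity: str) -> list[str]:
    # Mark `entity` (prefix keyword + attribute value, as one span) inside
    # `text`; every other character gets the 'O' (outside) label.
    labels = ["O"] * len(text)
    start = text.find(entity)
    if start != -1 and len(entity) >= 2:
        end = start + len(entity) - 1
        labels[start] = "B"                 # head of the information entity
        for i in range(start + 1, end):
            labels[i] = "I"                 # inside of the information entity
        labels[end] = "E"                   # tail of the information entity
    return labels
```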
It can be understood that, in this embodiment, the number of the information entity extraction models trained by the above method is consistent with the number of the coarse-grained attributes, and each information entity extraction model is used for performing information entity extraction on one coarse-grained attribute.
In this embodiment, according to different functions, the information entity extraction model may be divided into a feature coding module, a sequence labeling prediction module, and an information entity extraction module, where the feature coding module is configured to perform feature coding on a text block to obtain a two-dimensional dictionary list corresponding to each text block, the sequence labeling prediction module is configured to perform sequence labeling prediction on the two-dimensional dictionary list to obtain a target labeling sequence (optimal labeling sequence) of each text block, and the information entity extraction module is configured to perform character extraction from the two-dimensional dictionary list according to the target labeling sequence to obtain an information entity included in each text block. The following will respectively explain the implementation principle of each module:
(1) Feature encoding module
Since the CRF algorithm uses context information in sequence labeling prediction, it is necessary to perform feature coding on each text block, convert the text block into a two-dimensional dictionary list, and determine a previous character and a next character in advance for each character in the text block.
In one possible implementation, in this embodiment, for each text block, denote the characters in the block by $s_i$ ($i = 1, 2, \ldots, n$, where $n$ is the length of the input text block); feature coding can then be performed by the following rules:
(a) if the current character is a middle character of the text block, i.e. $1 < i < n$, it is encoded by the rule: current character → [current character, previous character + current character, current character + next character], i.e. $s_i \to [s_i,\ s_{i-1}s_i,\ s_i s_{i+1}]$;
(b) if the current character is the first character of the text, i.e. $i = 1$, and its previous character is denoted <start>, it is encoded by the rule: current character → [current character, <start>, current character + next character], i.e. $s_1 \to [s_1,\ \langle start\rangle,\ s_1 s_2]$;
(c) if the current character is the last character of the text, i.e. $i = n$, and its next character is denoted <end>, it is encoded by the rule: current character → [current character, previous character + current character, <end>], i.e. $s_n \to [s_n,\ s_{n-1}s_n,\ \langle end\rangle]$.
Illustratively, as shown in Table 1, taking the text block "项目采购单位：人民医院" ("item purchasing unit: People's Hospital") as an example, after feature coding it is converted into the two-dimensional dictionary list [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]].
TABLE 1
Text block: 项目采购单位：人民医院
Feature coding: [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]]
it should be noted that, in this embodiment, punctuation marks in the text block, such as colon (:), also participate in feature coding.
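Rules (a) through (c) above can be sketched as a single Python function (a minimal sketch; the function name is illustrative):

```python
def encode_features(block: str) -> list[list[str]]:
    # Convert a text block into the two-dimensional dictionary list: each
    # character s_i becomes [s_i, s_{i-1}+s_i, s_i+s_{i+1}], with the <start>
    # and <end> markers standing in for the missing neighbours at the edges.
    n = len(block)
    encoded = []
    for i, ch in enumerate(block):
        prev_feat = "<start>" if i == 0 else block[i - 1] + ch
        next_feat = "<end>" if i == n - 1 else ch + block[i + 1]
        encoded.append([ch, prev_feat, next_feat])
    return encoded
```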
After feature encoding, each text block is converted into a corresponding two-dimensional dictionary list. Assuming the word sequence obtained by encoding each character $s_i$ is denoted $x_i$, the text block $S = [s_1, s_2, \ldots, s_n]$ yields after feature coding the two-dimensional dictionary list $X = [x_1, x_2, \ldots, x_n]$.
(2) Sequence labeling prediction module
Illustratively, in this embodiment, the BIEO (full name: begin, intermediate, end, other) labeling method is adopted to label each character in the two-dimensional dictionary list. Here B denotes the head of an information entity, I the inside of an information entity, E the tail of an information entity, and O any character that is not part of an information entity. The labeled two-dimensional dictionary list is thereby converted into a labeling sequence over the four letters B, I, E, O, e.g., [O, O, B, I, I, …, E]; for convenience of distinction, such a labeling sequence is denoted $Y = [y_1, y_2, \ldots, y_n]$ in this embodiment.
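Conversely, once a labeling sequence is available, the information entities can be read back out of the text block by collecting the characters between each B label and its matching E label; a minimal sketch:

```python
def decode_entities(block: str, labels: list[str]) -> list[str]:
    # Collect each span that starts at a 'B' label and ends at the next 'E'
    # label; an 'O' resets the span, so unmatched heads are discarded.
    entities, start = [], None
    for i, tag in enumerate(labels):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append(block[start:i + 1])
            start = None
        elif tag == "O":
            start = None
    return entities
```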
As shown in Table 2, again taking the text block "项目采购单位：人民医院" as an example and assuming the current model is the one that extracts purchasing units, the labeling sequence obtained after labeling the text block can be represented as [O, O, B, I, I, …, E].
TABLE 2
Text block: 项目采购单位：人民医院
Feature coding: [[项, <start>, 项目], [目, 项目, 目采], ……, [院, 医院, <end>]]
Labeling sequence: [O, O, B, I, I, …, E]
Since the context information is used as a feature, for a text block of length $n$ there are in general $4^n$ candidate labeling sequences (each character may take any one of the four labels B, I, E, O), so that directly enumerating and scoring all of them is computationally infeasible.
In a possible implementation manner, in this embodiment, a conditional probability that each word sequence in the two-dimensional dictionary list is labeled as a candidate label (i.e., one of the possible labels, such as any one of B, I, E, O) is calculated by using a CRF algorithm, and an optimal label is found by using a Viterbi algorithm according to the conditional probability to obtain an optimal label sequence, i.e., a target label sequence, corresponding to the two-dimensional dictionary list, so as to effectively reduce the calculation complexity. The implementation principle of the CRF algorithm and the Viterbi algorithm will be described below:
CRF algorithm
For each word sequence $x_i$ in the two-dimensional dictionary list, the CRF algorithm calculates the probability that $x_i$ is labeled $y_i$ through two kinds of feature functions, namely the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ and the state function $s_{k_2}(y_i, X, i)$. For example, Fig. 2 is a schematic diagram illustrating the principle of calculating the conditional probability with the CRF algorithm according to an embodiment of the present application. As shown in Fig. 2, the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ depends on the current position and the previous position, and represents the probability of transferring from the label $y_{i-1}$ of the previous word sequence $x_{i-1}$ to the label $y_i$ of the current word sequence $x_i$, i.e., the transition probability; the state function $s_{k_2}(y_i, X, i)$ depends only on the current position, and represents the probability that the word sequence $x_i$ is labeled $y_i$, i.e., the state probability.
The parameterized form of the CRF conditional probability is:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k=1}^{K} \omega_k f_k(y_{i-1}, y_i, x, i) \Big) \qquad (1)$$

where $P(y \mid x)$ denotes the conditional probability that $x$ is labeled $y$; $i$ indexes the word sequences ($i = 1, 2, \ldots, n$, with $n$ the length of the two-dimensional dictionary list); $k$ indexes the feature functions ($k = 1, 2, \ldots, K$, with $K$ the number of feature functions); $f_k(y_{i-1}, y_i, x, i)$ is the unified notation for the transfer function $t_{k_1}(y_{i-1}, y_i, i)$ and the state function $s_{k_2}(y_i, X, i)$; $\omega_k$ is the weight of the feature function, the unified notation for the transfer-function and state-function weights; and $Z(x)$ is the normalization factor, which can be formulated as:

$$Z(x) = \sum_{y} \exp\Big( \sum_{i=1}^{n} \sum_{k=1}^{K} \omega_k f_k(y_{i-1}, y_i, x, i) \Big) \qquad (2)$$
the conditional probability that the current word sequence in the two-dimensional dictionary list is marked as a candidate label can be calculated through the formulas (1) and (2).
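Formulas (1) and (2) can be sketched directly in Python for a toy feature set; the weights below are hand-set illustrations, not learned parameters, and the normalization factor is computed by brute-force enumeration of the $4^n$ candidate sequences, precisely the cost that Viterbi decoding later avoids.

```python
import itertools
import math

LABELS = ["B", "I", "E", "O"]

def score(y, x, trans_w, state_w):
    # Inner sum of formula (1): transition weights play the role of the
    # transfer features, state weights the role of the state features.
    s = state_w.get((y[0], x[0]), 0.0)
    for i in range(1, len(x)):
        s += trans_w.get((y[i - 1], y[i]), 0.0) + state_w.get((y[i], x[i]), 0.0)
    return s

def conditional_probability(y, x, trans_w, state_w):
    # P(y|x) of formula (1); Z(x) of formula (2) sums over every candidate
    # labeling sequence, which is only feasible here because x is tiny.
    z = sum(math.exp(score(list(c), x, trans_w, state_w))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, trans_w, state_w)) / z
```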
Viterbi algorithm
In this embodiment, the Viterbi algorithm is used to solve for the optimum of the conditional probability of each word sequence, obtaining the optimal label of each word sequence and thereby the optimal labeling sequence $Y^* = [y_1^*, y_2^*, \ldots, y_n^*]$. The Viterbi algorithm is based on the assumption that the sub-paths of an optimal path must themselves be optimal. The idea of the algorithm is: starting from the root node, at every step compare, for each current node, the best path from the root to each predecessor node plus the step from that predecessor to the current node; the best path to each point is thus computed recursively until the terminal is reached.
Let δ_i(l) denote the maximum of the conditional probability that the i-th word sequence x_i in the two-dimensional dictionary list is labeled l (l may take the values 1, 2, …, m). According to the Viterbi algorithm, the maximum value δ_{i+1}(l) of the conditional probability that the (i+1)-th word sequence x_{i+1} is labeled l is expressed as:

δ_{i+1}(l) = max_{1≤j≤m} { δ_i(j) + Σ_k ω_k·f_k(j, l, x, i+1) }, l = 1, 2, …, m    (3)

Let ψ_{i+1}(l) denote the label value of the i-th word sequence that makes δ_{i+1}(l) reach its maximum; it is expressed as:

ψ_{i+1}(l) = arg max_{1≤j≤m} { δ_i(j) + Σ_k ω_k·f_k(j, l, x, i+1) }, l = 1, 2, …, m    (4)
Therefore, the prediction principle of the labeling sequence based on the CRF algorithm and the Viterbi algorithm is as follows: starting from the first word sequence in the two-dimensional dictionary list, the CRF algorithm calculates, according to formula (1), the conditional probability that the first word sequence is marked as each candidate label, and the calculated conditional probabilities are substituted into formula (4) to obtain the optimal label of the first word sequence; for each following word sequence, the optimal label is obtained according to the CRF algorithm and the Viterbi algorithm on the basis of the optimal label of the previous word sequence; finally, the optimal labels of the word sequences are combined to obtain the target labeling sequence.
Illustratively, take the text block "item procurement unit: … hospital" shown in fig. 3. In the first iteration, assuming that the conditional probabilities of the first character being labeled O (l = 1), B (l = 2), I (l = 3), and E (l = 4) are calculated by the CRF algorithm to be, for example, 0.75, 0.1, 0.05, …, the optimal label of the first character can be determined to be "O". In the second iteration, as known from the Viterbi algorithm, only the 4 conditional probabilities of O→O, O→B, O→I, and O→E need to be calculated to determine the optimal label of the second character. This proceeds sequentially until the optimal label of the last character "hospital" is determined, and the selected optimal labels are combined to obtain the target labeling sequence corresponding to the text block.
Exemplarily, for an input X = (x_1, x_2, …, x_n), the prediction process of the labeling sequence based on the CRF algorithm and the Viterbi algorithm in the model is as follows:
1) Initialization:

δ_1(l) = Σ_k ω_k·f_k(y_0 = start, y_1 = l, x, 1), l = 1, 2, …, m    (5)

ψ_1(l) = start, l = 1, 2, …, m    (6)

2) For i = 1, 2, …, n-1, recursively calculate δ_{i+1}(l) and ψ_{i+1}(l) in order by formulas (3) and (4);

3) When i = n, the calculation is terminated, and the optimum is obtained:

max_y P(y|x) = max_{1≤l≤m} δ_n(l)    (7)

y_n* = arg max_{1≤l≤m} δ_n(l)    (8)

4) Backtrack and calculate in sequence, for i = n-1, n-2, …, 1, the optimal labels:

y_i* = ψ_{i+1}(y_{i+1}*)    (9)

5) Obtain the target labeling sequence Y* = [y_1*, y_2*, …, y_n*].
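Steps 1) to 5) above can be sketched as a generic Viterbi decoder; the `score` callback stands in for the weighted feature sum Σ_k ω_k·f_k(·), and the toy scoring function at the end is purely illustrative:

```python
def viterbi_decode(n, m, score):
    """Decode the optimal labeling sequence following steps 1)-5).

    n: number of word sequences; labels are the integers 0..m-1.
    score(prev, l, i): the combined weighted feature score for giving
    position i the label l when the previous label is prev (None at i == 0).
    """
    # 1) Initialization, formulas (5) and (6).
    delta = [score(None, l, 0) for l in range(m)]
    psi = [[0] * m]
    # 2) Recursion for i = 1, ..., n-1, formulas (3) and (4).
    for i in range(1, n):
        new_delta, back = [], []
        for l in range(m):
            cands = [delta[j] + score(j, l, i) for j in range(m)]
            j_best = max(range(m), key=lambda j: cands[j])
            new_delta.append(cands[j_best])
            back.append(j_best)
        delta = new_delta
        psi.append(back)
    # 3) Termination, formulas (7) and (8): best label at the last position.
    y = [max(range(m), key=lambda l: delta[l])]
    # 4) Backtracking, formula (9).
    for i in range(n - 1, 0, -1):
        y.append(psi[i][y[-1]])
    # 5) The target labeling sequence Y*, front to back.
    return y[::-1]

# Purely illustrative scoring: prefer label 0 first, then alternate labels.
def toy_score(prev, l, i):
    if prev is None:
        return 1.0 if l == 0 else 0.0
    return 1.0 if l != prev else 0.0
```

Only m candidate transitions are scored per label at each position, which is exactly why the example above needs just 4 calculations per step instead of enumerating all sequences.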
It can be understood that, in this embodiment, during the model training phase, the CRF feature functions required by each model may be defined first, and then all the feature functions f_k(y_{i-1}, y_i, x, i) of each model and their weights ω_k are determined by training on data with known labeling sequences (i.e., the training data set).
(3) Information entity extraction module
The information entity extraction module is mainly used for extracting the characters labeled B, I, E from the two-dimensional dictionary list according to the target labeling sequence. Illustratively, assuming that the text block S = [s_1, s_2, s_3, …, s_n] corresponds to the target labeling sequence Y* = [y_1*, y_2*, y_3*, …, y_n*], the target information entity T can be extracted through the following workflow:
Sequentially judge each word sequence s_i in the two-dimensional dictionary list and its label y_i*:

For i = 1 to n (n is the length of the two-dimensional dictionary list) {
    If label y_i* is B, and the label y_{i+1}* of the next character is I, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is B, and the label y_{i+1}* of the next character is E, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is I, and the label y_{i-1}* of the previous character is B, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else if label y_i* is E, and the label y_{i-1}* of the previous character is B or I, then
        write the character s_i corresponding to label y_i* into T, i.e. T ← s_i;
    Else
        do not write the character s_i corresponding to label y_i* into T;
}
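The workflow above can be sketched as follows; the interior case additionally accepts a previous label of I (a slight generalization of the listing, so that entities longer than three characters are kept whole):

```python
def extract_entity(chars, tags):
    """Write the characters labeled B, I, E into the target entity T,
    checking the neighbouring label as in the extraction workflow."""
    t = []
    for i, (c, tag) in enumerate(zip(chars, tags)):
        nxt = tags[i + 1] if i + 1 < len(tags) else None
        prev = tags[i - 1] if i > 0 else None
        if tag == "B" and nxt in ("I", "E"):
            t.append(c)                      # entity beginning
        elif tag == "I" and prev in ("B", "I"):
            t.append(c)                      # entity interior
        elif tag == "E" and prev in ("B", "I"):
            t.append(c)                      # entity end
    return "".join(t)
```

Characters labeled O, and B/I/E labels whose neighbours do not form a well-formed span, are simply skipped.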
It can be understood that, since the functions and principles of the feature extraction module, the sequence labeling prediction module, and the information entity extraction module are different, in the training process of the information entity extraction model in the embodiment of the present application, the feature extraction module, the sequence labeling prediction module, and the information entity extraction module in the information entity extraction model can be trained respectively.
And S104, performing category inference according to a preset rule, and determining the category of at least one information entity.
Since the information entity extraction model in S103 performs information entity extraction based on coarse-grained attributes, category inference needs to be performed to obtain fine-grained attributes, thereby ensuring the accuracy of information entity extraction. For example, the information entity extraction result of a bidding data text contains attributes such as the unit name, address, and contact information; further category inference is required in this step to determine whether they are the unit name, address, or contact information of the purchasing unit or of the agency. The specific rule for performing the category inference may be set according to the actual situation, and is not limited herein.
Taking the contact information as an example, assume that the text block set of the bidding data text (original data text) is S_0 = [S_1, S_2, S_3, …, S_n] and the corresponding information entity set of contact information is T_0 = [T_1, T_2, T_3, …, T_n]. The category of the information entities in the information entity set (purchasing unit or agency) can be determined by the following procedure:
Sequentially judge each text block S_i and the information entity T_i extracted from it, where h is the position at which the information entity T_i first appears in the text block S_i. The information entity set of the purchasing unit is C = [c_1, c_2, c_3, …, c_n], and the information entity set of the agency is D = [d_1, d_2, d_3, …, d_n].
For i = 1 to n (n is the number of text blocks split from the bidding information) {
    If the information entity T_i contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the information entity T_i contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the front part of the text block S_i corresponding to T_i (the first h-1 characters of S_i) contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the front part of S_i (the first h-1 characters of S_i) contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the previous text block S_{i-1} of S_i contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if the previous text block S_{i-1} contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else if the text block S_{i-2}, two before S_i, contains a keyword related to the agency, such as "agent", then
        T_i belongs to the agency, i.e. c_i = "", d_i = T_i;
    Else if S_{i-2} contains a keyword related to the purchasing unit, such as "purchasing unit", then
        T_i belongs to the purchasing unit, i.e. c_i = T_i, d_i = "";
    Else
        c_i = "", d_i = "";
}
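A Python sketch of this keyword-and-context procedure; the English keywords and the keyword tuples are placeholders for the configurable rules:

```python
AGENT_KEYS = ("agent",)                  # illustrative keyword lists; the
PURCHASER_KEYS = ("purchasing unit",)    # real rules are configurable

def classify(blocks, entities):
    """Assign each extracted entity T_i to the purchasing unit set C or
    the agency set D by scanning ever-wider contexts for keywords."""
    def has(text, keys):
        return any(k in text for k in keys)

    c, d = [], []
    for i, (block, ent) in enumerate(zip(blocks, entities)):
        h = block.find(ent)                  # first position of T_i in S_i
        prefix = block[:max(h, 0)]           # the characters before T_i
        prev1 = blocks[i - 1] if i >= 1 else ""
        prev2 = blocks[i - 2] if i >= 2 else ""
        # Contexts checked from most to least specific, as in the listing:
        # the entity itself, the prefix of S_i, then S_{i-1}, then S_{i-2}.
        for ctx in (ent, prefix, prev1, prev2):
            if has(ctx, AGENT_KEYS):
                c.append(""); d.append(ent)
                break
            if has(ctx, PURCHASER_KEYS):
                c.append(ent); d.append("")
                break
        else:
            c.append(""); d.append("")
    return c, d
```

The agency keyword is tested before the purchasing-unit keyword within each context, matching the order of the If/Else branches above.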
In addition, after information entity category inference, different information entities for a certain attribute may be extracted from a plurality of text blocks of the same original data text; for example, the contact information of the purchasing unit may appear both at the head and at the tail of the bidding information. For this situation, in this embodiment, one of the information entities may be selected according to a preset rule; for example, the information entity appearing first in the original data text is selected as the information entity of the attribute, so as to remove the duplicated information.
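This first-occurrence de-duplication rule can be sketched as follows (the (attribute, value) pair representation is an assumption made for the sketch):

```python
def dedupe_first(entities):
    """Keep, per attribute, only the entity appearing first in the text.

    entities: (attribute, value) pairs in order of appearance in the
    original data text.
    """
    seen, out = set(), []
    for attr, value in entities:
        if value and attr not in seen:
            seen.add(attr)           # later duplicates of attr are dropped
            out.append((attr, value))
    return out
```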
In addition, because original data texts from different sources have different writing formats, the formats of the extracted information entities also need to be unified. Therefore, in this embodiment, the information entities are also normalized according to the relevant rules; for example, amounts are uniformly converted into yuan values and the values of the purchasing mode are standardized, so that the output result is more normalized and convenient to use in decision analysis.
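As an illustration of such normalization, a minimal amount converter; the unit spellings, the regular expression, and the "wan yuan" = 10,000 yuan factor are assumptions for the sketch (real bidding texts use Chinese unit characters and many more formats):

```python
import re

# Illustrative rules only; real texts need far broader format coverage.
UNIT_FACTORS = {"yuan": 1, "wan yuan": 10_000}   # "wan yuan" = 10,000 yuan

def normalize_amount(raw):
    """Convert an amount string such as '12.5 wan yuan' to a yuan value."""
    m = re.match(r"\s*([\d.]+)\s*(wan yuan|yuan)\s*$", raw)
    if m is None:
        return None                               # unrecognized format
    return float(m.group(1)) * UNIT_FACTORS[m.group(2)]
```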
In this embodiment, the original data text is obtained; the original data text is sequentially partitioned to obtain at least one text block; the at least one text block is processed according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and category inference is performed according to a preset rule to determine the category of the at least one information entity, thereby realizing automatic extraction of information entities.
Example two
Fig. 4 is a schematic structural diagram of an information entity extraction apparatus according to a second embodiment of the present application, and as shown in fig. 4, an information entity extraction apparatus 10 in the present embodiment includes:
an acquisition module 11 and a processing module 12.
The acquisition module 11 is used for acquiring an original data text;
the processing module 12 is configured to sequentially block the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Optionally, the processing module 12 is specifically configured to:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
Optionally, the processing module 12 is specifically configured to:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
Optionally, the processing module 12 is specifically configured to:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
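A rough sketch of this blocking pipeline (paragraph segmentation, sentence splitting for overlong paragraphs, and the tail-keyword merge); the sentence-end identifiers, length threshold, and tail keywords are illustrative assumptions:

```python
import re

SENTENCE_END = "。；;"      # assumed end-of-sentence identifiers
MAX_LEN = 200               # assumed character-length threshold

def split_blocks(text, entity_tail_keys=("contact", "address")):
    """Segment raw text into short texts (paragraphs, and sentences for
    overlong paragraphs), then merge a short text with the next one when
    its tail keyword suggests an entity continues across the boundary."""
    shorts = []
    for para in re.split(r"\n+", text):            # paragraph segmentation
        para = para.strip()
        if not para:
            continue
        if len(para) > MAX_LEN:                    # split long paragraphs
            shorts += [s for s in re.split(f"(?<=[{SENTENCE_END}])", para) if s]
        else:
            shorts.append(para)
    blocks, i = [], 0
    while i < len(shorts):                         # semantic-continuity merge
        cur = shorts[i]
        if i + 1 < len(shorts) and any(cur.endswith(k) for k in entity_tail_keys):
            cur += shorts[i + 1]                   # tail keyword: merge next
            i += 1
        blocks.append(cur)
        i += 1
    return blocks
```

The merge step keeps the short texts in their original order, so the resulting blocks remain sequential as required.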
Optionally, the processing module 12 is further configured to:
special identifiers are added at the beginning and the end of each text block.
Optionally, the processing module 12 is specifically configured to:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
Optionally, the processing module 12 is specifically configured to:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
Optionally, the obtaining module 11 is further configured to:
acquiring a sample data text;
the processing module 12 is further configured to:
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes; and carrying out model training according to the training data set to obtain at least one information entity extraction model.
Optionally, the processing module 12 is specifically configured to:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
The information entity extraction device provided by the embodiment can execute the information entity extraction method provided by the method embodiment, and has the corresponding functional modules and beneficial effects of the execution method. The implementation principle and technical effect of this embodiment are similar to those of the above method embodiments, and are not described in detail here.
EXAMPLE III
Fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application, and as shown in fig. 5, the electronic device 20 includes a memory 21, a processor 22, and a computer program stored in the memory and executable on the processor; the number of the processors 22 of the electronic device 20 may be one or more, and one processor 22 is taken as an example in fig. 5; the processor 22 and the memory 21 in the electronic device 20 may be connected by a bus or other means, and fig. 5 illustrates the connection by the bus as an example.
The memory 21 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the acquisition module 11 and the processing module 12 in the embodiment of the present application. The processor 22 executes various functional applications of the device/terminal/server and data processing by running software programs, instructions and modules stored in the memory 21, that is, implements the above-described information entity extraction method.
The memory 21 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 21 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 21 may further include memory located remotely from the processor 22, which may be connected to the device/terminal/server through a network. Examples of such a network include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
Example four
A fourth embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a computer processor, is configured to perform a method for information entity extraction, the method including:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
Of course, the computer program of the computer-readable storage medium provided in this embodiment of the present application is not limited to the method operations described above, and may also perform related operations in the information entity extraction method provided in any embodiment of the present application.
From the above description of the embodiments, it is obvious to those skilled in the art that the present application can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is the better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the embodiment of the information entity extraction apparatus, each included unit and module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (12)

1. An information entity extraction method, comprising:
acquiring an original data text;
sequentially blocking the original data text to obtain at least one text block;
processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text;
and performing category inference according to a preset rule, and determining the category of the at least one information entity.
2. The method of claim 1, wherein partitioning the original data text into at least one text block that is ordered and semantically continuous comprises:
segmenting and/or sentence-dividing the original data text to obtain at least one short text;
and sequencing and carrying out semantic continuity processing on the at least one short text to obtain the at least one text block.
3. The method of claim 2, wherein sequentially blocking the original data text to obtain at least one text block comprises:
segmenting the original data text according to the paragraph item symbol to obtain at least one paragraph text;
and splitting the paragraph text with the character length larger than a set threshold value according to the sentence tail identifier to obtain the at least one short text.
4. The method of claim 2, wherein the sorting and semantic continuity processing of the at least one short text to obtain the at least one text block comprises:
sequencing the at least one short text according to the sequence of the at least one short text in the original data text;
and determining whether the tail key word of the target short text is a part of the target information entity, if so, merging the target short text and the next short text to obtain the at least one text block.
5. The method of claim 2, further comprising:
special identifiers are added at the beginning and the end of each text block.
6. The method of claim 1, wherein the processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text comprises:
performing feature coding on the at least one text block to obtain a two-dimensional dictionary list of each text block;
performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block;
and according to the target labeling sequence, extracting characters from the two-dimensional dictionary list to obtain the information entity contained in each text block.
7. The method of claim 6, wherein the performing sequence labeling prediction on the two-dimensional dictionary list according to a preset algorithm to obtain a target labeling sequence of each text block comprises:
calculating the conditional probability of each word sequence marked as a candidate label in the two-dimensional dictionary list according to a Conditional Random Field (CRF) algorithm;
and searching the optimal label from the candidate labels through a Viterbi algorithm according to the conditional probability to obtain the target label sequence.
8. The method according to any of claims 1-7, wherein before processing the at least one text block according to a pre-constructed information entity extraction model to obtain the at least one information entity contained in the original data text, the method further comprises:
acquiring a sample data text;
marking the sample data text according to a target information entity to obtain a training data set, wherein the target information entity is obtained by combining information entities with the same type of attributes;
and carrying out model training according to the training data set to obtain at least one information entity extraction model.
9. The method of claim 8, wherein said tagging the sample data text according to a target information entity to obtain a training data set comprises:
and respectively marking prefix keywords of the target information entity and the target information entity in the sample data text to obtain the training data set.
10. An information entity extraction apparatus, comprising:
the acquisition module is used for acquiring an original data text;
the processing module is used for sequentially blocking the original data text to obtain at least one text block; processing the at least one text block according to a pre-constructed information entity extraction model to obtain at least one information entity contained in the original data text; and performing category inference according to a preset rule, and determining the category of the at least one information entity.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the information entity extraction method according to any one of claims 1 to 9 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the information entity extraction method according to any one of claims 1 to 9.
CN202110313303.9A 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium Pending CN113033204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313303.9A CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110313303.9A CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113033204A true CN113033204A (en) 2021-06-25

Family

ID=76473685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313303.9A Pending CN113033204A (en) 2021-03-24 2021-03-24 Information entity extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113033204A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663549A (en) * 2023-05-18 2023-08-29 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN117034942A (en) * 2023-10-07 2023-11-10 之江实验室 Named entity recognition method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008472A (en) * 2019-03-29 2019-07-12 北京明略软件系统有限公司 A kind of method, apparatus, equipment and computer readable storage medium that entity extracts
CN110276054A (en) * 2019-05-16 2019-09-24 湖南大学 A kind of insurance text structure implementation method
CN111444717A (en) * 2018-12-28 2020-07-24 天津幸福生命科技有限公司 Method and device for extracting medical entity information, storage medium and electronic equipment
CN112257421A (en) * 2020-12-21 2021-01-22 完美世界(北京)软件科技发展有限公司 Nested entity data identification method and device and electronic equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663549A (en) * 2023-05-18 2023-08-29 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN116663549B (en) * 2023-05-18 2024-03-19 海南科技职业大学 Digitized management method, system and storage medium based on enterprise files
CN117034942A (en) * 2023-10-07 2023-11-10 之江实验室 Named entity recognition method, device, equipment and readable storage medium
CN117034942B (en) * 2023-10-07 2024-01-09 之江实验室 Named entity recognition method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110516247B (en) Named entity recognition method based on neural network and computer storage medium
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
CN111046660B (en) Method and device for identifying text professional terms
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN113033183A (en) Network new word discovery method and system based on statistics and similarity
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN110751234A (en) OCR recognition error correction method, device and equipment
CN110674301A (en) Emotional tendency prediction method, device and system and storage medium
CN110705261B (en) Chinese text word segmentation method and system thereof
CN113157918B (en) Commodity name short text classification method and system based on attention mechanism
CN112560425A (en) Template generation method and device, electronic equipment and storage medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN116090450A (en) Text processing method and computing device
CN114840642A (en) Event extraction method, device, equipment and storage medium
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN113971403A (en) Entity identification method and system considering text semantic information
CN113240485A (en) Training method of text generation model, and text generation method and device
CN113641800B (en) Text duplicate checking method, device and equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination