CN111291550B - Chinese entity extraction method and device - Google Patents

Chinese entity extraction method and device

Info

Publication number
CN111291550B
CN111291550B (application CN202010054462.7A)
Authority
CN
China
Prior art keywords
short term
word
term memory
memory network
clause
Prior art date
Legal status
Active
Application number
CN202010054462.7A
Other languages
Chinese (zh)
Other versions
CN111291550A (en)
Inventor
董哲
邵若琦
康宇佳
李月恒
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010054462.7A priority Critical patent/CN111291550B/en
Publication of CN111291550A publication Critical patent/CN111291550A/en
Application granted granted Critical
Publication of CN111291550B publication Critical patent/CN111291550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks

Abstract

The embodiment of the invention discloses a Chinese entity extraction method and device, wherein the method comprises the following steps: segmenting the target source sentence into clauses; vectorizing the words in the clauses to obtain word vectors; determining, according to the word vectors and a hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word, obtained by a long-short term memory network LSTM; inputting the probability matrix into a CRF model to obtain, for each word, the label with the maximum probability among the labels corresponding to that word; and extracting the entity composed of the words corresponding to the maximum-probability labels. In the embodiment of the invention, segmenting the target source sentence into clauses facilitates subsequently learning the semantic representation within clauses at the word level and the semantic representation among clauses at the clause level; through the CRF model, the maximum-probability label for each word is determined and the Chinese entity composed of the corresponding words is extracted, so that the accuracy of Chinese entity recognition is improved.

Description

Chinese entity extraction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a Chinese entity extraction method and device.
Background
With the progress of science and technology and the digitization of information, various industries have undergone great changes and innovations.
In recent years, entity recognition in specific fields has attracted continuous interest. In the field of food safety, for example, NER (Named Entity Recognition) automatically identifies entities related to food and generates structured data that helps construct a knowledge graph of the food field. Domain-specific cases are usually recorded by a logger, but the logger sometimes uses Chinese abbreviations, resulting in multiple expressions for the same entity. For entities in which Chinese characters, letters, numbers, and punctuation marks are mixed together, recognition is even more difficult.
At present, entities in a specific field have a certain domain specificity, and research on identifying them is not yet deep enough. Deep neural networks have achieved good experimental results in identifying entities in general-domain text, but perform worse on domain-specific text. In addition, a domain-specific entity may appear in different positions of a sentence, so different information is required to identify it; that is, the context information influences domain-specific entity recognition to different degrees. To identify such entities accurately, the context information of the sentence must be fully considered during recognition. In the prior art, a long sentence is directly input as a whole into a BiLSTM-CRF (Bi-directional Long Short-Term Memory - Conditional Random Field) model, and this approach does not sufficiently consider the semantic information of the sentence.
Disclosure of Invention
To address the problems of the existing methods, the embodiments of the invention provide a Chinese entity extraction method and device.
In a first aspect, an embodiment of the present invention provides a Chinese entity extraction method, including:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
vectorizing the words in the clauses to obtain word vectors;
determining, according to the word vectors and a hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word, obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
inputting the probability matrix into a conditional random field model CRF to obtain, for each word, the label with the maximum probability among the labels corresponding to that word;
and extracting Chinese entities consisting of words corresponding to the labels with the maximum probability.
Optionally, the segmenting the target source sentence based on the punctuation marks to obtain a clause includes:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Optionally, the vectorizing the words in the clause to obtain a word vector includes:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Optionally, the determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, of the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM, wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM, includes:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In a second aspect, an embodiment of the present invention further provides a Chinese entity extraction apparatus, including: a segmentation module, a vectorization processing module, a determining module, an obtaining module and an extracting module;
the segmentation module is used for segmenting a target source sentence based on punctuation marks to obtain clauses;
the vectorization processing module is used for vectorizing the words in the clauses to obtain word vectors;
the determining module is used for determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
the obtaining module is used for inputting the probability matrix into a conditional random field model CRF to obtain, for each word, the label with the maximum probability among the labels corresponding to that word;
the extraction module is used for extracting Chinese entities formed by the words corresponding to the maximum-probability labels.
Optionally, the segmentation module is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Optionally, the vectorization processing module is specifically configured to:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Optionally, the determining module is specifically configured to:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the technical scheme, the target source sentence is divided into clauses, which facilitates the subsequent expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; through a conditional random field model CRF, the label with the maximum probability among the labels corresponding to each word is determined and the entity composed of the words corresponding to the maximum-probability labels is extracted, so that the accuracy of entity recognition is improved; and the semantic information of the sentence can be learned more comprehensively through the hierarchical BiLSTM network.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for extracting Chinese entities according to an embodiment of the present invention;
FIG. 2 is another schematic flow chart of a method for extracting Chinese entities according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a Chinese entity extraction apparatus according to an embodiment of the present invention;
fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a schematic flow chart of the Chinese entity extraction method provided in this embodiment, including:
and S11, segmenting the target source sentence based on the punctuation marks to obtain clauses.
In the embodiment of the present invention, the target source sentence is a sentence of an entity to be extracted.
In the embodiment of the present invention, the entity may appear in different positions of the target source sentence, and different positions require different information for identifying the entity; that is, the context information influences entity recognition to different degrees. To identify the entity accurately, the context information of the sentence must be fully considered during recognition. Therefore, in the embodiment of the present invention, the target source sentence is not directly input into a BiLSTM (Bi-directional Long Short-Term Memory) network as a whole sentence; instead, the target source sentence is segmented at its punctuation marks into clauses, which then pass through the BiLSTM network, so that the semantic information of the target source sentence can be learned more comprehensively.
In the embodiment of the present invention, specifically, as shown in fig. 2, take the field of food safety as an example, with the target source sentence "Xu Yufang shop, knows its baking powder." The target source sentence is segmented at the comma and the full stop, yielding two clauses: "Xu Yufang shop" and "knows its baking powder".
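To make the segmentation step concrete, the following is a minimal sketch; the delimiter set and helper names are illustrative assumptions, not the patented implementation:

```python
import re

# Assumed set of clause-boundary punctuation (Chinese and ASCII); illustrative only.
CLAUSE_DELIMITERS = r"[，。；！？、,.;!?]"

def split_into_clauses(sentence):
    """Segment a target source sentence into clauses at punctuation marks (step S11)."""
    parts = re.split(CLAUSE_DELIMITERS, sentence)
    return [p.strip() for p in parts if p.strip()]

def add_end_marker(clauses):
    """Append the special clause-termination mark <end> after the last word of each clause."""
    return [clause + " <end>" for clause in clauses]

clauses = split_into_clauses("Xu Yufang shop, knows its baking powder.")
print(add_end_marker(clauses))  # ['Xu Yufang shop <end>', 'knows its baking powder <end>']
```

The `<end>` marker corresponds to the special clause-termination mark described later in the claims.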
And S12, vectorizing the words in the clauses to obtain word vectors.
In the embodiment of the invention, food safety field cases are processed with a deep learning neural network, whose input data requires a vectorized representation.
In the embodiment of the present invention, the words in the clauses obtained in S11 are vectorized to obtain word vectors. Specifically, each character in the clauses "Xu Yufang shop" and "knows its baking powder" is vectorized, yielding the word vector corresponding to each character.
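The Skip-gram model of Word2vec learns a vector for each word by predicting its context words from the center word; the (center, context) pair generation at its core can be sketched as follows (a toy illustration of the training-pair step only, not actual Word2vec code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in the Skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# One token per character, as in the character-level clauses of the running example
# (romanized stand-ins for the Chinese characters).
print(skipgram_pairs(["xu", "yu", "fang", "shop"], window=1))
# [('xu', 'yu'), ('yu', 'xu'), ('yu', 'fang'), ('fang', 'yu'), ('fang', 'shop'), ('shop', 'fang')]
```

In practice these pairs feed a shallow network whose input weights become the word vectors.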
S13, determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM.
In the embodiment of the invention, the word vectors are first input into the first bidirectional long-short term memory network BiLSTM of the hierarchy; the output of the first BiLSTM is then input into the second bidirectional long-short term memory network BiLSTM of the hierarchy; and finally the outputs of the first BiLSTM and the second BiLSTM are respectively input into the long-short term memory network LSTM, which determines the probability matrix of the labels corresponding to each word.
In the embodiment of the invention, the hierarchical bidirectional long-short term memory network BiLSTM is connected with the long-short term memory network LSTM. It should be noted that word vectors annotated with labels are used as a training set to train the connected hierarchical BiLSTM and LSTM networks.
And S14, inputting the probability matrix into a conditional random field model CRF to obtain the label with the maximum probability in the labels corresponding to each word.
In the embodiment of the present invention, the probability matrix obtained in S13 is used as an input of a CRF (Conditional Random Field) model. And the output of the CRF model is the label with the maximum probability in the labels respectively corresponding to each word.
And S15, extracting Chinese entities consisting of words corresponding to the labels with the maximum probability.
In the embodiment of the invention, a Chinese entity is composed of one or more words.
In the embodiment of the invention, the Chinese entity consisting of the word corresponding to the label with the maximum probability is extracted.
The embodiment of the invention divides the target source sentence into clauses, which facilitates subsequently learning the expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; through the conditional random field model CRF, the label with the maximum probability among the labels corresponding to each word is determined, and the Chinese entity composed of the words corresponding to the maximum-probability labels is extracted, so that the accuracy of entity recognition is improved.
Further, on the basis of the above method embodiment, the segmenting a target source sentence based on punctuation marks to obtain clauses includes:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
In the embodiment of the invention, the target source sentence is segmented at punctuation marks to obtain the clauses. Assuming the target source sentence is x, segmentation yields i clauses x_1, x_2, …, x_i, where the words of clause x_1 are denoted x_1 = (x_{1,1}, x_{1,2}, …, x_{1,j}); likewise, the words of clause x_i are denoted x_i = (x_{i,1}, x_{i,2}, …, x_{i,j}). A special mark, e.g. <end>, is appended after each clause x_1, x_2, …, x_i; the special mark represents clause termination.
The embodiment of the invention divides the target source sentence into clauses, which facilitates subsequently learning the expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; a special mark is added after the last word of each clause to distinguish different clauses.
Further, on the basis of the above method embodiment, the vectorizing the words in the clause to obtain a word vector includes:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
In the embodiment of the invention, the words in the clauses x_1, x_2, …, x_i are vectorized with the Skip-gram model of Word2vec to obtain the word vectors.
The embodiment of the invention vectorizes the words in the clauses to obtain the word vectors, ensuring a well-formed input for the BiLSTM network.
Further, on the basis of the above method embodiment, the determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, of the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM, wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM, includes:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In the embodiment of the present invention, "first" and "second" in the first bidirectional long-short term memory network BiLSTM and the second bidirectional long-short term memory network BiLSTM are used to distinguish two different BiLSTMs. A forward LSTM and a backward LSTM are combined into a BiLSTM. The hierarchical BiLSTM is composed of an intra-clause BiLSTM and an inter-clause BiLSTM, i.e. the first bidirectional long-short term memory network BiLSTM and the second bidirectional long-short term memory network BiLSTM.
In the embodiment of the invention, the word vectors are input into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors, and the clause semantic vectors are input into the second bidirectional long-short term memory network BiLSTM to obtain the target source sentence semantic vector. Specifically, the hierarchical BiLSTM network is composed of the first and second BiLSTMs, and it replaces the hidden node of an ordinary recurrent neural network with a memory unit. The memory unit is controlled by 3 gates: a forget gate, an input gate, and an output gate. The forget gate, denoted f, determines which information in the cell state is discarded and which is retained: the information from the previous hidden state and the current input are passed together to a sigmoid function whose output lies between 0 and 1; the closer to 0, the more the information is discarded, and the closer to 1, the more it is retained. The input gate, denoted i, determines which information is added to the cell state; it likewise receives the previous hidden state and the current input. The output gate, denoted o, is multiplied element-wise by the intermediate state value at the current moment to produce the final output of the memory unit.
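The three gates can be illustrated with a minimal single-step LSTM cell using scalar states; the weights below are arbitrary toy values, whereas a real memory unit uses learned weight matrices over vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar states; w maps each gate to (w_x, w_h, b)."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])  # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])  # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])  # output gate
    c_tilde = math.tanh(w["c"][0] * x + w["c"][1] * h_prev + w["c"][2])
    c = f * c_prev + i * c_tilde   # cell state: keep old info, add new info
    h = o * math.tanh(c)           # hidden state: gated output of the memory unit
    return h, c

w = {"f": (0.5, 0.1, 0.0), "i": (0.5, 0.1, 0.0),
     "o": (0.5, 0.1, 0.0), "c": (1.0, 0.2, 0.0)}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

Running the same step over a sequence of word vectors, once forward and once backward, and concatenating the two hidden states gives the BiLSTM representation used here.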
In the embodiment of the present invention, the bottom BiLSTM network, i.e., the first bidirectional long-short term memory network BiLSTM, first learns the representation of the semantics within a clause at the word level, i.e., between words; the learned intra-clause semantic representation is then used as the input of the upper BiLSTM network, i.e., the second bidirectional long-short term memory network BiLSTM, which learns the representation of the semantics between clauses at the clause level. Specifically, the first bidirectional long-short term memory network BiLSTM within a clause is computed as follows:

$\overrightarrow{h}^{w}_{i,j}=\overrightarrow{\mathrm{LSTM}}\big(v(x_{i,j}),\overrightarrow{h}^{w}_{i,j-1}\big)$

$\overleftarrow{h}^{w}_{i,j}=\overleftarrow{\mathrm{LSTM}}\big(v(x_{i,j}),\overleftarrow{h}^{w}_{i,j+1}\big)$

$\overrightarrow{h}^{w}_{i,j}$ and $\overleftarrow{h}^{w}_{i,j}$ are the forward and backward hidden states of the j-th word in the i-th clause, and $v(x_{i,j})$ is the vectorized representation of the word $x_{i,j}$. Their concatenation $h^{w}_{i,j}=[\overrightarrow{h}^{w}_{i,j};\overleftarrow{h}^{w}_{i,j}]$ summarizes the semantic information of the words before and after the j-th word in the clause. More specifically, the forward hidden state $\overrightarrow{h}^{w}_{i,j}$ is calculated as:

An input gate: $i_{j}=\sigma\big(W^{w}_{i}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{i}\big)$

A forget gate: $f_{j}=\sigma\big(W^{w}_{f}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{f}\big)$

An output gate: $o_{j}=\sigma\big(W^{w}_{o}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{o}\big)$

A candidate state: $\tilde{c}_{j}=\tanh\big(W^{w}_{c}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{c}\big)$

Word-level cell state: $c_{j}=f_{j}\odot c_{j-1}+i_{j}\odot\tilde{c}_{j}$

Hidden state at word level: $\overrightarrow{h}^{w}_{i,j}=o_{j}\odot\tanh(c_{j})$

where $W^{w}_{i},W^{w}_{f},W^{w}_{o},W^{w}_{c}$ are weight matrices, $\odot$ denotes element-wise multiplication, and $\sigma$ denotes the sigmoid activation function.

The last hidden states of the first bidirectional long-short term memory network BiLSTM in both directions are used together to represent the whole clause, in the form:

$h^{s}_{i}=[\overrightarrow{h}^{w}_{i,|x_{i}|};\overleftarrow{h}^{w}_{i,1}]$
the formula for using the second bidirectional long-short term memory network BilSTM between clauses is as follows:
Figure BDA0002372326130000109
Figure BDA00023723261300001010
Figure BDA00023723261300001011
and
Figure BDA00023723261300001012
are forward and backward hidden states between clauses,
Figure BDA00023723261300001013
to learn the semantics of the entire target source sentence.
In the embodiment of the invention, the clause semantic vectors and the target source sentence semantic vector are input into the long-short term memory network LSTM to obtain the probability matrix of the labels corresponding to each word. Specifically, the long-short term memory network LSTM is used as a decoder:

$L=\mathrm{LSTM}(H_{w},H_{s})$

During decoding, whenever a special mark is predicted, decoding of the words of the current clause is finished and decoding of the words of the next clause begins. The hidden state $s_{t}$ of the decoder is calculated as follows:

$s_{t}=o_{t}\odot\tanh(C_{t})$

$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot\tilde{C}_{t}$

$\tilde{C}_{t}=\tanh\big(w_{cs}s_{t-1}+w_{cy}v(y_{t-1})+(1-g_{t})w_{cw}H_{w}+g_{t}w_{cs}H_{s}\big)$

$g_{t}=\sigma\big(w_{gs}s_{t-1}+w_{gy}v(y_{t-1})+w_{gw}H_{w}+w_{gs}H_{s}\big)$

$f_{t}=\sigma\big(w_{fs}s_{t-1}+w_{fy}v(y_{t-1})+(1-g_{t})w_{fw}H_{w}+g_{t}w_{fs}H_{s}\big)$

$i_{t}=\sigma\big(w_{is}s_{t-1}+w_{iy}v(y_{t-1})+(1-g_{t})w_{iw}H_{w}+g_{t}w_{is}H_{s}\big)$

$o_{t}=\sigma\big(w_{os}s_{t-1}+w_{oy}v(y_{t-1})+(1-g_{t})w_{ow}H_{w}+g_{t}w_{os}H_{s}\big)$

where $w_{cs},w_{cy},w_{cw},w_{gs},w_{gy},w_{gw},w_{fs},w_{fy},w_{fw},w_{is},w_{iy},w_{iw},w_{os},w_{oy},w_{ow}$ and $W$ are weight matrices, and $f_{t},i_{t},o_{t},g_{t}$, computed from $s_{t-1},y_{t-1},H_{w},H_{s}$, are respectively the forget gate, the input gate, the output gate, and the gate weighing the semantic information of the hierarchical BiLSTM. For an input $X=(x_{1},x_{2},\ldots,x_{i})$, a predicted sequence $y=(y_{1},y_{2},\ldots,y_{t})$ is obtained, and the probability matrix of the predicted sequence label $y_{t}$ is defined as:

$P(y_{t}\mid s_{t-1},y_{t-1},H_{w},H_{s})\propto\exp\big(v(y_{t})^{T}Wm_{t}\big)$

where $v(y_{t})^{T}$ denotes the vectorized representation of the label $y_{t}$, $W\in\mathbb{R}^{2k\times n}$ is a vector matrix, $k$ is the number of hidden units of the BiLSTM, $n$ is the number of label types, and $m_{t}$ is the combined embedding vector obtained by concatenating $s_{t-1},y_{t-1},H_{w},H_{s}$.
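A distinctive element of this decoder is the gate g_t, which balances intra-clause information H_w against inter-clause information H_s; stripped of the learned weights, the mixing is a convex combination (scalar toy values for illustration only):

```python
def gated_mix(h_word, h_sent, g):
    """Blend word-level and sentence-level features: (1 - g) * h_word + g * h_sent.
    g comes from a sigmoid, so it lies in [0, 1]."""
    assert 0.0 <= g <= 1.0
    return (1.0 - g) * h_word + g * h_sent

# g near 1 -> the decoder relies mostly on inter-clause (sentence-level) semantics.
mixed = gated_mix(0.2, 0.8, 0.9)
```

This lets the decoder weigh word-level against clause-level context differently at each step, matching the observation that context influences entity recognition to different degrees.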
In the embodiment of the invention, a standard CRF layer is used on top of the model (as shown in fig. 2), and the output of the LSTM decoding layer is converted into the input of the CRF by a linear function:

$P=LW_{p}+b_{p}$

where $W_{p}\in\mathbb{R}^{2k\times n}$ and $b_{p}\in\mathbb{R}^{n}$ are parameters to be learned.

For an input $X$, the probability of outputting the best label sequence $y$ can be defined as:

$p(y\mid X)=\dfrac{\exp\big(s(X,y)\big)}{\sum_{\tilde{y}\in Y_{X}}\exp\big(s(X,\tilde{y})\big)}$

For each input $X$, the score of a possible annotation sequence $y$ is found as:

$s(X,y)=\sum_{t=0}^{m}A_{y_{t},y_{t+1}}+\sum_{t=1}^{m}P_{t,y_{t}}$

where $P_{t,y_{t}}$ is the non-normalized probability that the t-th word maps to the label $y_{t}$, and $A_{y_{t},y_{t+1}}$ is the transition probability from $y_{t}$ to $y_{t+1}$. When the number of label types (B-per, B-loc, ...) is n, the dimension of the transition probability matrix in the CRF model is (n+2) × (n+2), because a start position and an end position are additionally added; m is the length of the input clause. The label set is ('O', 'B-per', 'I-per', 'B-loc', 'I-loc', 'B-food', 'I-food', 'B-add', 'I-add', 'B-att', 'I-att'), and $Y_{X}$ denotes all possible annotation sequences over it. The s function in the numerator scores the correct label sequence, and the s function in the denominator scores each possible label sequence; the larger the value of p(y|X), the more accurate the prediction.

During model training, the loss function is defined as the negative log-likelihood:

$\mathcal{L}=-\log p(y\mid X)$

The loss function value is calculated and the network parameters are continuously updated until the iterations finish.
The embodiment of the invention can learn the semantic information of the sentence more comprehensively through the hierarchical BiLSTM network.
It should be noted that, in the embodiment of the present invention, the annotation labels adopt the BIO labelling scheme. "O" represents a general word that is not part of an entity; "B-per" represents the beginning word of a person name entity; "I-per" represents a middle word of a person name entity; "B-loc" represents the beginning word of a place name entity; "I-loc" represents a middle word of a place name entity; "B-food" represents the beginning word of a food name entity; "I-food" represents a middle word of a food name entity; "B-add" represents the beginning word of a food additive name entity; "I-add" represents a middle word of a food additive name entity; "B-att" represents the beginning word of a case attribute entity; "I-att" represents a middle word of a case attribute entity.
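Given the BIO labels chosen for each word, extracting the entities (step S15) reduces to collecting B-/I- runs of the same type; a minimal sketch, with romanized tokens standing in for the Chinese characters of the running example:

```python
def bio_to_entities(words, tags):
    """Collect (entity_text, entity_type) spans from BIO-labelled words."""
    entities, current, etype = [], [], None
    for w, t in zip(words, tags):
        if t.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [w], t[2:]
        elif t.startswith("I-") and current and t[2:] == etype:
            current.append(w)
        else:  # "O" or an I- tag that does not continue the open span
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

words = ["xu", "yu", "fang", "shop", "know", "its", "baking", "powder"]
tags  = ["B-per", "I-per", "I-per", "O", "O", "O", "B-food", "I-food"]
print(bio_to_entities(words, tags))
# [('xuyufang', 'per'), ('bakingpowder', 'food')]
```

Characters are joined without separators, as appropriate for Chinese text.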
It should be noted that, in the embodiment of the present invention, each word has a plurality of candidate labels, and the most probable label (i.e., the one in the best label sequence y mentioned above) is selected for each word by the CRF model. The words corresponding to the maximum-probability labels form the Chinese entities. As shown in fig. 2, the Chinese entities "Xu Yufang" and "baking powder" are screened out.
Fig. 3 is a schematic structural diagram of the Chinese entity extraction device provided in this embodiment, where the device includes: a segmentation module 31, a vectorization processing module 32, a determining module 33, an obtaining module 34 and an extracting module 35;
the segmentation module 31 is configured to segment the target source sentence based on the punctuation marks to obtain clauses;
the vectorization processing module 32 is configured to perform vectorization processing on the words in the clauses to obtain word vectors;
the determining module 33 is configured to determine, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
the obtaining module 34 is configured to input the probability matrix into a conditional random field model CRF to obtain a label with a highest probability among labels respectively corresponding to each word;
the extracting module 35 is configured to extract a chinese entity composed of words corresponding to the label with the maximum probability.
Further, on the basis of the above device embodiment, the segmentation module 31 is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Further, on the basis of the above device embodiment, the vectorization processing module 32 is specifically configured to:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Further, on the basis of the above device embodiment, the determining module 33 is specifically configured to:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
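A shape-level numpy sketch of the first hierarchical level, not the patent's trained network: a minimal LSTM cell is run forward and backward over a clause's word vectors, and the two final hidden states are concatenated into a clause semantic vector. The second-level BiLSTM would be applied to the sequence of clause vectors in exactly the same way. All dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4  # word-vector and hidden dimensions (assumed)

def lstm_pass(xs, params):
    """Run a single-direction LSTM over xs; return the final hidden state."""
    Wx, Wh, b = params
    h = np.zeros(H)
    c = np.zeros(H)
    for x in xs:
        z = Wx @ x + Wh @ h + b                        # four gates stacked: i, f, o, g
        i, f, o = (1 / (1 + np.exp(-z[k*H:(k+1)*H])) for k in range(3))
        g = np.tanh(z[3*H:])
        c = f * c + i * g                              # cell-state update
        h = o * np.tanh(c)                             # hidden-state update
    return h

def make_params():
    return (rng.normal(size=(4*H, D)) * 0.1,
            rng.normal(size=(4*H, H)) * 0.1,
            np.zeros(4*H))

def bilstm_clause_vector(word_vectors, fwd, bwd):
    """First-level BiLSTM: concatenate forward and backward final states."""
    return np.concatenate([lstm_pass(word_vectors, fwd),
                           lstm_pass(word_vectors[::-1], bwd)])

words = rng.normal(size=(5, D))   # one clause of 5 word vectors
clause_vec = bilstm_clause_vector(words, make_params(), make_params())
print(clause_vec.shape)           # (2*H,) = (8,)
```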
The Chinese entity extraction device described in this embodiment may be used to execute the above method embodiments; the principle and technical effect are similar and are not described herein again.
Referring to fig. 4, the electronic device includes: a processor (processor)41, a memory (memory)42, and a bus 43;
wherein,
the processor 41 and the memory 42 communicate with each other through the bus 43;
the processor 41 is configured to call program instructions in the memory 42 to perform the methods provided by the above-described method embodiments.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments.
The above-described embodiments of the apparatus are merely illustrative. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A Chinese entity extraction method is characterized by comprising the following steps:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
vectorizing the words in the clauses to obtain word vectors;
determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, a probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM; wherein the hierarchical bidirectional long short-term memory network BiLSTM comprises a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM;
inputting the probability matrix into a conditional random field model CRF to obtain the label with the highest probability among the labels corresponding to each word;
extracting a Chinese entity composed of the words corresponding to the labels with the highest probability;
wherein the determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM, the hierarchical bidirectional long short-term memory network BiLSTM comprising a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM, comprises:
inputting the word vector into the first bidirectional long short-term memory network BiLSTM to obtain a clause semantic vector;
inputting the clause semantic vector into the second bidirectional long short-term memory network BiLSTM to obtain a target source sentence semantic vector;
inputting the clause semantic vector and the target source sentence semantic vector into the long short-term memory network LSTM to obtain the probability matrix of the labels corresponding to each word.
2. The method of claim 1, wherein the segmenting a target source sentence based on punctuation to obtain clauses comprises:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
3. The method of claim 1, wherein the vectorizing the words in the clauses to obtain a word vector comprises:
vectorizing the words in the clauses by using the Skip-gram model of Word2vec to obtain word vectors.
4. A Chinese entity extraction device, comprising: a segmentation module, a vectorization processing module, a determining module, an obtaining module and an extracting module;
the segmentation module is used for segmenting a target source sentence based on punctuation marks to obtain clauses;
the vectorization processing module is used for vectorizing the words in the clauses to obtain word vectors;
the determining module is used for determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM; wherein the hierarchical bidirectional long short-term memory network BiLSTM comprises a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM;
the obtaining module is used for inputting the probability matrix into a conditional random field model CRF to obtain the label with the highest probability among the labels corresponding to each word;
the extraction module is used for extracting a Chinese entity composed of the words corresponding to the labels with the highest probability;
wherein the determining module is specifically configured to:
inputting the word vector into the first bidirectional long short-term memory network BiLSTM to obtain a clause semantic vector;
inputting the clause semantic vector into the second bidirectional long short-term memory network BiLSTM to obtain a target source sentence semantic vector;
inputting the clause semantic vector and the target source sentence semantic vector into the long short-term memory network LSTM to obtain the probability matrix of the labels corresponding to each word.
5. The Chinese entity extraction device of claim 4, wherein the segmentation module is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
6. The Chinese entity extraction device of claim 4, wherein the vectorization processing module is specifically configured to:
vectorizing the words in the clauses by using the Skip-gram model of Word2vec to obtain word vectors.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the Chinese entity extraction method of any of claims 1 to 3 when executing the program.
8. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the Chinese entity extraction method of any of claims 1 to 3.
CN202010054462.7A 2020-01-17 2020-01-17 Chinese entity extraction method and device Active CN111291550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010054462.7A CN111291550B (en) 2020-01-17 2020-01-17 Chinese entity extraction method and device

Publications (2)

Publication Number Publication Date
CN111291550A CN111291550A (en) 2020-06-16
CN111291550B true CN111291550B (en) 2021-09-03

Family

ID=71026284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010054462.7A Active CN111291550B (en) 2020-01-17 2020-01-17 Chinese entity extraction method and device

Country Status (1)

Country Link
CN (1) CN111291550B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215005A (en) * 2020-10-12 2021-01-12 小红书科技有限公司 Entity identification method and device
CN113326691B (en) * 2021-05-27 2023-07-28 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN109460473A (en) * 2018-11-21 2019-03-12 中南大学 The electronic health record multi-tag classification method with character representation is extracted based on symptom
CN109858041A (en) * 2019-03-07 2019-06-07 北京百分点信息科技有限公司 A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries
CN110232192A (en) * 2019-06-19 2019-09-13 中国电力科学研究院有限公司 Electric power term names entity recognition method and device
CN110348016A (en) * 2019-07-15 2019-10-18 昆明理工大学 Text snippet generation method based on sentence association attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11600194B2 (en) * 2018-05-18 2023-03-07 Salesforce.Com, Inc. Multitask learning as question answering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant