CN111291550B - Chinese entity extraction method and device - Google Patents

Chinese entity extraction method and device

Info

Publication number
CN111291550B
CN111291550B (application CN202010054462.7A)
Authority
CN
China
Prior art keywords
short term
word
term memory
memory network
clause
Prior art date
Legal status
Active
Application number
CN202010054462.7A
Other languages
Chinese (zh)
Other versions
CN111291550A (en)
Inventor
董哲
邵若琦
康宇佳
李月恒
Current Assignee
North China University of Technology
Original Assignee
North China University of Technology
Priority date
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010054462.7A priority Critical patent/CN111291550B/en
Publication of CN111291550A publication Critical patent/CN111291550A/en
Application granted granted Critical
Publication of CN111291550B publication Critical patent/CN111291550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks

Abstract

The embodiment of the invention discloses a Chinese entity extraction method and device, wherein the method comprises the following steps: segmenting the target source sentence into clauses; vectorizing the words in the clauses to obtain word vectors; determining, according to the word vectors and a hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word, obtained by a long-short term memory network LSTM; inputting the probability matrix into a CRF model to obtain, for each word, the label with the maximum probability among the labels corresponding to that word; and extracting the entity composed of the words corresponding to the maximum-probability labels. In the embodiment of the invention, segmenting the target source sentence into clauses facilitates subsequently learning the semantic representation within clauses at the word level and the semantic representation among clauses at the clause level; through the CRF model, the maximum-probability label for each word is determined and the Chinese entity composed of the corresponding words is extracted, so that the accuracy of Chinese entity recognition is improved.

Description

Chinese entity extraction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a Chinese entity extraction method and device.
Background
With the progress of science and technology and the digitization of information, various industries have undergone great changes and innovations.
In recent years, entity recognition in specific fields has attracted continuous interest. In the field of food safety, for example, NER (Named Entity Recognition) automatically identifies entities related to food and generates structured data that helps construct a knowledge graph of the food field. Domain-specific cases are usually recorded by a logger, but the logger sometimes uses Chinese abbreviations, resulting in multiple expressions for the same entity. For entities in which Chinese characters, letters, numbers, and punctuation marks are mixed together, recognition is even more difficult.
At present, entities in a specific field have a certain domain specificity, and research on identifying them is not yet deep enough. Deep neural networks have achieved good experimental results in identifying entities in general-domain text, but perform worse on domain-specific text. In addition, a domain-specific entity may appear in different positions of a sentence, so different information is required to identify it; that is, the context information influences domain-specific entity recognition to different degrees. To identify such entities accurately, the context information of the sentence must be fully considered during recognition. In the prior art, a long sentence is directly input as a whole into a BiLSTM-CRF (Bi-directional Long Short-Term Memory - Conditional Random Field) model, and this approach does not sufficiently consider the semantic information of the sentence.
Disclosure of Invention
To address the problems of the existing methods, the embodiments of the invention provide a Chinese entity extraction method and device.
In a first aspect, an embodiment of the present invention provides a Chinese entity extraction method, including:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
vectorizing the words in the clauses to obtain word vectors;
determining, according to the word vectors and a hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word, obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
inputting the probability matrix into a conditional random field model CRF to obtain, for each word, the label with the maximum probability among the labels corresponding to that word;
and extracting Chinese entities consisting of words corresponding to the labels with the maximum probability.
Optionally, the segmenting the target source sentence based on the punctuation marks to obtain a clause includes:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Optionally, the vectorizing the words in the clause to obtain a word vector includes:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Optionally, the determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, of the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM, wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM, includes:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In a second aspect, an embodiment of the present invention further provides a Chinese entity extraction apparatus, including: a segmentation module, a vectorization processing module, a determining module, an obtaining module and an extracting module;
the segmentation module is used for segmenting a target source sentence based on punctuation marks to obtain clauses;
the vectorization processing module is used for vectorizing the words in the clauses to obtain word vectors;
the determining module is used for determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
the obtaining module is used for inputting the probability matrix into a conditional random field model CRF to obtain, for each word, the label with the maximum probability among the labels corresponding to that word;
the extraction module is used for extracting Chinese entities formed by the words corresponding to the maximum-probability labels.
Optionally, the segmentation module is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Optionally, the vectorization processing module is specifically configured to:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Optionally, the determining module is specifically configured to:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the above-described methods.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program, which causes the computer to execute the above method.
According to the technical scheme, the target source sentence is divided into clauses, which facilitates the subsequent expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; through a conditional random field model CRF, the label with the maximum probability among the labels corresponding to each word is determined and the entity composed of the words corresponding to the maximum-probability labels is extracted, so that the accuracy of entity recognition is improved; and the semantic information of the sentence can be learned more comprehensively through the hierarchical BiLSTM network.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for extracting Chinese entities according to an embodiment of the present invention;
FIG. 2 is another schematic flow chart of a method for extracting Chinese entities according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a Chinese entity extraction apparatus according to an embodiment of the present invention;
fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Fig. 1 shows a schematic flow chart of the Chinese entity extraction method provided in this embodiment, including:
and S11, segmenting the target source sentence based on the punctuation marks to obtain clauses.
In the embodiment of the present invention, the target source sentence is a sentence of an entity to be extracted.
In the embodiment of the present invention, the entity may appear in different positions of the target source sentence, and different positions require different information for identifying the entity; that is, the context information influences entity recognition to different degrees. To identify the entity accurately, the context information of the sentence must be fully considered during recognition. Therefore, in the embodiment of the present invention, the target source sentence is not directly input into a BiLSTM (Bi-directional Long Short-Term Memory) network as a whole sentence; instead, the target source sentence is segmented at its punctuation marks into clauses, which then pass through the BiLSTM network, so that the semantic information of the target source sentence can be learned more comprehensively.
In the embodiment of the present invention, specifically, as shown in fig. 2, take the field of food safety as an example, with the target source sentence "Xu Yufang shop, knows its baking powder." The target source sentence is segmented at the comma and the full stop, yielding two clauses: "Xu Yufang shop" and "knows its baking powder".
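To make the segmentation step concrete, the following is a minimal sketch; the delimiter set and helper names are illustrative assumptions, not the patented implementation:

```python
import re

# Assumed set of clause-boundary punctuation (Chinese and ASCII); illustrative only.
CLAUSE_DELIMITERS = r"[，。；！？、,.;!?]"

def split_into_clauses(sentence):
    """Segment a target source sentence into clauses at punctuation marks (step S11)."""
    parts = re.split(CLAUSE_DELIMITERS, sentence)
    return [p.strip() for p in parts if p.strip()]

def add_end_marker(clauses):
    """Append the special clause-termination mark <end> after the last word of each clause."""
    return [clause + " <end>" for clause in clauses]

clauses = split_into_clauses("Xu Yufang shop, knows its baking powder.")
print(add_end_marker(clauses))  # ['Xu Yufang shop <end>', 'knows its baking powder <end>']
```

The `<end>` marker corresponds to the special clause-termination mark described later in the claims.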
And S12, vectorizing the words in the clauses to obtain word vectors.
In the embodiment of the invention, food safety field cases are processed with a deep learning neural network, whose input data requires a vectorized representation.
In the embodiment of the present invention, the words in the clauses obtained in S11 are vectorized to obtain word vectors. Specifically, each character in the clauses "Xu Yufang shop" and "knows its baking powder" is vectorized, yielding the word vector corresponding to each character.
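The Skip-gram model of Word2vec learns a vector for each word by predicting its context words from the center word; the (center, context) pair generation at its core can be sketched as follows (a toy illustration of the training-pair step only, not actual Word2vec code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in the Skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# One token per character, as in the character-level clauses of the running example
# (romanized stand-ins for the Chinese characters).
print(skipgram_pairs(["xu", "yu", "fang", "shop"], window=1))
# [('xu', 'yu'), ('yu', 'xu'), ('yu', 'fang'), ('fang', 'yu'), ('fang', 'shop'), ('shop', 'fang')]
```

In practice these pairs feed a shallow network whose input weights become the word vectors.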
S13, determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM.
In the embodiment of the invention, the word vectors are first input into the first bidirectional long-short term memory network BiLSTM of the hierarchy; the output of the first BiLSTM is then input into the second bidirectional long-short term memory network BiLSTM of the hierarchy; and finally the outputs of the first BiLSTM and the second BiLSTM are respectively input into the long-short term memory network LSTM, which determines the probability matrix of the labels corresponding to each word.
In the embodiment of the invention, the hierarchical bidirectional long-short term memory network BiLSTM is connected with the long-short term memory network LSTM. It should be noted that word vectors annotated with labels are used as a training set to train the connected hierarchical BiLSTM and LSTM networks.
And S14, inputting the probability matrix into a conditional random field model CRF to obtain the label with the maximum probability in the labels corresponding to each word.
In the embodiment of the present invention, the probability matrix obtained in S13 is used as an input of a CRF (Conditional Random Field) model. And the output of the CRF model is the label with the maximum probability in the labels respectively corresponding to each word.
And S15, extracting Chinese entities consisting of words corresponding to the labels with the maximum probability.
In the embodiment of the invention, a Chinese entity is composed of one or more words.
In the embodiment of the invention, the Chinese entity consisting of the word corresponding to the label with the maximum probability is extracted.
The embodiment of the invention divides the target source sentence into clauses, which facilitates subsequently learning the expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; through the conditional random field model CRF, the label with the maximum probability among the labels corresponding to each word is determined, and the Chinese entity composed of the words corresponding to the maximum-probability labels is extracted, so that the accuracy of entity recognition is improved.
Further, on the basis of the above method embodiment, the segmenting a target source sentence based on punctuation marks to obtain clauses includes:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
In the embodiment of the invention, the target source sentence is segmented at punctuation marks to obtain the clauses. Assuming the target source sentence is x, segmentation yields i clauses x_1, x_2, …, x_i, where the words of clause x_1 are denoted x_1 = (x_{1,1}, x_{1,2}, …, x_{1,j}); likewise, the words of clause x_i are denoted x_i = (x_{i,1}, x_{i,2}, …, x_{i,j}). A special mark, e.g. <end>, is appended after each clause x_1, x_2, …, x_i; the special mark represents clause termination.
The embodiment of the invention divides the target source sentence into clauses, which facilitates subsequently learning the expression of the semantics within clauses at the word level and of the semantics among clauses at the clause level; a special mark is added after the last word of each clause to distinguish different clauses.
Further, on the basis of the above method embodiment, the vectorizing the words in the clause to obtain a word vector includes:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
In the embodiment of the invention, the words in the clauses x_1, x_2, …, x_i are vectorized with the Skip-gram model of Word2vec to obtain the word vectors.
The embodiment of the invention vectorizes the words in the clauses to obtain the word vectors, ensuring a well-formed input for the BiLSTM network.
Further, on the basis of the above method embodiment, the determining, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, of the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM, wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM, includes:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
In the embodiment of the present invention, "first" and "second" in the first bidirectional long-short term memory network BiLSTM and the second bidirectional long-short term memory network BiLSTM are used to distinguish two different BiLSTMs. A forward LSTM and a backward LSTM are combined into a BiLSTM. The hierarchical BiLSTM is composed of an intra-clause BiLSTM and an inter-clause BiLSTM, i.e. the first bidirectional long-short term memory network BiLSTM and the second bidirectional long-short term memory network BiLSTM.
In the embodiment of the invention, the word vectors are input into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors, and the clause semantic vectors are input into the second bidirectional long-short term memory network BiLSTM to obtain the target source sentence semantic vector. Specifically, the hierarchical BiLSTM network is composed of the first and second BiLSTMs, and it replaces the hidden node of an ordinary recurrent neural network with a memory unit. The memory unit is controlled by 3 gates: a forget gate, an input gate, and an output gate. The forget gate, denoted f, determines which information in the cell state is discarded and which is retained: the information from the previous hidden state and the current input are passed together to a sigmoid function whose output lies between 0 and 1; the closer to 0, the more the information is discarded, and the closer to 1, the more it is retained. The input gate, denoted i, determines which information is added to the cell state; it likewise receives the previous hidden state and the current input. The output gate, denoted o, is multiplied element-wise by the intermediate state value at the current moment to produce the final output of the memory unit.
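The three gates can be illustrated with a minimal single-step LSTM cell using scalar states; the weights below are arbitrary toy values, whereas a real memory unit uses learned weight matrices over vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar states; w maps each gate to (w_x, w_h, b)."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])  # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])  # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])  # output gate
    c_tilde = math.tanh(w["c"][0] * x + w["c"][1] * h_prev + w["c"][2])
    c = f * c_prev + i * c_tilde   # cell state: keep old info, add new info
    h = o * math.tanh(c)           # hidden state: gated output of the memory unit
    return h, c

w = {"f": (0.5, 0.1, 0.0), "i": (0.5, 0.1, 0.0),
     "o": (0.5, 0.1, 0.0), "c": (1.0, 0.2, 0.0)}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

Running the same step over a sequence of word vectors, once forward and once backward, and concatenating the two hidden states gives the BiLSTM representation used here.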
In the embodiment of the present invention, the bottom BiLSTM network, i.e., the first bidirectional long-short term memory network BiLSTM, first learns the representation of the semantics within a clause at the word level, i.e., between words; the learned intra-clause semantic representation is then used as the input of the upper BiLSTM network, i.e., the second bidirectional long-short term memory network BiLSTM, which learns the representation of the semantics between clauses at the clause level. Specifically, the first bidirectional long-short term memory network BiLSTM within a clause is computed as follows:

$\overrightarrow{h}^{w}_{i,j}=\overrightarrow{\mathrm{LSTM}}\big(v(x_{i,j}),\overrightarrow{h}^{w}_{i,j-1}\big)$

$\overleftarrow{h}^{w}_{i,j}=\overleftarrow{\mathrm{LSTM}}\big(v(x_{i,j}),\overleftarrow{h}^{w}_{i,j+1}\big)$

$\overrightarrow{h}^{w}_{i,j}$ and $\overleftarrow{h}^{w}_{i,j}$ are the forward and backward hidden states of the j-th word in the i-th clause, and $v(x_{i,j})$ is the vectorized representation of the word $x_{i,j}$. Their concatenation $h^{w}_{i,j}=[\overrightarrow{h}^{w}_{i,j};\overleftarrow{h}^{w}_{i,j}]$ summarizes the semantic information of the words before and after the j-th word in the clause. More specifically, the forward hidden state $\overrightarrow{h}^{w}_{i,j}$ is calculated as:

An input gate: $i_{j}=\sigma\big(W^{w}_{i}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{i}\big)$

A forget gate: $f_{j}=\sigma\big(W^{w}_{f}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{f}\big)$

An output gate: $o_{j}=\sigma\big(W^{w}_{o}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{o}\big)$

A candidate state: $\tilde{c}_{j}=\tanh\big(W^{w}_{c}\,[\overrightarrow{h}^{w}_{i,j-1};v(x_{i,j})]+b_{c}\big)$

Word-level cell state: $c_{j}=f_{j}\odot c_{j-1}+i_{j}\odot\tilde{c}_{j}$

Hidden state at word level: $\overrightarrow{h}^{w}_{i,j}=o_{j}\odot\tanh(c_{j})$

where $W^{w}_{i},W^{w}_{f},W^{w}_{o},W^{w}_{c}$ are weight matrices, $\odot$ denotes element-wise multiplication, and $\sigma$ denotes the sigmoid activation function.

The last hidden states of the first bidirectional long-short term memory network BiLSTM in both directions are used together to represent the whole clause, in the form:

$h^{s}_{i}=[\overrightarrow{h}^{w}_{i,|x_{i}|};\overleftarrow{h}^{w}_{i,1}]$
the formula for using the second bidirectional long-short term memory network BilSTM between clauses is as follows:
Figure BDA0002372326130000109
Figure BDA00023723261300001010
Figure BDA00023723261300001011
and
Figure BDA00023723261300001012
are forward and backward hidden states between clauses,
Figure BDA00023723261300001013
to learn the semantics of the entire target source sentence.
In the embodiment of the invention, the clause semantic vectors and the target source sentence semantic vector are input into the long-short term memory network LSTM to obtain the probability matrix of the labels corresponding to each word. Specifically, the long-short term memory network LSTM is used as a decoder:

$L=\mathrm{LSTM}(H_{w},H_{s})$

During decoding, whenever a special mark is predicted, decoding of the words of the current clause is finished and decoding of the words of the next clause begins. The hidden state $s_{t}$ of the decoder is calculated as follows:

$s_{t}=o_{t}\odot\tanh(C_{t})$

$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot\tilde{C}_{t}$

$\tilde{C}_{t}=\tanh\big(w_{cs}s_{t-1}+w_{cy}v(y_{t-1})+(1-g_{t})w_{cw}H_{w}+g_{t}w_{cs}H_{s}\big)$

$g_{t}=\sigma\big(w_{gs}s_{t-1}+w_{gy}v(y_{t-1})+w_{gw}H_{w}+w_{gs}H_{s}\big)$

$f_{t}=\sigma\big(w_{fs}s_{t-1}+w_{fy}v(y_{t-1})+(1-g_{t})w_{fw}H_{w}+g_{t}w_{fs}H_{s}\big)$

$i_{t}=\sigma\big(w_{is}s_{t-1}+w_{iy}v(y_{t-1})+(1-g_{t})w_{iw}H_{w}+g_{t}w_{is}H_{s}\big)$

$o_{t}=\sigma\big(w_{os}s_{t-1}+w_{oy}v(y_{t-1})+(1-g_{t})w_{ow}H_{w}+g_{t}w_{os}H_{s}\big)$

where $w_{cs},w_{cy},w_{cw},w_{gs},w_{gy},w_{gw},w_{fs},w_{fy},w_{fw},w_{is},w_{iy},w_{iw},w_{os},w_{oy},w_{ow}$ and $W$ are weight matrices, and $f_{t},i_{t},o_{t},g_{t}$, computed from $s_{t-1},y_{t-1},H_{w},H_{s}$, are respectively the forget gate, the input gate, the output gate, and the gate weighing the semantic information of the hierarchical BiLSTM. For an input $X=(x_{1},x_{2},\ldots,x_{i})$, a predicted sequence $y=(y_{1},y_{2},\ldots,y_{t})$ is obtained, and the probability matrix of the predicted sequence label $y_{t}$ is defined as:

$P(y_{t}\mid s_{t-1},y_{t-1},H_{w},H_{s})\propto\exp\big(v(y_{t})^{T}Wm_{t}\big)$

where $v(y_{t})^{T}$ denotes the vectorized representation of the label $y_{t}$, $W\in\mathbb{R}^{2k\times n}$ is a vector matrix, $k$ is the number of hidden units of the BiLSTM, $n$ is the number of label types, and $m_{t}$ is the combined embedding vector obtained by concatenating $s_{t-1},y_{t-1},H_{w},H_{s}$.
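A distinctive element of this decoder is the gate g_t, which balances intra-clause information H_w against inter-clause information H_s; stripped of the learned weights, the mixing is a convex combination (scalar toy values for illustration only):

```python
def gated_mix(h_word, h_sent, g):
    """Blend word-level and sentence-level features: (1 - g) * h_word + g * h_sent.
    g comes from a sigmoid, so it lies in [0, 1]."""
    assert 0.0 <= g <= 1.0
    return (1.0 - g) * h_word + g * h_sent

# g near 1 -> the decoder relies mostly on inter-clause (sentence-level) semantics.
mixed = gated_mix(0.2, 0.8, 0.9)
```

This lets the decoder weigh word-level against clause-level context differently at each step, matching the observation that context influences entity recognition to different degrees.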
In the embodiment of the invention, a standard CRF layer is used on top of the model (as shown in fig. 2), and the output of the LSTM decoding layer is converted into the input of the CRF by a linear function:

$P=LW_{p}+b_{p}$

where $W_{p}\in\mathbb{R}^{2k\times n}$ and $b_{p}\in\mathbb{R}^{n}$ are parameters to be learned.

For an input $X$, the probability of outputting the best label sequence $y$ can be defined as:

$p(y\mid X)=\dfrac{\exp\big(s(X,y)\big)}{\sum_{\tilde{y}\in Y_{X}}\exp\big(s(X,\tilde{y})\big)}$

For each input $X$, the score of a possible annotation sequence $y$ is found as:

$s(X,y)=\sum_{t=0}^{m}A_{y_{t},y_{t+1}}+\sum_{t=1}^{m}P_{t,y_{t}}$

where $P_{t,y_{t}}$ is the non-normalized probability that the t-th word maps to the label $y_{t}$, and $A_{y_{t},y_{t+1}}$ is the transition probability from $y_{t}$ to $y_{t+1}$. When the number of label types (B-per, B-loc, ...) is n, the dimension of the transition probability matrix in the CRF model is (n+2) × (n+2), because a start position and an end position are additionally added; m is the length of the input clause. The label set is ('O', 'B-per', 'I-per', 'B-loc', 'I-loc', 'B-food', 'I-food', 'B-add', 'I-add', 'B-att', 'I-att'), and $Y_{X}$ denotes all possible annotation sequences over it. The s function in the numerator scores the correct label sequence, and the s function in the denominator scores each possible label sequence; the larger the value of p(y|X), the more accurate the prediction.

During model training, the loss function is defined as the negative log-likelihood:

$\mathcal{L}=-\log p(y\mid X)$

The loss function value is calculated and the network parameters are continuously updated until the iterations finish.
The embodiment of the invention can learn the semantic information of the sentence more comprehensively through the hierarchical BiLSTM network.
It should be noted that, in the embodiment of the present invention, the annotation labels adopt the BIO labelling scheme. "O" represents a general word that is not part of an entity; "B-per" represents the beginning word of a person name entity; "I-per" represents a middle word of a person name entity; "B-loc" represents the beginning word of a place name entity; "I-loc" represents a middle word of a place name entity; "B-food" represents the beginning word of a food name entity; "I-food" represents a middle word of a food name entity; "B-add" represents the beginning word of a food additive name entity; "I-add" represents a middle word of a food additive name entity; "B-att" represents the beginning word of a case attribute entity; "I-att" represents a middle word of a case attribute entity.
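Given the BIO labels chosen for each word, extracting the entities (step S15) reduces to collecting B-/I- runs of the same type; a minimal sketch, with romanized tokens standing in for the Chinese characters of the running example:

```python
def bio_to_entities(words, tags):
    """Collect (entity_text, entity_type) spans from BIO-labelled words."""
    entities, current, etype = [], [], None
    for w, t in zip(words, tags):
        if t.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [w], t[2:]
        elif t.startswith("I-") and current and t[2:] == etype:
            current.append(w)
        else:  # "O" or an I- tag that does not continue the open span
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

words = ["xu", "yu", "fang", "shop", "know", "its", "baking", "powder"]
tags  = ["B-per", "I-per", "I-per", "O", "O", "O", "B-food", "I-food"]
print(bio_to_entities(words, tags))
# [('xuyufang', 'per'), ('bakingpowder', 'food')]
```

Characters are joined without separators, as appropriate for Chinese text.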
It should be noted that, in the embodiment of the present invention, each word has a plurality of candidate labels, and the most probable label (i.e., the one in the best label sequence y mentioned above) is selected for each word by the CRF model. The words corresponding to the maximum-probability labels form the Chinese entities. As shown in fig. 2, the Chinese entities "Xu Yufang" and "baking powder" are screened out.
Fig. 3 is a schematic structural diagram of the Chinese entity extraction device provided in this embodiment, where the device includes: a segmentation module 31, a vectorization processing module 32, a determining module 33, an obtaining module 34 and an extracting module 35;
the segmentation module 31 is configured to segment the target source sentence based on the punctuation marks to obtain clauses;
the vectorization processing module 32 is configured to perform vectorization processing on the words in the clauses to obtain word vectors;
the determining module 33 is configured to determine, according to the word vectors and the hierarchical bidirectional long-short term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long-short term memory network LSTM; wherein the hierarchical bidirectional long-short term memory network BiLSTM comprises a first bidirectional long-short term memory network BiLSTM and a second bidirectional long-short term memory network BiLSTM;
the obtaining module 34 is configured to input the probability matrix into a conditional random field model CRF to obtain a label with a highest probability among labels respectively corresponding to each word;
the extracting module 35 is configured to extract a chinese entity composed of words corresponding to the label with the maximum probability.
Further, on the basis of the above device embodiment, the segmentation module 31 is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
Further, on the basis of the above device embodiment, the vectorization processing module 32 is specifically configured to:
and vectorizing the words in the clauses by using a Skip-gram model of Word2vec to obtain a Word vector.
Further, on the basis of the above device embodiment, the determining module 33 is specifically configured to:
inputting the word vectors into the first bidirectional long-short term memory network BiLSTM to obtain clause semantic vectors;
inputting the clause semantic vectors into the second bidirectional long-short term memory network BiLSTM to obtain a target source sentence semantic vector;
and inputting the clause semantic vector and the target source sentence semantic vector into the long-short term memory network LSTM to obtain a probability matrix of each label corresponding to each word.
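A shape-level numpy sketch of the first hierarchical level, not the patent's trained network: a minimal LSTM cell is run forward and backward over a clause's word vectors, and the two final hidden states are concatenated into a clause semantic vector. The second-level BiLSTM would be applied to the sequence of clause vectors in exactly the same way. All dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4  # word-vector and hidden dimensions (assumed)

def lstm_pass(xs, params):
    """Run a single-direction LSTM over xs; return the final hidden state."""
    Wx, Wh, b = params
    h = np.zeros(H)
    c = np.zeros(H)
    for x in xs:
        z = Wx @ x + Wh @ h + b                        # four gates stacked: i, f, o, g
        i, f, o = (1 / (1 + np.exp(-z[k*H:(k+1)*H])) for k in range(3))
        g = np.tanh(z[3*H:])
        c = f * c + i * g                              # cell-state update
        h = o * np.tanh(c)                             # hidden-state update
    return h

def make_params():
    return (rng.normal(size=(4*H, D)) * 0.1,
            rng.normal(size=(4*H, H)) * 0.1,
            np.zeros(4*H))

def bilstm_clause_vector(word_vectors, fwd, bwd):
    """First-level BiLSTM: concatenate forward and backward final states."""
    return np.concatenate([lstm_pass(word_vectors, fwd),
                           lstm_pass(word_vectors[::-1], bwd)])

words = rng.normal(size=(5, D))   # one clause of 5 word vectors
clause_vec = bilstm_clause_vector(words, make_params(), make_params())
print(clause_vec.shape)           # (2*H,) = (8,)
```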
The Chinese entity extraction device described in this embodiment may be used to execute the above method embodiments; the principle and technical effect are similar and are not described herein again.
Referring to fig. 4, the electronic device includes: a processor (processor)41, a memory (memory)42, and a bus 43;
wherein,
the processor 41 and the memory 42 communicate with each other through the bus 43;
the processor 41 is configured to call program instructions in the memory 42 to perform the methods provided by the above-described method embodiments.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments.
The above-described embodiments of the apparatus are merely illustrative. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A Chinese entity extraction method is characterized by comprising the following steps:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
vectorizing the words in the clauses to obtain word vectors;
determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, a probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM; wherein the hierarchical bidirectional long short-term memory network BiLSTM comprises a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM;
inputting the probability matrix into a conditional random field model CRF to obtain the label with the highest probability among the labels corresponding to each word;
extracting a Chinese entity composed of the words corresponding to the labels with the highest probability;
wherein the determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM, the hierarchical bidirectional long short-term memory network BiLSTM comprising a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM, comprises:
inputting the word vector into the first bidirectional long short-term memory network BiLSTM to obtain a clause semantic vector;
inputting the clause semantic vector into the second bidirectional long short-term memory network BiLSTM to obtain a target source sentence semantic vector;
inputting the clause semantic vector and the target source sentence semantic vector into the long short-term memory network LSTM to obtain the probability matrix of the labels corresponding to each word.
2. The method of claim 1, wherein the segmenting a target source sentence based on punctuation to obtain clauses comprises:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
3. The method of claim 1, wherein the vectorizing the words in the clauses to obtain a word vector comprises:
vectorizing the words in the clauses by using the Skip-gram model of Word2vec to obtain word vectors.
4. A Chinese entity extraction device, comprising: a segmentation module, a vectorization processing module, a determining module, an obtaining module and an extracting module;
the segmentation module is used for segmenting a target source sentence based on punctuation marks to obtain clauses;
the vectorization processing module is used for vectorizing the words in the clauses to obtain word vectors;
the determining module is used for determining, according to the word vector and the hierarchical bidirectional long short-term memory network BiLSTM, the probability matrix of the labels corresponding to each word obtained by the long short-term memory network LSTM; wherein the hierarchical bidirectional long short-term memory network BiLSTM comprises a first bidirectional long short-term memory network BiLSTM and a second bidirectional long short-term memory network BiLSTM;
the obtaining module is used for inputting the probability matrix into a conditional random field model CRF to obtain the label with the highest probability among the labels corresponding to each word;
the extraction module is used for extracting a Chinese entity composed of the words corresponding to the labels with the highest probability;
wherein the determining module is specifically configured to:
inputting the word vector into the first bidirectional long short-term memory network BiLSTM to obtain a clause semantic vector;
inputting the clause semantic vector into the second bidirectional long short-term memory network BiLSTM to obtain a target source sentence semantic vector;
inputting the clause semantic vector and the target source sentence semantic vector into the long short-term memory network LSTM to obtain the probability matrix of the labels corresponding to each word.
5. The Chinese entity extraction device of claim 4, wherein the segmentation module is specifically configured to:
based on punctuation marks, segmenting a target source sentence to obtain clauses;
adding a special mark after the last word of each clause;
wherein the special mark represents a clause termination.
6. The Chinese entity extraction device of claim 4, wherein the vectorization processing module is specifically configured to:
vectorizing the words in the clauses by using the Skip-gram model of Word2vec to obtain word vectors.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the Chinese entity extraction method of any of claims 1 to 3 when executing the program.
8. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the Chinese entity extraction method of any of claims 1 to 3.
CN202010054462.7A 2020-01-17 2020-01-17 Chinese entity extraction method and device Active CN111291550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010054462.7A CN111291550B (en) 2020-01-17 2020-01-17 Chinese entity extraction method and device

Publications (2)

Publication Number Publication Date
CN111291550A CN111291550A (en) 2020-06-16
CN111291550B true CN111291550B (en) 2021-09-03

Family

ID=71026284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010054462.7A Active CN111291550B (en) 2020-01-17 2020-01-17 Chinese entity extraction method and device

Country Status (1)

Country Link
CN (1) CN111291550B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215005A (en) * 2020-10-12 2021-01-12 小红书科技有限公司 Entity identification method and device
CN113326691B (en) * 2021-05-27 2023-07-28 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN109460473A (en) * 2018-11-21 2019-03-12 中南大学 The electronic health record multi-tag classification method with character representation is extracted based on symptom
CN109858041A (en) * 2019-03-07 2019-06-07 北京百分点信息科技有限公司 A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries
CN110232192A (en) * 2019-06-19 2019-09-13 中国电力科学研究院有限公司 Electric power term names entity recognition method and device
CN110348016A (en) * 2019-07-15 2019-10-18 昆明理工大学 Text snippet generation method based on sentence association attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11600194B2 (en) * 2018-05-18 2023-03-07 Salesforce.Com, Inc. Multitask learning as question answering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant