CN109710927B - Named entity identification method and device, readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN109710927B
Authority
CN
China
Prior art keywords
participle
real
conditional probability
target
determining
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811519563.6A
Other languages
Chinese (zh)
Other versions
CN109710927A (en
Inventor
贾弼然
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201811519563.6A priority Critical patent/CN109710927B/en
Publication of CN109710927A publication Critical patent/CN109710927A/en
Publication of CN109710927B publication Critical patent/CN109710927B/en

Abstract

The disclosure relates to a method and a device for identifying named entities, a readable storage medium, and an electronic device. The method comprises the following steps: determining all possible real participles corresponding to the $t$-th target participle $x_t$ in a text; for each real participle, determining a first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to the real participle, wherein $a_d$ characterizes the $d$-th real participle and $l_i$ characterizes the $i$-th participle state; determining, according to a second conditional probability $p(x_t \mid a_d)$ that each real participle corresponds to the target participle $x_t$ and the first conditional probability $p(a_d \mid l_i)$, a third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$; and performing named entity recognition on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$. Therefore, the accuracy and recall rate of named entity recognition are improved, and the problems of extra characters, missing characters, or wrongly written characters in the text recognition process can be effectively avoided.

Description

Named entity identification method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a method and an apparatus for identifying a named entity, a readable storage medium, and an electronic device.
Background
With the spread of artificial intelligence applications, natural language processing is receiving increasing attention. Named entity recognition is an important early step in natural language processing, and recognizing entities such as times, numbers, person names, place names, and organization names in text is of great significance in many research fields. At present, hidden Markov models (HMMs) are mostly used for named entity recognition, but problems often arise in the recognition process. For example, a transliterated entity in open-set text may be rendered with different characters, which causes considerable ambiguity and a high error rate in recognition; or text obtained from low-quality annotated or translated corpora may contain extra characters, missing characters, or wrongly written characters. Therefore, named entities in text cannot be accurately identified using existing HMM models.
Disclosure of Invention
In order to overcome the problems in the prior art, embodiments of the present disclosure provide a method and an apparatus for identifying a named entity, a readable storage medium, and an electronic device.
In order to achieve the above object, a first aspect of the present disclosure provides a method for identifying a named entity, including:
determining all possible real participles corresponding to the $t$-th target participle $x_t$ in the text;
for each real participle, determining a first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to the real participle, wherein $a_d$ characterizes the $d$-th real participle and $l_i$ characterizes the $i$-th participle state;
determining, according to a second conditional probability $p(x_t \mid a_d)$ that each real participle corresponds to the target participle $x_t$ and the first conditional probability $p(a_d \mid l_i)$, a third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$;
performing named entity recognition on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$.
Optionally, the determining, for each of the real participles, a first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to the real participle includes:
for each of the real participles, determining a fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to the real participle;
estimating, according to the fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to each of the real participles, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each of the real participles.
Optionally, the estimating, according to the fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to each of the real participles, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each of the real participles includes:
according to the following formulas (1) to (2), determining the
Figure BDA0001902881500000027
that makes $d(z_t, y_i)$ satisfy a preset condition as the first conditional probability that each participle state corresponds to each of the real participles:
Figure BDA0001902881500000021
Figure BDA0001902881500000022
wherein $D$ characterizes the total number of the real participles,
Figure BDA0001902881500000023
characterizes the fourth conditional probability that the target participle $x_t$ corresponds to the $d$-th real participle,
Figure BDA0001902881500000024
characterizes the first conditional probability that the $i$-th participle state corresponds to the $d$-th real participle,
Figure BDA0001902881500000025
characterizes the vector of fourth conditional probabilities that the target participle $x_t$ corresponds to each of the real participles,
Figure BDA0001902881500000026
characterizes the vector of first conditional probabilities that the $i$-th participle state corresponds to each of the real participles, and $d(z_t, y_i)$ characterizes the relative entropy of $z_t$ and $y_i$.
Optionally, the preset condition is that the loss function
Figure BDA0001902881500000031
is minimized; wherein $T_i$ characterizes the total number of target participles belonging to the $i$-th participle state $l_i$, $L$ characterizes the total number of the participle states, and
Figure BDA0001902881500000033
characterizes whether the $i$-th participle state is related to the target participle $x_t$: it is 1 if they are related and 0 otherwise.
Optionally, the determining, according to the second conditional probability $p(x_t \mid a_d)$ that each real participle corresponds to the target participle $x_t$ and the first conditional probability $p(a_d \mid l_i)$, the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$ includes:
determining, according to the following formula (3), the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$:
$$p(x_t \mid l_i) = \sum_{d=1}^{D} p(x_t \mid a_d)\, p(a_d \mid l_i) \tag{3}$$
wherein $D$ characterizes the total number of the real participles.
Optionally, the performing named entity recognition on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$ includes:
determining the participle state corresponding to the maximum third conditional probability as the named entity recognition result of the target participle $x_t$.
A second aspect of the present disclosure provides an apparatus for identifying a named entity, including:
a first determining module, configured to determine all possible real participles corresponding to the $t$-th target participle $x_t$ in the text;
a second determining module, configured to determine, for each of the real participles determined by the first determining module, a first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to the real participle, wherein $a_d$ characterizes the $d$-th real participle and $l_i$ characterizes the $i$-th participle state;
a third determining module, configured to determine, according to a second conditional probability $p(x_t \mid a_d)$ that each of the real participles corresponds to the target participle $x_t$ and the first conditional probability $p(a_d \mid l_i)$ determined by the second determining module, a third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$;
an identification module, configured to perform named entity recognition on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$.
Optionally, the second determining module includes:
a first determining submodule, configured to determine, for each of the real participles, a fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to the real participle;
an estimation submodule, configured to estimate, according to the fourth conditional probability $p(a_d \mid x_t)$ determined by the first determining submodule, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each of the real participles.
Optionally, the estimation sub-module comprises:
a second determining submodule, configured to determine, according to the following formulas (1) to (2), the
Figure BDA0001902881500000041
that makes $d(z_t, y_i)$ satisfy a preset condition as the first conditional probability that each participle state corresponds to each of the real participles:
Figure BDA0001902881500000042
Figure BDA0001902881500000043
wherein $D$ characterizes the total number of the real participles,
Figure BDA0001902881500000044
characterizes the fourth conditional probability that the target participle $x_t$ corresponds to the $d$-th real participle,
Figure BDA0001902881500000045
characterizes the first conditional probability that the $i$-th participle state corresponds to the $d$-th real participle,
Figure BDA0001902881500000046
characterizes the vector of fourth conditional probabilities that the target participle $x_t$ corresponds to each of the real participles,
Figure BDA0001902881500000047
characterizes the vector of first conditional probabilities that the $i$-th participle state corresponds to each of the real participles, and $d(z_t, y_i)$ characterizes the relative entropy of $z_t$ and $y_i$.
Optionally, the preset condition is that the loss function
Figure BDA0001902881500000051
is minimized; wherein $T_i$ characterizes the total number of target participles belonging to the $i$-th participle state $l_i$, $L$ characterizes the total number of the participle states, and
Figure BDA0001902881500000053
characterizes whether the $i$-th participle state is related to the target participle $x_t$: it is 1 if they are related and 0 otherwise.
Optionally, the third determining module includes:
a third determining submodule, configured to determine, according to the following formula (3), the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$:
$$p(x_t \mid l_i) = \sum_{d=1}^{D} p(x_t \mid a_d)\, p(a_d \mid l_i) \tag{3}$$
wherein $D$ characterizes the total number of the real participles.
Optionally, the identification module comprises:
a fourth determining submodule, configured to determine the participle state corresponding to the maximum third conditional probability as the named entity recognition result of the target participle $x_t$.
The third aspect of the present disclosure also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
The fourth aspect of the present disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
According to the above technical solution, when named entity recognition is performed on a target participle, all possible real participles corresponding to the target participle, the first conditional probability that each participle state corresponds to each real participle, and the second conditional probability that each real participle corresponds to the target participle are all taken into account. The resulting third conditional probability that each participle state corresponds to the target participle therefore represents the relationship among the target participle, the real participles, and the participle states, so that the accuracy and recall rate of named entity recognition are improved, and the problems of extra characters, missing characters, or wrongly written characters in the text recognition process can be effectively avoided.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a method for named entity identification in accordance with an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method for named entity identification in accordance with another exemplary embodiment.
FIG. 3 is a block diagram illustrating an apparatus for identifying named entities in accordance with an exemplary embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
First, an HMM model will be explained.
The HMM model consists of five parts:
(1) The number of states $L$ in the model, i.e. the size of the role labeling state set.
(2) The number of different symbols (also called participles) that each participle state may output, $T_i$, i.e. the total number of participles that the $i$-th role labeling state may output.
(3) The state transition probability matrix $A = \{a_{ij}\}$, i.e. the matrix of transition probabilities between the role labeling states, defined as follows:
$$a_{ij} = P(l_j \mid l_i), \quad 1 \le i, j \le L$$
$$a_{ij} \ge 0$$
$$\sum_{j=1}^{L} a_{ij} = 1$$
wherein $a_{ij}$ characterizes the probability that the role labeling state transitions from state $l_i$ to state $l_j$, $i$ denotes the $i$-th role labeling state, and $j$ denotes the $j$-th role labeling state.
(4) The probability distribution matrix $B = \{b_i(t)\}$ of participle $x_t$ occurring in state $l_i$, wherein the probability distribution matrix, also called the emission probability matrix, characterizes the relationship between states and participles:
$$b_i(t) = P(x_t \mid l_i), \quad 1 \le i \le L,\ 1 \le t \le T_i$$
$$b_i(t) \ge 0$$
$$\sum_{t=1}^{T_i} b_i(t) = 1$$
wherein $t$ denotes the $t$-th participle, and $T_i$ characterizes the total number of participles belonging to the $i$-th role labeling state $l_i$.
(5) The initial state probability distribution $\pi = \{\pi_i\}$, i.e. the probability that a sequence starts in each role, defined as follows:
$$\pi_i = P(l_i), \quad 1 \le i \le L$$
$$\pi_i \ge 0$$
$$\sum_{i=1}^{L} \pi_i = 1$$
In summary, the five-tuple of the HMM model can be denoted as $\mu = (L, T, A, B, \pi)$.
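As an illustration only, the five components can be held in a small container. The following minimal Python sketch is not part of the disclosure: the class name `HMMParams`, its field names, and the toy probabilities are assumptions chosen for the example. It stores the transition matrix, the emission matrix, and the initial distribution as dictionaries and checks the non-negativity and sum-to-one constraints.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMMParams:
    """Hypothetical container for the HMM five-tuple (names are assumptions)."""
    states: List[str]                         # the L role labeling states
    transition: Dict[str, Dict[str, float]]   # A: transition[l_i][l_j] = P(l_j | l_i)
    emission: Dict[str, Dict[str, float]]     # B: emission[l_i][x_t] = P(x_t | l_i)
    initial: Dict[str, float]                 # pi: initial[l_i] = P(l_i)

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check that every distribution is non-negative and sums to 1."""
        rows = [self.initial]
        rows += [self.transition[s] for s in self.states]
        rows += [self.emission[s] for s in self.states]
        return all(
            min(row.values()) >= 0.0
            and math.isclose(sum(row.values()), 1.0, abs_tol=tol)
            for row in rows
        )


# Toy values, chosen only for illustration.
toy = HMMParams(
    states=["surname", "given_name"],
    transition={
        "surname": {"surname": 0.1, "given_name": 0.9},
        "given_name": {"surname": 0.5, "given_name": 0.5},
    },
    emission={
        "surname": {"Li": 0.9, "Ming": 0.1},
        "given_name": {"Li": 0.2, "Ming": 0.8},
    },
    initial={"surname": 0.8, "given_name": 0.2},
)
print(toy.is_valid())
```

In this sketch, the number of symbols per state corresponds to `len(emission[state])`.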
When the HMM model is used for named entity recognition, the text is first input into the HMM model and coarsely segmented. The coarse segmentation result is then compared with the training corpus and labeled with roles, and the values required by the Viterbi algorithm, namely the values of the parameters in the five-tuple of the HMM model, are calculated. Recognition is then performed with the Viterbi algorithm on the basis of the calculated values. That is, the probability distribution matrix of participle $x_t$ occurring in state $l_i$ is calculated, wherein $1 \le i \le L$ and $1 \le t \le M$, and the text is recognized according to the probability distribution matrix. Illustratively, with reference to an existing role label set, the state $l_i$ may be, for example, a surname, a two-character given name, the context preceding a person name, and so on. The role label set has no absolute standard and needs to be adjusted on the basis of the prior corpus and expert knowledge.
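For readers unfamiliar with the decoding step, the following is a minimal Python sketch of the textbook Viterbi algorithm over dictionary-based HMM parameters. It is not taken from the disclosure: the function name `viterbi`, the argument layout, and the toy probabilities are assumptions, and participles missing from a state's emission table are simply given probability zero.

```python
from typing import Dict, List


def viterbi(
    observations: List[str],
    states: List[str],
    initial: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
    emission: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state sequence for the observed participles."""
    # Each column maps a state to (best path probability, predecessor state).
    first = observations[0]
    columns = [
        {s: (initial[s] * emission[s].get(first, 0.0), None) for s in states}
    ]
    for x in observations[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            # Best predecessor r for state s at this position.
            best_prob, best_prev = max(
                (prev[r][0] * transition[r][s], r) for r in states
            )
            column[s] = (best_prob * emission[s].get(x, 0.0), best_prev)
        columns.append(column)

    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy parameters, chosen only for illustration.
states = ["surname", "given_name"]
initial = {"surname": 0.8, "given_name": 0.2}
transition = {
    "surname": {"surname": 0.1, "given_name": 0.9},
    "given_name": {"surname": 0.5, "given_name": 0.5},
}
emission = {
    "surname": {"Li": 0.9, "Ming": 0.1},
    "given_name": {"Li": 0.2, "Ming": 0.8},
}
print(viterbi(["Li", "Ming"], states, initial, transition, emission))
```

A practical implementation would usually work with log probabilities to avoid underflow on long texts.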
In summary, the accuracy of named entity recognition in the text is determined by the accuracy of the probability distribution matrix of participle $x_t$ occurring in state $l_i$ determined above. Therefore, in order to improve the accuracy of named entity recognition in the text, the accuracy of the probability distribution matrix of participle $x_t$ occurring in state $l_i$ calculated in the HMM model must be ensured.
Next, a method for identifying a named entity provided in the present disclosure will be described. Referring to fig. 1, fig. 1 is a flow chart illustrating a method for identifying a named entity according to an exemplary embodiment. As shown in fig. 1, the method may include the following steps.
In step 11, all possible real participles corresponding to the $t$-th target participle $x_t$ in the text are determined.
After the text is segmented, a plurality of target participles are obtained, and for each target participle, all possible real participles corresponding to it are determined. A target participle may be a single character or a word composed of multiple characters, which is not specifically limited in this disclosure.
In the present disclosure, for convenience of explanation, a single target participle is taken as an example. Illustratively, as described in step 11, for the $t$-th target participle $x_t$ in the text, all possible real participles corresponding to the target participle $x_t$ are determined. Specifically, once the target participle $x_t$ is known, all possible real participles encountered in practice when recognizing the target participle $x_t$ can be obtained by statistics over historical text recognition results.
For example, suppose the text is "Beatrice", the name of a singer. The correct transliteration is "Beiyacuisi", but the name is sometimes transliterated as "Biatelisi", so that the target participles obtained include "Bi" and "Atelisi". For the target participle "Bi", statistics show that the corresponding real participles are "Bei" and "Bi"; for the target participle "Atelisi", the corresponding real participles are "Yacuisi" and "Atelisi".
For another example, if the text is "Peking Union Hospital Department of Neurology", the target participles are "Peking", "Union Hospital", and "Department of Neurology", and statistics show that the corresponding real participles are likewise "Peking", "Union Hospital", and "Department of Neurology".
In step 12, for each real participle, a first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to the real participle is determined.
Here, $a_d$ characterizes the $d$-th real participle and $l_i$ characterizes the $i$-th participle state; the participle states are determined from the People's Daily annotated corpus and stored in the HMM model in advance.
In step 13, according to the second conditional probability $p(x_t \mid a_d)$ that each real participle corresponds to the target participle $x_t$ and the first conditional probability $p(a_d \mid l_i)$, a third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$ is determined.
The second conditional probability $p(x_t \mid a_d)$ characterizes the relationship between the target participle and the real participle and can be determined from historical text recognition results. Specifically, as described above, once all possible real participles corresponding to the target participle $x_t$ are known, the probability that the target participle $x_t$ occurs given that each real participle occurs, i.e. the second conditional probability $p(x_t \mid a_d)$, is calculated. Then, according to the second conditional probability $p(x_t \mid a_d)$ and the first conditional probability $p(a_d \mid l_i)$ determined in step 12, the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$ is determined.
The second conditional probability $p(x_t \mid a_d)$ can be expressed as
$$p(x_t \mid a_d) = \frac{w(x_t, a_d)}{w(a_d)}$$
wherein $w(x_t, a_d)$ characterizes the number of times the target participle $x_t$ occurs when the real participle $a_d$ occurs, and $w(a_d)$ characterizes the number of times the real participle $a_d$ occurs.
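The count ratio can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function name `second_conditional`, the representation of the historical recognition results as (target participle, real participle) pairs, and the toy history are all hypothetical, and $w(a_d)$ is taken to be the number of pairs whose real participle is $a_d$.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def second_conditional(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate p(x_t | a_d) = w(x_t, a_d) / w(a_d) from (target, real) pairs."""
    pair_counts = Counter(pairs)          # w(x_t, a_d)
    real_counts: Counter = Counter()      # w(a_d), summed over all targets
    for (_, real), n in pair_counts.items():
        real_counts[real] += n
    return {
        (target, real): n / real_counts[real]
        for (target, real), n in pair_counts.items()
    }


# Hypothetical history: real participle "a1" was observed as "x1" 3 times
# and as "a1" 7 times; real participle "x1" was always observed as "x1".
history = [("x1", "a1")] * 3 + [("a1", "a1")] * 7 + [("x1", "x1")] * 5
probs = second_conditional(history)
print(probs[("x1", "a1")])
```

For each real participle, the estimated probabilities over its observed target participles sum to 1.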
In step 14, named entity recognition is performed on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$.
After the third conditional probability $p(x_t \mid l_i)$ is determined, named entity recognition is performed on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$.
It should be noted that the above steps 11 to 14 may be performed on each target participle in the text, so as to implement named entity recognition on each target participle in the text.
By adopting the above technical solution, when named entity recognition is performed on a target participle, all possible real participles corresponding to the target participle, the first conditional probability that each participle state corresponds to each real participle, and the second conditional probability that each real participle corresponds to the target participle are all taken into account. The resulting third conditional probability that each participle state corresponds to the target participle therefore represents the relationship among the target participle, the real participles, and the participle states, so that the accuracy and recall rate of named entity recognition are improved, and the problems of extra characters, missing characters, or wrongly written characters in the text recognition process can be effectively avoided.
After the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each real participle is determined, the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$ can be determined by the total probability formula. Specifically, step 13 may be implemented as follows: the third conditional probability $p(x_t \mid l_i)$ that each participle state corresponds to the target participle $x_t$ is determined according to the following formula (3):
$$p(x_t \mid l_i) = \sum_{d=1}^{D} p(x_t \mid a_d)\, p(a_d \mid l_i) \tag{3}$$
wherein $D$ characterizes the total number of the real participles.
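The sum over real participles can be sketched in Python as follows. The function name `third_conditional`, the dictionary layout of the two input probability tables, and the toy numbers are assumptions made for this illustration only; pairs missing from a table are treated as probability zero.

```python
from typing import Dict, List, Tuple


def third_conditional(
    target: str,
    reals: List[str],
    states: List[str],
    second: Dict[Tuple[str, str], float],   # second[(x_t, a_d)] = p(x_t | a_d)
    first: Dict[Tuple[str, str], float],    # first[(a_d, l_i)] = p(a_d | l_i)
) -> Dict[str, float]:
    """Compute p(x_t | l_i) = sum over d of p(x_t | a_d) * p(a_d | l_i)."""
    return {
        state: sum(
            second.get((target, real), 0.0) * first.get((real, state), 0.0)
            for real in reals
        )
        for state in states
    }


# Hypothetical values: target "x" has two candidate real participles.
second = {("x", "a1"): 0.3, ("x", "a2"): 1.0}
first = {
    ("a1", "surname"): 0.05,
    ("a2", "surname"): 0.0,
    ("a1", "given_name"): 0.0,
    ("a2", "given_name"): 0.01,
}
third = third_conditional("x", ["a1", "a2"], ["surname", "given_name"], second, first)
print(third)
```

With these toy values, the state "surname" receives a nonzero score for "x" only through the real participle "a1", which is the effect formula (3) is meant to capture.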
Thus, the determined third conditional probability that each participle state corresponds to the target participle $x_t$ represents the relationship among the target participle, the real participles, and the participle states, so that the accuracy and recall rate of named entity recognition are improved, and the problems of extra characters, missing characters, or wrongly written characters in the text recognition process can be effectively avoided.
In addition, after the third conditional probability $p(x_t \mid l_i)$ is determined, in one possible embodiment, named entity recognition is performed on the target participle $x_t$ according to the third conditional probability $p(x_t \mid l_i)$. In another preferred embodiment, in order to further improve the accuracy and recall rate of named entity recognition, step 14 may be implemented as follows: the participle state corresponding to the maximum third conditional probability is determined as the named entity recognition result of the target participle $x_t$.
Specifically, the third conditional probability comprises the probabilities that a plurality of participle states correspond to the target participle $x_t$. The probability that the target participle $x_t$ occurs usually differs from one participle state to another, and the target participle is most likely to occur in the participle state corresponding to the maximum third conditional probability. Therefore, in the present disclosure, the participle state corresponding to the maximum third conditional probability may be determined as the named entity recognition result of the target participle $x_t$, so as to further improve the accuracy of named entity recognition of the target participle $x_t$.
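Selecting the state with the maximum third conditional probability amounts to an argmax. A minimal Python sketch follows; the function name `recognize` and the toy scores are hypothetical, and ties are resolved in favour of the state that appears first in the dictionary, which is a choice of this sketch rather than something the disclosure specifies.

```python
from typing import Dict


def recognize(third: Dict[str, float]) -> str:
    """Return the participle state l_i with the largest p(x_t | l_i)."""
    # max() returns the first maximal key in iteration order on ties.
    return max(third, key=third.get)


# Hypothetical third conditional probabilities for one target participle.
scores = {"surname": 0.015, "given_name": 0.01, "context": 0.002}
print(recognize(scores))
```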
For example, for the text "Beatrice", suppose the target participle obtained by transliteration is "Bi", and the real participles corresponding to the target participle "Bi" are "Bei" and "Bi". In the prior art, the real participle "Bei" is not input into the HMM model, and "Bi" has never had the participle state "surname" in the historical data, so "Bi" cannot be recognized as a "surname" when the target participle "Bi" is recognized. Similarly, for the target participle "Atelisi", the real participle "Yacuisi" is not input into the HMM model, and "Atelisi" has never had the participle state "given name" in the historical data, so "Atelisi" cannot be recognized as a "given name" when the target participle "Atelisi" is recognized. Thus, in the prior art, if the input text is "Beatrice", the transliteration "Biatelisi" cannot be recognized as a person name.
In the present scheme, the real participle "Bei" is input into the HMM model, so that all possible real participles encountered when recognizing the target participle "Bi" are "Bei" and "Bi". The first conditional probabilities that the participle state "surname" corresponds to the real participles "Bei" and "Bi" are determined as $p(\text{Bei} \mid \text{surname})$ and $p(\text{Bi} \mid \text{surname})$, respectively, and the second conditional probabilities that each real participle corresponds to the target participle are $p(\text{Bi} \mid \text{Bei})$ and $p(\text{Bi} \mid \text{Bi})$, respectively. The third conditional probability that the participle state "surname" corresponds to the target participle "Bi" is therefore $p(\text{Bi} \mid \text{Bei})\, p(\text{Bei} \mid \text{surname}) + p(\text{Bi} \mid \text{Bi})\, p(\text{Bi} \mid \text{surname})$. As described above, since "Bi" has never had the participle state "surname" in the historical data, the first conditional probability $p(\text{Bi} \mid \text{surname})$ is zero, whereas "Bei" does correspond to the participle state "surname" and the value of $p(\text{Bei} \mid \text{surname})$ is comparatively large. Adding the term $p(\text{Bi} \mid \text{Bei})\, p(\text{Bei} \mid \text{surname})$ to the term $p(\text{Bi} \mid \text{Bi})\, p(\text{Bi} \mid \text{surname})$ therefore increases the resulting third conditional probability for the target participle "Bi", that is, it greatly increases the possibility of recognizing "Bi" as a "surname". By the same principle, the third conditional probability that the participle state "given name" corresponds to the target participle "Atelisi" is also increased, which greatly increases the possibility of recognizing "Atelisi" as a "given name". In summary, if the input text is "Beatrice", the transliteration "Biatelisi" can be recognized as a person name.
In summary, compared with the prior art, in the present disclosure, the probability that the word segmentation state corresponds to the target word segmentation is influenced by using the real word segmentation, so as to further improve the accuracy of the named entity identification of the target word segmentation.
Since the participle state is a hidden parameter in the HMM model, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to a real participle cannot be determined directly from historical text recognition results. Therefore, in the present disclosure, the first conditional probability may be estimated from a fourth conditional probability that is determined from historical text recognition results. Specifically, as shown in fig. 2, step 12 may include the following steps.
In step 121, for each real participle, a fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to the real participle is determined.
In this disclosure, the fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to a real participle may be referred to as a posterior probability, and this posterior probability can be obtained statistically. As described above, once the target participle $x_t$ is known, all real participles corresponding to the target participle $x_t$ can be determined from historical text recognition results, and the fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to each real participle can then be determined.
The fourth conditional probability $p(a_d \mid x_t)$ can be expressed as
$$p(a_d \mid x_t) = \frac{w(a_d, x_t)}{w(x_t)}$$
wherein $w(a_d, x_t)$ characterizes the number of times the real participle $a_d$ occurs when the target participle $x_t$ occurs, and $w(x_t)$ characterizes the number of times the target participle $x_t$ occurs.
In step 122, according to the fourth conditional probability $p(a_d \mid x_t)$ that the target participle $x_t$ corresponds to each real participle, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each real participle is estimated.
The fourth conditional probability characterizes the relationship between the real participle and the target participle and is determined from historical text recognition results, so it is comparatively accurate. Therefore, in the present disclosure, the first conditional probability $p(a_d \mid l_i)$ that each participle state corresponds to each real participle can be estimated from the accurate fourth conditional probability, which ensures the accuracy of the estimated first conditional probability $p(a_d \mid l_i)$.
For example, the KL divergence (Kullback-Leibler divergence), also called relative entropy, measures the difference between two probability distributions over the same event space. The first conditional probability $p(a_d \mid l_i)$ closest to the fourth conditional probability can therefore be determined according to the relative entropy formula.
Specifically, step 122 may be implemented as follows: according to formulas (1) and (2) below, the y_i^d that makes d(z_t, y_i) satisfy a preset condition is determined as the first conditional probability that each participle state corresponds to each real participle:
Figure BDA0001902881500000131
Figure BDA0001902881500000132
where D denotes the total number of real participles; z_t^d denotes the fourth conditional probability that the target participle x_t corresponds to the d-th real participle; y_i^d denotes the first conditional probability that the i-th participle state corresponds to the d-th real participle; z_t denotes the vector of fourth conditional probabilities that the target participle x_t corresponds to each real participle; y_i denotes the vector of first conditional probabilities that the i-th participle state corresponds to each real participle; and d(z_t, y_i) denotes the relative entropy of z_t and y_i.
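As an illustration of the relative entropy between the two probability vectors, the following Python sketch uses the standard Kullback-Leibler definition, the sum over d of z_d * log(z_d / y_d). The formula images are not reproduced in this text, so treating formula (1) as this standard form is an assumption; the function name, the eps guard, and the sample vectors are hypothetical.

```python
import math


def relative_entropy(z, y, eps=1e-12):
    """Standard KL divergence d(z, y) = sum_d z_d * log(z_d / y_d)."""
    if len(z) != len(y):
        raise ValueError("z and y must have the same length D")
    total = 0.0
    for z_d, y_d in zip(z, y):
        if z_d > 0.0:  # terms with z_d == 0 contribute nothing
            # eps guards against division by zero when y_d is 0 (assumed safeguard)
            total += z_d * math.log(z_d / max(y_d, eps))
    return total


z_t = [0.7, 0.2, 0.1]  # hypothetical fourth conditional probabilities p(a_d | x_t)
y_i = [0.6, 0.3, 0.1]  # hypothetical first conditional probabilities p(a_d | l_i)
divergence = relative_entropy(z_t, y_i)
```

The divergence is zero when the two vectors are identical and positive otherwise, which is why a smaller value indicates a first conditional probability closer to the fourth conditional probability.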
Each real participle determined above is an independent unit but depends on its context. Moreover, the real participles that appear in each participle state of the HMM are not fixed; each participle state may give rise to many different real participles. Thus, in this disclosure, the first conditional probability p(a_d | l_i) must also satisfy the following conditions:
0 ≤ p(a_d | l_i) ≤ 1
Σ_{d=1}^{D} p(a_d | l_i) = 1
Furthermore, in one possible implementation, the preset condition may be a difference between the first conditional probability distribution and the fourth conditional probability distribution that is acceptable to the user. The difference may be a default value or a value set by the user, and in either case it is greater than zero.
Considering that, in practice, many different characters may occur with the same probability, and that each participle state may include multiple target participles when a text is segmented and labeled with participle states, another preferred embodiment, which improves the accuracy of named entity recognition over the whole text, is as follows: the preset condition is that the loss function
Figure BDA0001902881500000141
is minimal, where T_i denotes the total number of target participles belonging to the i-th participle state l_i, L denotes the total number of participle states, and
Figure BDA0001902881500000142
indicates whether the i-th participle state is related to the target participle x_t: it is 1 if they are related and 0 otherwise.
Thus, the problem of finding the y_i^d that makes d(z_t, y_i) satisfy the preset condition is converted into the problem of solving formula (4), that is, into an optimization problem. The problem is then solved according to formula (4) and formula (2), and the optimal solution obtained is
Figure BDA0001902881500000145
Figure BDA0001902881500000146
Solving formula (4) gives: for any
Figure BDA0001902881500000147
the following holds:
Figure BDA0001902881500000148
where, as described above,
Figure BDA0001902881500000149
indicates whether the state l_i is related to the observed value x_t: if they are related,
Figure BDA00019028815000001410
and otherwise
Figure BDA00019028815000001411
Figure BDA00019028815000001412
and does not participate in the calculation. Therefore, for any
Figure BDA00019028815000001413
we have:
Figure BDA00019028815000001414
where T_i denotes the total number of target participles belonging to the i-th participle state l_i.
Furthermore, the obtained
Figure BDA00019028815000001415
can be verified by a theorem to determine whether formula (5) or formula (6) above is the optimal solution of formula (4). Proving by a theorem that the solution
Figure BDA00019028815000001416
is the optimal solution of formula (4) belongs to the prior art and is not described here.
With the above technical solution, solving for the first conditional probability p(a_d | l_i) via the relative entropy formula becomes a convex optimization problem. It can be proved by a theorem that this problem has a strict local minimum point, and that solution is determined as the first conditional probability p(a_d | l_i).
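The optimization described above can be illustrated with a small Python sketch. The images for formulas (4) to (6) are not reproduced in this text, so the sketch assumes that the loss for a participle state is the sum of the relative entropies d(z_t, y_i) over the target participles related to that state, subject to the entries of y_i summing to one. Under that assumption, a Lagrange multiplier argument gives the arithmetic mean of the related z_t vectors as the minimizer. The function names and sample data are hypothetical.

```python
import math


def kl(z, y):
    """Relative entropy sum_d z_d * log(z_d / y_d); assumes y_d > 0 wherever z_d > 0."""
    return sum(z_d * math.log(z_d / y_d) for z_d, y_d in zip(z, y) if z_d > 0.0)


def estimate_state_distribution(z_vectors):
    """Mean of the related z_t vectors, the assumed minimizer of sum_t kl(z_t, y)."""
    count = len(z_vectors)        # plays the role of T_i
    dim = len(z_vectors[0])       # plays the role of D
    return [sum(z[d] for z in z_vectors) / count for d in range(dim)]


def loss(z_vectors, y):
    """Assumed loss for one state: sum of relative entropies over related targets."""
    return sum(kl(z, y) for z in z_vectors)


# Hypothetical fourth conditional probability vectors of three target
# participles that are all related to the same participle state.
z_vectors = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
y_star = estimate_state_distribution(z_vectors)
perturbed = [0.55, 0.35, 0.10]   # another valid distribution, for comparison
```

Comparing loss(z_vectors, y_star) with loss(z_vectors, perturbed) shows that the mean gives the smaller value for this data. Dividing the loss by the number of related target participles would not change the minimizer.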
Based on the same inventive concept, the disclosure also provides a named entity recognition apparatus. Fig. 3 is a block diagram illustrating a named entity recognition apparatus according to an example embodiment. As shown in fig. 3, the apparatus may include:
a first determining module 31, configured to determine all possible real participles corresponding to the t-th target participle x_t in the text;
a second determining module 32, configured to determine, for each real participle determined by the first determining module, a first conditional probability p(a_d | l_i) that each participle state corresponds to the real participle, where a_d denotes the d-th real participle and l_i denotes the i-th participle state;
a third determining module 33, configured to determine a third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t, according to a second conditional probability p(x_t | a_d) that each real participle corresponds to the target participle x_t and the first conditional probability p(a_d | l_i) determined by the second determining module;
an identification module 34, configured to perform named entity recognition on the target participle x_t according to the third conditional probability p(x_t | l_i) determined by the third determining module.
Optionally, the second determining module includes:
a first determining sub-module, configured to determine, for each real participle, a fourth conditional probability p(a_d | x_t) that the target participle x_t corresponds to the real participle;
an estimation sub-module, configured to estimate the first conditional probability p(a_d | l_i) that each participle state corresponds to each real participle, according to the fourth conditional probability p(a_d | x_t), determined by the first determining sub-module, that the target participle x_t corresponds to each real participle.
Optionally, the estimation sub-module comprises:
a second determining sub-module, configured to determine, according to formulas (1) and (2) above, the y_i^d that makes d(z_t, y_i) satisfy a preset condition as the first conditional probability that each participle state corresponds to each real participle.
Optionally, the preset condition is that the loss function
Figure BDA0001902881500000161
is minimal, where T_i denotes the total number of target participles belonging to the i-th participle state l_i, L denotes the total number of participle states, and
Figure BDA0001902881500000162
indicates whether the i-th participle state is related to the target participle x_t: it is 1 if they are related and 0 otherwise.
Optionally, the third determining module includes:
a third determining sub-module, configured to determine, according to formula (3) above, the third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t.
Optionally, the identification module comprises:
a fourth determining sub-module, configured to determine the participle state corresponding to the maximum third conditional probability as the named entity recognition result of the target participle x_t.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
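The roles of the third determining module and the identification module can be sketched in Python as follows. The image for formula (3) is not reproduced in this text, so the sketch assumes that the third conditional probability is obtained by summing, over all real participles, the product of the second and first conditional probabilities. The function names, the state labels "B-PER" and "O", and the probability values are hypothetical.

```python
def third_conditional_probability(p_x_given_a, p_a_given_l):
    """Assumed form: p(x_t | l_i) = sum_d p(x_t | a_d) * p(a_d | l_i) for every state."""
    return {
        state: sum(p_x_given_a[real] * prob for real, prob in dist.items())
        for state, dist in p_a_given_l.items()
    }


def recognize(p_x_given_a, p_a_given_l):
    """Pick the participle state with the largest third conditional probability."""
    scores = third_conditional_probability(p_x_given_a, p_a_given_l)
    return max(scores, key=scores.get), scores


# Hypothetical second conditional probabilities p(x_t | a_d) for one target participle.
p_x_given_a = {"a1": 0.9, "a2": 0.1}
# Hypothetical first conditional probabilities p(a_d | l_i) for two states.
p_a_given_l = {
    "B-PER": {"a1": 0.8, "a2": 0.2},
    "O": {"a1": 0.1, "a2": 0.9},
}
best_state, scores = recognize(p_x_given_a, p_a_given_l)
```

With these sample values the state "B-PER" receives the larger score, so it would be returned as the recognition result for the target participle.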
Fig. 4 is a block diagram illustrating an electronic device 400 according to an example embodiment. As shown in fig. 4, the electronic device 400 may include: a processor 401 and a memory 402. The electronic device 400 may also include one or more of a multimedia component 403, an input/output (I/O) interface 404, and a communication component 405.
The processor 401 is configured to control the overall operation of the electronic device 400 so as to complete all or part of the steps of the named entity identification method. The memory 402 is used to store various types of data to support operation of the electronic device 400; such data may include, for example, instructions for any application or method operating on the electronic device 400 and application-related data such as contact data, messages, pictures, audio, and video. The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 403 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 402 or transmitted through the communication component 405. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules, such as a keyboard, a mouse, or buttons. The buttons may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the electronic device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, or a combination of one or more of them; accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the named entity identification method described above.
In another exemplary embodiment, a computer-readable storage medium is also provided, comprising program instructions which, when executed by a processor, carry out the steps of the above-mentioned method of identifying a named entity. For example, the computer readable storage medium may be the memory 402 comprising program instructions executable by the processor 401 of the electronic device 400 to perform the named entity identification method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments; various simple modifications may be made to the technical solution of the present disclosure within its technical concept, and all such simple modifications fall within the protection scope of the present disclosure.
It should be noted that the various features described in the foregoing embodiments may be combined in any suitable manner without contradiction. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, the various embodiments of the present disclosure may be combined in any manner, and such combinations should likewise be regarded as disclosed herein, as long as they do not depart from the spirit of the present disclosure.

Claims (10)

1. A method for identifying a named entity, comprising:
determining all possible real participles corresponding to the t-th target participle x_t in a text;
determining, for each real participle, a first conditional probability p(a_d | l_i) that each participle state corresponds to the real participle, wherein a_d denotes the d-th real participle and l_i denotes the i-th participle state;
determining, according to a second conditional probability p(x_t | a_d) that each real participle corresponds to the target participle x_t and the first conditional probability p(a_d | l_i), a third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t;
performing named entity recognition on the target participle x_t according to the third conditional probability p(x_t | l_i).
2. The method according to claim 1, wherein the determining, for each real participle, a first conditional probability p(a_d | l_i) that each participle state corresponds to the real participle comprises:
determining, for each real participle, a fourth conditional probability p(a_d | x_t) that the target participle x_t corresponds to the real participle;
estimating, according to the fourth conditional probability p(a_d | x_t) that the target participle x_t corresponds to each real participle, the first conditional probability p(a_d | l_i) that each participle state corresponds to each real participle.
3. The method according to claim 2, wherein the estimating, according to the fourth conditional probability p(a_d | x_t) that the target participle x_t corresponds to each real participle, the first conditional probability p(a_d | l_i) that each participle state corresponds to each real participle comprises:
determining, according to the following formulas (1) and (2), the y_i^d that makes d(z_t, y_i) satisfy a preset condition as the first conditional probability that each participle state corresponds to each real participle:
Figure FDA0003927895880000021
Figure FDA0003927895880000022
wherein D denotes the total number of real participles; z_t^d denotes the fourth conditional probability that the target participle x_t corresponds to the d-th real participle; y_i^d denotes the first conditional probability that the i-th participle state corresponds to the d-th real participle; z_t denotes the vector of fourth conditional probabilities that the target participle x_t corresponds to each real participle; y_i denotes the vector of first conditional probabilities that the i-th participle state corresponds to each real participle; and d(z_t, y_i) denotes the relative entropy of z_t and y_i.
4. The method according to claim 3, wherein the preset condition is that the loss function
Figure FDA0003927895880000027
is minimal, wherein T_i denotes the total number of target participles belonging to the i-th participle state l_i, L denotes the total number of participle states, and
Figure FDA0003927895880000028
indicates whether the i-th participle state is related to the target participle x_t: it is 1 if they are related and 0 otherwise.
5. The method according to any one of claims 1-4, wherein the determining, according to the second conditional probability p(x_t | a_d) that each real participle corresponds to the target participle x_t and the first conditional probability p(a_d | l_i), the third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t comprises:
determining, according to the following formula (3), the third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t:
Figure FDA0003927895880000029
Wherein D characterizes a total number of the true participles.
6. The method according to any one of claims 1-4, wherein the performing named entity recognition on the target participle x_t according to the third conditional probability p(x_t | l_i) comprises:
determining the participle state corresponding to the maximum third conditional probability as the named entity recognition result of the target participle x_t.
7. An apparatus for identifying named entities, comprising:
a first determining module, configured to determine all possible real participles corresponding to the t-th target participle x_t in a text;
a second determining module, configured to determine, for each real participle determined by the first determining module, a first conditional probability p(a_d | l_i) that each participle state corresponds to the real participle, wherein a_d denotes the d-th real participle and l_i denotes the i-th participle state;
a third determining module, configured to determine, according to a second conditional probability p(x_t | a_d) that each real participle corresponds to the target participle x_t and the first conditional probability p(a_d | l_i) determined by the second determining module, a third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t;
an identification module, configured to perform named entity recognition on the target participle x_t according to the third conditional probability p(x_t | l_i) determined by the third determining module.
8. The apparatus of claim 7, wherein the third determining module comprises:
a third determining sub-module, configured to determine, according to the following formula (3), the third conditional probability p(x_t | l_i) that each participle state corresponds to the target participle x_t:
Figure FDA0003927895880000031
Wherein D characterizes a total number of the true participles.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 6.
CN201811519563.6A 2018-12-12 2018-12-12 Named entity identification method and device, readable storage medium and electronic equipment Active CN109710927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811519563.6A CN109710927B (en) 2018-12-12 2018-12-12 Named entity identification method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811519563.6A CN109710927B (en) 2018-12-12 2018-12-12 Named entity identification method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109710927A CN109710927A (en) 2019-05-03
CN109710927B true CN109710927B (en) 2022-12-20

Family

ID=66256392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811519563.6A Active CN109710927B (en) 2018-12-12 2018-12-12 Named entity identification method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109710927B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013027A (en) * 2022-08-05 2023-04-25 航天神舟智慧系统技术有限公司 Group event early warning method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006178865A (en) * 2004-12-24 2006-07-06 Nippon Telegr & Teleph Corp <Ntt> Device, method and program for extracting intrinsic expression, and recording medium with the program recorded thereon
JP6077727B1 (en) * 2016-01-28 2017-02-08 楽天株式会社 Computer system, method, and program for transferring multilingual named entity recognition model
CN106776544A (en) * 2016-11-24 2017-05-31 四川无声信息技术有限公司 Character relation recognition methods and device and segmenting method
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation
CN107832476A (en) * 2017-12-01 2018-03-23 北京百度网讯科技有限公司 A kind of understanding method of search sequence, device, equipment and storage medium
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388559B (en) * 2018-02-26 2021-11-19 中译语通科技股份有限公司 Named entity identification method and system under geographic space application and computer program
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109710927A (en) 2019-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant