CN109815483B - Synthetic word recognition method and device, readable storage medium and electronic equipment - Google Patents

Synthetic word recognition method and device, readable storage medium and electronic equipment

Info

Publication number
CN109815483B
Authority
CN
China
Prior art keywords
domain
word
target
field
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811559551.6A
Other languages
Chinese (zh)
Other versions
CN109815483A (en)
Inventor
贾弼然
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201811559551.6A priority Critical patent/CN109815483B/en
Publication of CN109815483A publication Critical patent/CN109815483A/en
Application granted granted Critical
Publication of CN109815483B publication Critical patent/CN109815483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a synthetic word recognition method and device, a readable storage medium, and electronic equipment. The method comprises the following steps: calculating the domain deviation between the target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known; determining, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains; generating a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains; determining a role labeling result according to the text in the target domain, the target HMM model, and the Viterbi algorithm, wherein the role labeling result indicates the role state corresponding to each word in the text; and determining the synthetic words in the text of the target domain according to the role labeling result. In this way, the recognition accuracy of synthetic words can be improved, the precision and recall of word segmentation for a specific domain are improved, and labor is saved.

Description

Synthetic word recognition method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a method and apparatus for identifying a synthesized word, a readable storage medium, and an electronic device.
Background
With the spread of artificial intelligence applications, natural language processing has become increasingly important and widely used. In natural language processing engineering, Chinese word segmentation is the most basic and most important step. Text must be segmented before any further processing, so the accuracy of Chinese word segmentation directly affects the accuracy of all subsequent processing. In word segmentation for a specific domain, errors often occur in recognizing synthetic words; for example, a long synthetic word may be erroneously divided into several small segments. In the field of medical tuberculosis, for instance, "viral hepatitis b liver cirrhosis decompensation period" is a very common synthetic word, yet in other medical fields it may be mistakenly split into "b", "viral", "hepatitis", "liver cirrhosis", "decompensation", and "compensation period", which makes the recognition of the synthetic word inaccurate and thus the word segmentation incorrect. At present, accurate word segmentation for a specific domain requires collecting a large corpus for segmentation training and labeling the training corpus manually, which consumes both labor and time.
Disclosure of Invention
The disclosure aims to provide a synthetic word recognition method, a synthetic word recognition device, a readable storage medium and electronic equipment, so as to accurately recognize synthetic words in a specific field.
In order to achieve the above object, according to a first aspect of the present disclosure, there is provided a synthetic word recognition method. The method comprises the following steps:
calculating the domain deviation between the target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known;
determining, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains;
generating a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains;
determining a role labeling result according to the text in the target domain, the target HMM model and a Viterbi algorithm, wherein the role labeling result is used for indicating the role state corresponding to each word in the text;
and determining the synthetic words in the text of the target domain according to the role labeling result.
Optionally, the domain deviation between the target domain and a labeled domain is calculated by:
performing word segmentation on the text in the target domain and the text in the labeled domain, and determining a word segmentation set containing no duplicate words from the generated segmentation results;
and calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the word length corresponding to each word.
Optionally, calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length includes:
calculating the domain deviation between the target domain d_i and the labeled domain d_j according to formula (1), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (1) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length includes:
calculating the domain deviation between the target domain d_i and the labeled domain d_j according to formula (2), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (2) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, fr_Dk is the frequency with which it occurs in the labeled document set, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, determining at least one similar domain of the target domain from the labeled domains according to the domain deviations and a preset rule includes:
determining, in ascending order of domain deviation, the labeled domains corresponding to the smallest preset number of domain deviations as the similar domains;
or, determining at least one similar domain of the target domain from the labeled domains according to the domain deviations and a preset rule includes:
determining, as a similar domain, each labeled domain whose domain deviation is less than or equal to a domain deviation threshold.
Optionally, the HMM model includes three parameters: a state transition probability matrix, an observation probability matrix, and an initial state probability distribution;
generating the target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains includes:
determining the respective parameters of the target HMM model according to formulas (3) to (5):
A* = Σ_{k=1}^{p} w_k A_k (3)
B* = Σ_{k=1}^{p} w_k B_k (4)
π* = Σ_{k=1}^{p} w_k π_k (5)
where p is the total number of similar domains, A* is the state transition probability matrix of the target HMM model, B* is the observation probability matrix of the target HMM model, π* is the initial state probability distribution of the target HMM model, A_k is the state transition probability matrix of the HMM model corresponding to the k-th similar domain, B_k is the observation probability matrix of the HMM model corresponding to the k-th similar domain, π_k is the initial state probability distribution of the HMM model corresponding to the k-th similar domain, w_k is the weight corresponding to the k-th similar domain, and Σ_{k=1}^{p} w_k = 1.
Optionally, determining the synthetic words in the text of the target domain according to the role labeling result includes:
determining, according to the role labeling result and the known role state combinations that constitute synthetic words, the words or word combinations in the text that conform to such a role state combination as the synthetic words.
According to a second aspect of the present disclosure, there is provided a synthetic word recognition apparatus. The apparatus comprises:
a calculating module, configured to calculate the domain deviation between the target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known;
a first determining module, configured to determine, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains;
a generating module, configured to generate a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains;
a second determining module, configured to determine a role labeling result according to the text in the target domain, the target HMM model and the Viterbi algorithm, wherein the role labeling result is used for indicating the role state corresponding to each word in the text;
and a third determining module, configured to determine the synthetic words in the text of the target domain according to the role labeling result.
Optionally, the calculating module includes:
a processing submodule, configured to perform word segmentation on the text in the target domain and the text in the labeled domain, and to determine a word segmentation set containing no duplicate words from the generated segmentation results;
and a calculating submodule, configured to calculate the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length.
Optionally, the calculating submodule is configured to calculate the domain deviation between the target domain d_i and the labeled domain d_j according to formula (1), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (1) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, the calculating submodule is configured to calculate the domain deviation between the target domain d_i and the labeled domain d_j according to formula (2), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (2) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, fr_Dk is the frequency with which it occurs in the labeled document set, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, the first determining module is configured to determine, in ascending order of domain deviation, the labeled domains corresponding to the smallest preset number of domain deviations as the similar domains;
alternatively, the first determining module is configured to determine, as a similar domain, each labeled domain whose domain deviation is less than or equal to a domain deviation threshold.
Optionally, the HMM model includes three parameters: a state transition probability matrix, an observation probability matrix, and an initial state probability distribution;
the generating module is configured to determine the respective parameters of the target HMM model according to formulas (3) to (5):
A* = Σ_{k=1}^{p} w_k A_k (3)
B* = Σ_{k=1}^{p} w_k B_k (4)
π* = Σ_{k=1}^{p} w_k π_k (5)
where p is the total number of similar domains, A* is the state transition probability matrix of the target HMM model, B* is the observation probability matrix of the target HMM model, π* is the initial state probability distribution of the target HMM model, A_k is the state transition probability matrix of the HMM model corresponding to the k-th similar domain, B_k is the observation probability matrix of the HMM model corresponding to the k-th similar domain, π_k is the initial state probability distribution of the HMM model corresponding to the k-th similar domain, w_k is the weight corresponding to the k-th similar domain, and Σ_{k=1}^{p} w_k = 1.
Optionally, the third determining module is configured to determine, according to the role labeling result and the known role state combinations that constitute synthetic words, the words or word combinations in the text that conform to such a role state combination as the synthetic words.
According to a third aspect of the present disclosure there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of the first aspect of the disclosure.
According to the above technical solution, the domain deviation between the target domain and each labeled domain in the labeled document set is calculated, at least one similar domain of the target domain is determined from the labeled domains, a target HMM model is generated according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains, a role labeling result is then determined according to the text in the target domain, the target HMM model and the Viterbi algorithm, and finally the synthetic words in the text of the target domain are determined according to the role labeling result. In this way, several domains similar to the target domain can be identified and a target HMM model obtained from their HMM models, so that synthetic word recognition is performed with the target HMM model. The recognition accuracy of synthetic words is thus improved by exploiting several labeled domains similar to the target domain, without training a corpus, which in turn improves the precision and recall of word segmentation for the specific domain and saves labor.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
FIG. 1 is a flow chart of a method of synthesized word recognition provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a flow chart of one exemplary implementation of a manner in which domain deviation between a target domain and a labeled domain is calculated in a synthesized word recognition method provided in accordance with the present disclosure;
FIG. 3 is a block diagram of a synthesized word recognition apparatus provided in accordance with one embodiment of the present disclosure;
fig. 4 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.
Fig. 1 is a flowchart of a synthesized word recognition method provided according to one embodiment of the present disclosure. As shown in fig. 1, the method may include the following steps.
In step 11, domain deviations between the target domain and each of the labeled domains in the labeled document set are calculated, respectively.
The synthetic words in the labeled domains are known. The domain deviation characterizes the degree of difference of each labeled domain relative to the target domain: the smaller the domain deviation between a labeled domain and the target domain, the more similar the two domains are.
In step 12, at least one similar domain of the target domain is determined from the labeled domains according to the respective domain deviations and the preset rules.
Based on the calculated domain deviations, the similar domains of the target domain can be determined from the labeled domains in the labeled document set; for example, the labeled domains corresponding to a preset number of the smallest domain deviations may be determined as similar domains of the target domain. Suppose the labeled document set contains 5 labeled domains D1 to D5 whose domain deviations are F1 to F5 in order, with F1 to F5 gradually increasing; if two similar domains are needed, labeled domains D1 and D2 are determined to be the similar domains of the target domain.
In step 13, a target HMM model is generated according to the HMM model corresponding to each similar domain and the weight corresponding to the similar domain.
An HMM (Hidden Markov Model) is a statistical model used to describe hidden unknown parameters. An HMM is generally defined as a five-tuple: a role state set (the set of role state values), an observation set, a state transition probability matrix, an observation probability matrix, and an initial probability distribution. The five elements are tied together by the Viterbi algorithm: the observations are the algorithm's input, the sequence of role state values is its output, and the algorithm derives the output from the input by means of the HMM's three parameters, namely the state transition probability matrix, the observation probability matrix, and the initial probability distribution.
That is, when the observation set, the state transition probability matrix, the observation probability matrix and the initial probability distribution are known, the sequence of role state values can be solved with the Viterbi algorithm. Here the observation set is the text described in this disclosure, such as the text in the target domain. The role state set is the set of all possible role states and can be defined manually. For example, a role state set containing 6 role states {R, S, T, X, Y, Z} may be predefined, where R represents the word preceding a synthetic word (its left context), S represents the word following a synthetic word (its right context), T represents a word unrelated to any synthetic word, X represents the first word of a synthetic word, Y represents a middle word of a synthetic word, and Z represents the last word of a synthetic word.
The state transition probability matrix represents the probability of one role state transitioning to another; if the role state set contains 4 role states, the state transition probability matrix is a 4×4 matrix. For example, if the role state set is {R, S, T, X, Y, Z}, the state transition probability matrix is a 6×6 matrix indexed by RSTXYZ in both the row and column directions; if it is denoted A, then A[0][0] represents the probability that role state R transitions to role state R. The observation probability matrix, also called the emission probability matrix, represents the probability that a role state outputs a particular symbol (i.e., a word). The initial probability distribution represents the probability that the head of a sentence is in a given role state. The concepts and solution methods for the state transition probability matrix, the observation probability matrix and the initial probability distribution are well known to those skilled in the art; the foregoing is a brief explanation given for ease of understanding, and details are omitted here.
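For concreteness, the three parameters discussed above can be held in a small container like the following sketch (Python; the names STATES and HMM are illustrative choices, not taken from the patent):

```python
import numpy as np

# Illustrative role state set from the example above:
# R/S = left/right context of a synthetic word, T = unrelated word,
# X/Y/Z = first/middle/last word of a synthetic word.
STATES = ["R", "S", "T", "X", "Y", "Z"]

class HMM:
    """Container for the three HMM parameters."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # state transition matrix, |S| x |S|
        self.B = np.asarray(B, dtype=float)    # observation (emission) matrix, |S| x |V|
        self.pi = np.asarray(pi, dtype=float)  # initial state distribution, |S|
```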
The synthetic words in the labeled domains similar to the target domain are known, so their HMM models are known, that is, the three parameters (state transition probability matrix, observation probability matrix and initial probability distribution) are known. Therefore, from the HMM models of the similar domains, combined with the weight corresponding to each similar domain, a new state transition probability matrix, observation probability matrix and initial probability distribution can be calculated as the parameters of the target HMM model. For example, every similar domain may be given the same weight; alternatively, the smaller the domain deviation between a similar domain and the target domain, the greater its weight may be.
In step 14, a role labeling result is determined according to the text in the target domain, the target HMM model and the Viterbi algorithm.
The role labeling result indicates the role state corresponding to each word in the text.
As described above, the text in the target domain is used as the input of the Viterbi algorithm, and role labeling is performed with the three parameters of the target HMM model, yielding a sequence of role state values, i.e., the role labeling result. For example, if the role state set is {R, S, T, X, Y, Z}, one possible role labeling result is TRXYZST, meaning that the role state values of the 7 words in the text are T, R, X, Y, Z, S, T in turn.
In step 15, the synthetic words in the text of the target domain are determined according to the role labeling result.
The synthetic words in the text of the target domain are determined according to the role labeling result. For example, with the role state set {R, S, T, X, Y, Z} defined as above (left context, right context, unrelated word, first word, middle word and last word of a synthetic word), if the role labeling result is TRXYZST, it can be determined that the three words at the XYZ positions form one synthetic word, so recognition of the synthetic word is achieved.
According to the above technical solution, the domain deviation between the target domain and each labeled domain in the labeled document set is calculated, at least one similar domain of the target domain is determined from the labeled domains, a target HMM model is generated according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains, a role labeling result is then determined according to the text in the target domain, the target HMM model and the Viterbi algorithm, and finally the synthetic words in the text of the target domain are determined according to the role labeling result. In this way, several domains similar to the target domain can be identified and a target HMM model obtained from their HMM models, so that synthetic word recognition is performed with the target HMM model. The recognition accuracy of synthetic words is thus improved by exploiting several labeled domains similar to the target domain, without training a corpus, which in turn improves the precision and recall of word segmentation for the specific domain and saves labor.
To help those skilled in the art better understand the technical solutions provided by the embodiments of the present disclosure, the corresponding steps are described in detail below.
First, the manner of calculating the domain deviation between the target domain and a labeled domain is described by example. Note that what is described here is the calculation of the domain deviation between the target domain and one labeled domain; when the labeled document set contains multiple labeled domains, the domain deviation is determined for each labeled domain in the set in the same manner.
In one possible embodiment, as shown in fig. 2, step 11 may include the following steps.
In step 21, word segmentation is performed on the text in the target domain and the text in the labeled domain, and a word segmentation set containing no duplicate words is determined from the generated segmentation results.
First, the text in the target domain and the text in the labeled domain are segmented with an existing word segmentation model; for example, a word segmentation model trained on the People's Daily corpus may be used to segment both texts. From the resulting segmentation results, a word segmentation set is determined, containing all the words that appear in the segmentation result of the target-domain text and in that of the labeled-domain text, with no duplicates. For example, if segmenting the target-domain text yields e1, e2, e3, e4 and e5, and segmenting the labeled-domain text yields e5, e6, e4, e7, e3 and e2, the resulting word segmentation set is {e1, e2, e3, e4, e5, e6, e7}.
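As a sketch of this step (an illustration only: the patent does not prescribe a particular segmenter, so `segment` below stands in for any existing word segmentation model):

```python
from collections import Counter

def word_set_and_freqs(target_text, labeled_text, segment):
    """Segment both texts, build the duplicate-free word segmentation set,
    and compute each word's relative frequency in each text."""
    tgt = segment(target_text)   # e.g. [e1, e2, e3, e4, e5]
    lab = segment(labeled_text)  # e.g. [e5, e6, e4, e7, e3, e2]
    words = sorted(set(tgt) | set(lab))  # {e1, ..., e7}, no duplicates
    tgt_counts, lab_counts = Counter(tgt), Counter(lab)
    fr_i = [tgt_counts[w] / len(tgt) for w in words]  # frequencies in the target domain
    fr_j = [lab_counts[w] / len(lab) for w in words]  # frequencies in the labeled domain
    return words, fr_i, fr_j
```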
In step 22, a domain deviation between the target domain and the labeled domain is calculated based at least on the frequency of occurrence of each word in the set of words in the text under the target domain and the frequency of occurrence of each word in the text under the labeled domain, and the word length corresponding to the word.
In one possible embodiment, the domain deviation between the target domain d_i and the labeled domain d_j may be calculated according to the following formula (1), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (1) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, and l(k) is the word length of the k-th word in the word segmentation set.
The first calculation coefficient q may be a preset constant and may be set manually; for example, q may take a value in [2,4]. Likewise, the second calculation coefficient α may be a preset constant set manually, for example taking a value in [2,4].
In this way, when calculating the domain deviation, the frequency of each word in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length are used to determine the domain deviation between the two domains, reflecting their degree of difference and providing a basis for the subsequent selection of similar domains.
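Since formula (1) itself is published only as an image, the sketch below is a hypothetical stand-in that is consistent with the variables described above (word-length-weighted frequency differences with tuning coefficients q and α); it is an assumption, not the patent's exact formula:

```python
def domain_deviation(words, fr_i, fr_j, q=2.0, alpha=2.0):
    """Hypothetical stand-in for formula (1): the deviation grows when the
    two domains use the shared vocabulary with different frequencies, and
    longer (typically more domain-specific) words are weighted more heavily.
    q and alpha play the roles of the first and second calculation coefficients."""
    total = 0.0
    for k, word in enumerate(words):
        l_k = len(word)  # word length l(k)
        total += alpha * l_k * abs(fr_i[k] - fr_j[k]) ** q
    return total
```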
In another possible embodiment, the domain deviation between the target domain d_i and the labeled domain d_j may be calculated according to the following formula (2), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (2) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, fr_Dk is the frequency with which it occurs in the labeled document set, and l(k) is the word length of the k-th word in the word segmentation set.
In this way, in addition to the word frequencies in the texts of the target domain and the labeled domain and the word lengths, the frequency of each word in the labeled document set as a whole is introduced. This reduces the influence of frequently occurring words unrelated to synthetic words on the deviation calculation, so that the computed domain deviation reflects the difference between domains more accurately and the similar domains determined for the target domain in the subsequent steps are more accurate.
The following illustrates how, in step 12, at least one similar domain of the target domain is determined from the labeled domains according to the domain deviations and a preset rule.
In one possible embodiment, step 12 may comprise the steps of:
and determining labeled fields corresponding to the field deviation ranked in the preset number as similar fields according to the sequence of the field deviations from small to large.
After the domain deviation between the target domain and each marked domain in the marked document set is calculated, the marked domains corresponding to the domain deviation ranked in the preset number are determined to be similar domains according to the order of the domain deviation from small to large.
In this way, a preset number of labeled domains that are most similar to the target domain can be selected from the labeled document set, so that the similar domains can be used for synthetic word recognition in the subsequent steps.
In another possible embodiment, step 12 may comprise the steps of:
and determining the marked domain corresponding to the domain deviation less than or equal to the domain deviation threshold as the similar domain.
When determining the similar domains of the target domain, a domain deviation threshold may be set (for example, manually). After the domain deviation between the target domain and each labeled domain in the labeled document set has been calculated, every labeled domain whose domain deviation is less than or equal to the threshold is determined to be a similar domain of the target domain.
In this way, using the domain deviation threshold, the labeled domains whose domain deviations do not exceed the threshold are determined to be similar domains of the target domain, so that the similar domains can be used for synthetic word recognition in the subsequent steps.
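Both selection rules reduce to a sort over the computed deviations. A minimal sketch (the function and parameter names are illustrative):

```python
def similar_domains(deviations, top_n=None, threshold=None):
    """Select similar domains either as the top_n smallest deviations
    or as all domains whose deviation is <= threshold.
    `deviations` maps a labeled-domain id to FD(target, domain)."""
    ranked = sorted(deviations.items(), key=lambda item: item[1])
    if top_n is not None:
        return [domain for domain, _ in ranked[:top_n]]
    return [domain for domain, fd in ranked if fd <= threshold]
```

For example, with the D1 to D5 scenario above, `similar_domains({"D1": 0.1, "D2": 0.2, "D3": 0.5, "D4": 0.7, "D5": 0.9}, top_n=2)` returns `["D1", "D2"]`.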
An HMM model is obtained by statistical calculation over the segmented corpus of a domain: the relevant frequencies and conditional probabilities are computed to obtain the corresponding parameters. As noted above, the synthetic words of the determined similar domains of the target domain are known, so each similar domain corresponds to a trained HMM model in which the three parameters (state transition probability matrix, observation probability matrix and initial state probability distribution) are known. Therefore, after at least one similar domain of the target domain has been determined, a new HMM model, i.e., the target HMM model, can be derived from the existing HMM models of the similar domains.
It should be noted that calculating an HMM model from a segmented corpus is a method known to those skilled in the art and is not described here in detail.
Thus, in one possible implementation, generating the target HMM model in step 13 according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains may include the following step:
determining the respective parameters of the target HMM model according to formulas (3) to (5):
A* = Σ_{k=1}^{p} w_k A_k (3)
B* = Σ_{k=1}^{p} w_k B_k (4)
π* = Σ_{k=1}^{p} w_k π_k (5)
where p is the total number of similar domains, A* is the state transition probability matrix of the target HMM model, B* is the observation probability matrix of the target HMM model, π* is the initial state probability distribution of the target HMM model, A_k is the state transition probability matrix of the HMM model corresponding to the k-th similar domain, B_k is the observation probability matrix of the HMM model corresponding to the k-th similar domain, π_k is the initial state probability distribution of the HMM model corresponding to the k-th similar domain, w_k is the weight corresponding to the k-th similar domain, and Σ_{k=1}^{p} w_k = 1.
Knowing the state transition probability matrix of each similar domain and the weight corresponding to each similar domain, a new matrix can be calculated and used as the state transition probability matrix of the target HMM model; the observation probability matrix and the initial state probability distribution of the target HMM model are calculated in the same way. The weights of the similar domains may be set manually, provided only that they sum to 1. In one possible implementation, every similar domain is given the same weight; in another, the smaller the domain deviation between a similar domain and the target domain, the greater its weight.
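A minimal sketch of formulas (3) to (5), reusing the HMM container from the earlier sketch; the weights may be uniform or deviation-based, as long as they sum to 1:

```python
def combine_hmms(hmms, weights):
    """Weighted combination of the similar domains' HMM parameters
    (formulas (3) to (5)); the weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    A = sum(w * h.A for w, h in zip(weights, hmms))
    B = sum(w * h.B for w, h in zip(weights, hmms))
    pi = sum(w * h.pi for w, h in zip(weights, hmms))
    return HMM(A, B, pi)
```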
In this way, the existing HMM models of the similar domains are combined, via the weights of those domains, into a new target HMM model. This reduces the one-sidedness of relying on the HMM model of a single similar domain, increases model diversity, and has a positive effect on the accuracy of synthetic word recognition in the subsequent steps.
After the target HMM model has been determined, step 14 may be performed, i.e., the role labeling result is determined according to the text in the target domain, the target HMM model and the Viterbi algorithm. The process of obtaining a state value sequence from text and an HMM model with the Viterbi algorithm is well known to those skilled in the art and is not described here.
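For completeness, a compact Viterbi decoder over the target HMM is sketched below; this is a standard textbook implementation, not anything patent-specific, and it reuses `np`, `STATES` and the HMM container from the earlier sketches, assuming each word has been mapped to a column index of B:

```python
def viterbi(obs_ids, hmm):
    """Return the most likely role-state sequence for the observed
    word ids under the given HMM (standard Viterbi decoding)."""
    n = len(hmm.pi)
    T = len(obs_ids)
    delta = np.zeros((T, n))            # best path probability ending in each state
    psi = np.zeros((T, n), dtype=int)   # backpointers
    delta[0] = hmm.pi * hmm.B[:, obs_ids[0]]
    for t in range(1, T):
        for s in range(n):
            scores = delta[t - 1] * hmm.A[:, s]
            psi[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[psi[t, s]] * hmm.B[s, obs_ids[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return [STATES[s] for s in reversed(path)]
```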
After the role labeling result has been obtained, step 15 may be performed. In one possible embodiment, step 15 may include the following step:
determining, according to the role labeling result and the known role state combinations that constitute synthetic words, the words or word combinations in the text that conform to such a role state combination as the synthetic words.
For example, let the defined role state set be {R, S, T, X, Y, Z, O}, whose states in turn represent the left context of a synthetic word, the right context of a synthetic word, a word unrelated to any synthetic word, the first word of a synthetic word, a middle word of a synthetic word, the last word of a synthetic word, and a standalone synthetic word, and let the known role state combinations constituting synthetic words be XYZ and O. Then, if the role labeling result is TRXYZST, the three words at the XYZ positions are determined to form one synthetic word. If the role labeling result is TRXYZSTROST, the three words at the XYZ positions form one synthetic word and the word at the O position is itself a synthetic word, thereby achieving synthetic word recognition.
In this way, by combining the role labeling result with the known role state combinations that constitute synthetic words, the synthetic words in the text can be determined, i.e., recognized.
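A sketch of this extraction step, assuming the X (Y ...) Z pattern and the standalone state O from the example above (the function name is illustrative):

```python
def extract_synthetic_words(tokens, labels):
    """Scan the role labels for known synthetic-word patterns:
    spans of the form X Y* Z, and standalone O tokens."""
    results, i = [], 0
    while i < len(labels):
        if labels[i] == "O":                     # standalone synthetic word
            results.append(tokens[i])
            i += 1
        elif labels[i] == "X":                   # possible start of a span
            j = i + 1
            while j < len(labels) and labels[j] == "Y":
                j += 1
            if j < len(labels) and labels[j] == "Z":
                results.append("".join(tokens[i:j + 1]))
                i = j + 1
            else:                                # no matching Z; skip the X
                i += 1
        else:
            i += 1
    return results
```

For instance, with the labels TRXYZST, the three tokens labeled X, Y and Z are joined into a single synthetic word.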
Fig. 3 is a block diagram of a synthesized word recognition apparatus provided according to one embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 includes:
a calculating module 31, configured to calculate the domain deviation between a target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known;
a first determining module 32, configured to determine, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains;
a generating module 33, configured to generate a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains;
a second determining module 34, configured to determine a role labeling result according to the text in the target domain, the target HMM model and the Viterbi algorithm, wherein the role labeling result indicates the role state corresponding to each word in the text;
and a third determining module 35, configured to determine the synthetic words in the text of the target domain according to the role labeling result.
Optionally, the calculating module 31 includes:
a processing submodule, configured to perform word segmentation on the text in the target domain and the text in the labeled domain, and to determine a word segmentation set containing no duplicate words from the generated segmentation results;
and a calculating submodule, configured to calculate the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length.
Optionally, the calculating submodule is configured to calculate the domain deviation between the target domain d_i and the labeled domain d_j according to formula (1), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (1) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, the calculating submodule is configured to calculate the domain deviation between the target domain d_i and the labeled domain d_j according to formula (2), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (2) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, fr_Dk is the frequency with which it occurs in the labeled document set, and l(k) is the word length of the k-th word in the word segmentation set.
Optionally, the first determining module 32 is configured to determine, in ascending order of domain deviation, the labeled domains corresponding to the smallest preset number of domain deviations as the similar domains;
alternatively, the first determining module 32 is configured to determine, as a similar domain, each labeled domain whose domain deviation is less than or equal to a domain deviation threshold.
Optionally, the HMM model includes three parameters of a state transition probability matrix, an observation probability matrix, and an initial state probability distribution;
The generating module 33 is configured to determine the respective parameters of the target HMM model according to formulas (3) to (5):
A* = Σ_{k=1}^{p} w_k A_k (3)
B* = Σ_{k=1}^{p} w_k B_k (4)
π* = Σ_{k=1}^{p} w_k π_k (5)
where p is the total number of similar domains, A* is the state transition probability matrix of the target HMM model, B* is the observation probability matrix of the target HMM model, π* is the initial state probability distribution of the target HMM model, A_k is the state transition probability matrix of the HMM model corresponding to the k-th similar domain, B_k is the observation probability matrix of the HMM model corresponding to the k-th similar domain, π_k is the initial state probability distribution of the HMM model corresponding to the k-th similar domain, w_k is the weight corresponding to the k-th similar domain, and Σ_{k=1}^{p} w_k = 1.
Optionally, the third determining module 35 is configured to determine, according to the role labeling result and the known role state combinations that constitute synthetic words, the words or word combinations in the text that conform to such a role state combination as the synthetic words.
The specific manner in which the various modules of the apparatus in the above embodiment perform their operations has been described in detail in the method embodiments and is not elaborated here.
Fig. 4 is a block diagram of an electronic device according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 4, the electronic device 1900 includes one or more processors 1922 and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. The processor 1922 may be configured to execute the computer program to perform the synthetic word recognition method described above.
In addition, the electronic device 1900 may further include a power component 1926 configured to perform power management of the electronic device 1900, and a communication component 1950 configured to enable wired or wireless communication of the electronic device 1900. The electronic device 1900 may also include an input/output (I/O) interface 1958, and may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, or the like.
In another exemplary embodiment, a computer readable storage medium is also provided, which includes program instructions that, when executed by a processor, implement the steps of the synthetic word recognition method described above. For example, the computer readable storage medium may be the memory 1932 including the program instructions described above, which are executable by the processor 1922 of the electronic device 1900 to perform the synthetic word recognition method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present disclosure within the scope of the technical concept of the present disclosure, and all the simple modifications belong to the protection scope of the present disclosure.
In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations are not described further in this disclosure in order to avoid unnecessary repetition.
Moreover, the various embodiments of the present disclosure may be combined in any manner that does not depart from its spirit, and such combinations should likewise be regarded as content disclosed herein.

Claims (10)

1. A synthetic word recognition method, the method comprising:
calculating the domain deviation between the target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known;
determining, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains;
generating a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains;
determining a role labeling result according to the text in the target domain, the target HMM model and a Viterbi algorithm, wherein the role labeling result is used for indicating the role state corresponding to each word in the text;
and determining the synthetic words in the text of the target domain according to the role labeling result.
2. The method of claim 1, wherein the domain deviation between the target domain and the labeled domain is calculated by:
performing word segmentation on the text in the target domain and the text in the labeled domain, and determining a word segmentation set containing no duplicate words from the generated segmentation results;
and calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length.
3. The method according to claim 2, wherein calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length comprises:
calculating the domain deviation between the target domain d_i and the labeled domain d_j according to formula (1), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (1) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, and l(k) is the word length of the k-th word in the word segmentation set.
4. The method according to claim 2, wherein calculating the domain deviation between the target domain and the labeled domain at least according to the frequency with which each word of the word segmentation set occurs in the text of the target domain, its frequency in the text of the labeled domain, and the corresponding word length comprises:
calculating the domain deviation between the target domain d_i and the labeled domain d_j according to formula (2), the domain deviation between d_i and d_j being denoted FD(d_i, d_j):
[formula (2) appears as an image in the original publication]
where q is a first calculation coefficient, α is a second calculation coefficient, m is the total number of words in the word segmentation set, fr_ik is the frequency with which the k-th word of the word segmentation set occurs in the text of the target domain d_i, fr_jk is the frequency with which it occurs in the text of the labeled domain d_j, fr_Dk is the frequency with which it occurs in the labeled document set, and l(k) is the word length of the k-th word in the word segmentation set.
5. The method according to claim 1, wherein determining at least one similar domain of the target domain from the labeled domains according to the domain deviations and a preset rule comprises:
determining, in ascending order of domain deviation, the labeled domains corresponding to the smallest preset number of domain deviations as the similar domains;
or, determining at least one similar domain of the target domain from the labeled domains according to the domain deviations and a preset rule comprises:
determining, as a similar domain, each labeled domain whose domain deviation is less than or equal to a domain deviation threshold.
6. The method according to claim 1, wherein the HMM model comprises three parameters: a state transition probability matrix, an observation probability matrix, and an initial state probability distribution;
generating the target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains comprises:
determining the respective parameters of the target HMM model according to formulas (3) to (5):
A* = Σ_{k=1}^{p} w_k A_k (3)
B* = Σ_{k=1}^{p} w_k B_k (4)
π* = Σ_{k=1}^{p} w_k π_k (5)
where p is the total number of similar domains, A* is the state transition probability matrix of the target HMM model, B* is the observation probability matrix of the target HMM model, π* is the initial state probability distribution of the target HMM model, A_k is the state transition probability matrix of the HMM model corresponding to the k-th similar domain, B_k is the observation probability matrix of the HMM model corresponding to the k-th similar domain, π_k is the initial state probability distribution of the HMM model corresponding to the k-th similar domain, w_k is the weight corresponding to the k-th similar domain, and Σ_{k=1}^{p} w_k = 1.
7. The method according to any one of claims 1-6, wherein determining the synthetic words in the text of the target domain according to the role labeling result comprises:
determining, according to the role labeling result and the known role state combinations that constitute synthetic words, the words or word combinations in the text that conform to such a role state combination as the synthetic words.
8. A synthetic word recognition apparatus, the apparatus comprising:
a calculating module, configured to calculate the domain deviation between the target domain and each labeled domain in a labeled document set, wherein the synthetic words in the labeled domains are known;
a first determining module, configured to determine, according to the domain deviations and a preset rule, at least one similar domain of the target domain from the labeled domains;
a generating module, configured to generate a target HMM model according to the HMM models corresponding to the similar domains and the weights corresponding to the similar domains;
a second determining module, configured to determine a role labeling result according to the text in the target domain, the target HMM model and the Viterbi algorithm, wherein the role labeling result is used for indicating the role state corresponding to each word in the text;
and a third determining module, configured to determine the synthetic words in the text of the target domain according to the role labeling result.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1-7.
CN201811559551.6A 2018-12-19 2018-12-19 Synthetic word recognition method and device, readable storage medium and electronic equipment Active CN109815483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811559551.6A CN109815483B (en) 2018-12-19 2018-12-19 Synthetic word recognition method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811559551.6A CN109815483B (en) 2018-12-19 2018-12-19 Synthetic word recognition method and device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109815483A CN109815483A (en) 2019-05-28
CN109815483B true CN109815483B (en) 2023-08-08

Family

ID=66601650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811559551.6A Active CN109815483B (en) 2018-12-19 2018-12-19 Synthetic word recognition method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109815483B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969009B (en) * 2019-12-03 2023-10-13 哈尔滨工程大学 Word segmentation method for Chinese natural language text

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1083195A (en) * 1996-09-09 1998-03-31 Oki Electric Ind Co Ltd Input language recognition device and input language recognizing method
CN101154226A (en) * 2006-09-27 2008-04-02 腾讯科技(深圳)有限公司 Method for adding unlisted word to word stock of input method and its character input device
CN107807910A (en) * 2017-10-10 2018-03-16 昆明理工大学 A kind of part-of-speech tagging method based on HMM
CN107861940A (en) * 2017-10-10 2018-03-30 昆明理工大学 A kind of Chinese word cutting method based on HMM
CN108170680A (en) * 2017-12-29 2018-06-15 厦门市美亚柏科信息股份有限公司 Keyword recognition method, terminal device and storage medium based on Hidden Markov Model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on single-sample face recognition technology based on virtual image generation and fused HMM; 刘宵; China Master's Theses Full-text Database, Information Science and Technology Section; 2009-09-15 (No. 9); abstract, pp. 47-54 *

Also Published As

Publication number Publication date
CN109815483A (en) 2019-05-28

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant