CN111144096B - Pinyin completion training method, completion model, completion method and completion input method based on HMM - Google Patents

Info

Publication number
CN111144096B
Authority
CN
China
Prior art keywords
pinyin
probability
completion
hmm
strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911265387.2A
Other languages
Chinese (zh)
Other versions
CN111144096A (en
Inventor
王兴维
邰从越
刘龙
史黎鑫
刘慧芳
王慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Senyint International Digital Medical System Dalian Co ltd
Original Assignee
Senyint International Digital Medical System Dalian Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Senyint International Digital Medical System Dalian Co ltd filed Critical Senyint International Digital Medical System Dalian Co ltd
Priority to CN201911265387.2A priority Critical patent/CN111144096B/en
Publication of CN111144096A publication Critical patent/CN111144096A/en
Application granted granted Critical
Publication of CN111144096B publication Critical patent/CN111144096B/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An HMM-based pinyin completion training method, completion model, completion method and completion input method belong to the field of medical information and address the problem of improving completion speed and accuracy. The training method comprises the following steps: S1, acquiring legal complete single-word pinyin strings; S2, obtaining the possible to-be-completed pinyin strings corresponding to each legal complete single-word pinyin string; S3, obtaining the correspondence between the to-be-completed pinyin strings and the complete pinyin strings; S4, acquiring the complete pinyin content corresponding to the training data, with the pinyin content separated by single word; S5, performing statistical learning on the training data to obtain the initial probability, emission probability and transition probability. The effect is that traversing all completion results is avoided, which improves matching efficiency.

Description

Pinyin completion training method, completion model, completion method and completion input method based on HMM
Technical Field
The invention belongs to the field of medical information, and relates to a pinyin completion training method, a completion model, a completion method and a completion input method based on an HMM.
Background
Along with the progress of informatization construction of hospitals, doctors inevitably need to input various text contents such as electronic medical records, examination reports and the like in the office process. The input method is used as a main entrance for interaction between doctors and computers, and the accuracy and the applicability of the input method have great influence on the working efficiency of the doctors.
At present, most doctors use pinyin input methods designed for the general domain. Moreover, since most doctors type relatively slowly, they generally want to obtain the expected Chinese character result with as little input as possible.
When a user inputs partial pinyin content, such as pinyin initials or pinyin prefixes, a well-designed pinyin input method should infer the complete pinyin content according to a predetermined rule or algorithm and give the corresponding candidate Chinese characters.
A conventional pinyin input method performs completion based on phrase frequency, so when the input is long it is difficult to give an accurate candidate result. For example, when "yxxt" is input, the completion result obtained from phrase frequency may be "yi xia xi teng", so the best candidate given may be "under system", which is obviously not the expected result. Such results not only reduce the efficiency of the doctor's text input but also increase the likelihood of entering erroneous information.
In a common pinyin input method, pinyin is completed based on frequency information. The main process is as follows: the pinyin in the training data and its frequency are counted and stored in a suitable data structure such as a Trie; when the input is incomplete pinyin, prefix matching is performed against the stored information, the complete pinyin with the highest frequency among the matches is taken as the completion result, and the corresponding Chinese character candidates are given.
In a specific implementation, the data storage structure and the matching process are optimized according to different strategies; the process is described here using one method as an example. First, the training data is cleaned and segmented into words; if the original content is "pancreas morphology is normal", a possible word segmentation result is "pancreas / morphology / normal". Then the pinyin corresponding to each phrase in the training data is obtained; if the original phrase is "pancreas", the corresponding pinyin result "yi xin" is obtained. The frequency of every phrase is counted, and the "phrase + pinyin + frequency" information is stored in a specified data structure. These steps yield a phrase library based on the training data, containing the corresponding pinyin, frequency and other information. When "yxxt" is input, the input method splits it according to a specified pinyin composition rule to obtain "y* x* x* t*", where "*" indicates that one or more arbitrary letters can be matched. Matching "y* x* x* t*" against the existing phrase library may yield several combinations such as "yi xia xi teng" and "yi xin xing tai". The matched phrase pinyin combinations are sorted by frequency, the one with the highest frequency is selected as the completion result, and pinyin-to-Chinese-character conversion is performed on that result. This is the general flow of pinyin completion in a conventional pinyin input method.
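The conventional frequency-based flow can be illustrated with a minimal sketch. The phrase library contents and frequencies below are invented for illustration, `complete_by_frequency` is a hypothetical name, and a plain list scan stands in for the Trie mentioned in the text; each input letter is treated as a prefix that must be extended by one or more letters.

```python
import re

# Hypothetical phrase library: (syllables, frequency). All values are made up.
PHRASE_LIBRARY = [
    (("yi", "xia", "xi", "teng"), 120),
    (("yi", "xin", "xing", "tai"), 45),
    (("yi", "xue", "xi", "tong"), 30),
    (("gan", "zang", "xing", "tai"), 80),
]


def complete_by_frequency(segments, library=PHRASE_LIBRARY):
    """Return the most frequent phrase whose syllables extend the given prefixes."""
    # Each segment must be followed by one or more further letters.
    patterns = [re.compile(re.escape(seg) + r"[a-z]+") for seg in segments]
    matches = [
        (freq, syllables)
        for syllables, freq in library  # every entry is visited on every lookup
        if len(syllables) == len(patterns)
        and all(p.fullmatch(s) for p, s in zip(patterns, syllables))
    ]
    if not matches:
        return None
    return max(matches)[1]  # highest frequency wins


best = complete_by_frequency(["y", "x", "x", "t"])
print(" ".join(best))
```

Because only the phrase frequency decides the winner, a frequent but contextually wrong phrase is returned whenever it matches the prefixes, which is the weakness discussed in the text.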
In the pinyin complement process of the common pinyin input method, the following problems mainly exist:
First, because a fuzzy-matching approach is used, every match must traverse all possible completion results and combinations; when the input is short and the phrase library is large, matching efficiency suffers.
Second, because the method screens candidates by phrase frequency, the candidates depend entirely on the training data; if the training data contains word segmentation errors, or the phrase does not exist in the training data, completion cannot be performed well.
Third, when the incomplete pinyin string is long, it is difficult to match a suitable candidate directly in the phrase library unless the input string is processed further; if the input string is processed before matching, the additional processing reduces completion efficiency. In addition, candidate selection depends only on the frequency of each phrase and ignores the co-occurrence probability between phrases, which reduces the accuracy of completion results for long pinyin strings.
Disclosure of Invention
To improve completion speed and accuracy, the invention provides a pinyin completion method based on an HMM (Hidden Markov Model). By performing statistical learning on a large number of medical texts, it acquires statistical information such as the pronunciation probability of polyphonic characters in specific contexts and the probability of characters forming words, and performs pinyin completion based on this information, which improves completion speed and greatly improves the accuracy of the completion result.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a training method for pinyin completion based on HMM comprises the following steps:
S1, acquiring legal complete single-word pinyin strings;
S2, obtaining the possible to-be-completed pinyin strings corresponding to each legal complete single-word pinyin string;
S3, obtaining the correspondence between the to-be-completed pinyin strings and the complete pinyin strings;
S4, acquiring the complete pinyin content corresponding to the training data, with the pinyin content separated by single word;
S5, performing statistical learning on the training data to obtain the initial probability, emission probability and transition probability.
Further:
Step S1 is: obtaining the legal complete single-word pinyin strings according to the pinyin composition rules;
Step S2 is: taking the prefixes of each single-word pinyin, excluding those prefixes that are themselves legal pinyin, and adding the single-word pinyin itself, as the possible to-be-completed pinyin strings corresponding to that single-word pinyin;
Step S3 is: for each to-be-completed pinyin string, collating the correspondence between it and the complete pinyin strings it can be completed to.
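Steps S1 to S3 can be sketched as follows. This is one possible reading of the steps, not the patented implementation: the syllable set is a tiny illustrative subset, `build_completion_map` is a hypothetical name, and prefixes that are themselves legal pinyin are assumed to map only to themselves.

```python
from collections import defaultdict

# S1: a tiny illustrative subset of legal single-word pinyin (assumption).
LEGAL_PINYIN = {"a", "an", "ang", "e", "en", "shu", "shi", "xi", "xin", "xing"}


def build_completion_map(legal=LEGAL_PINYIN):
    """Map every to-be-completed string to the complete pinyin it may stand for."""
    completions = defaultdict(set)
    for syllable in legal:
        # S2: the complete syllable is itself a valid input.
        completions[syllable].add(syllable)
        # S2: proper prefixes count only if they are not legal pinyin themselves.
        for end in range(1, len(syllable)):
            prefix = syllable[:end]
            if prefix not in legal:
                completions[prefix].add(syllable)
    # S3: collate the correspondence in a stable order.
    return {key: sorted(values) for key, values in completions.items()}


completion_map = build_completion_map()
print(completion_map["x"])
print(completion_map["sh"])
```

Under this reading, an ambiguous prefix such as "x" expands to several candidates, while a string that is already legal pinyin, such as "xi", is kept as it is.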
The invention also relates to a pinyin complement model obtained by the training method.
The invention also relates to a pinyin completion model, comprising:
a pinyin string acquisition module, which holds the legal complete single-word pinyin strings;
a to-be-completed pinyin string module, which holds the possible to-be-completed pinyin strings corresponding to the legal complete single-word pinyin strings;
a correspondence module, which establishes the correspondence between the to-be-completed pinyin strings and the complete pinyin strings;
a separation module, which obtains the complete pinyin content corresponding to the training data and separates it by single word;
a statistical learning module, which performs statistical learning on the training data to acquire the initial probability, emission probability and transition probability.
The invention also relates to a pinyin completion method based on the HMM, comprising the following steps:
acquiring the input string of a user and segmenting it according to the composition principle of pinyin;
performing pinyin completion on the segmented input string using the pinyin completion model.
Further, the pinyin completion method is as follows: using the segmented input string, calculate the occurrence probability of every possible completion result for each input segment and the probability that each completion result of one segment occurs after each completion result of the preceding segment; calculate the score of each complete completion result; and take the one with the highest score as the final completion result.
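The scoring described above can be sketched with a small dynamic program over completion candidates. All probabilities, the candidate table, the `FLOOR` smoothing value and the function name `complete` are assumptions for illustration; the emission probability is treated as uniform over the listed candidates to keep the sketch short.

```python
import math

# Hypothetical model values; a real model would come from training steps S1-S5.
COMPLETIONS = {"y": ["yi", "yin"], "x": ["xin", "xing", "xia"], "t": ["tai", "tong"]}
START = {"yi": 0.6, "yin": 0.4}
TRANS = {
    ("yi", "xin"): 0.5, ("yi", "xia"): 0.3, ("yin", "xing"): 0.2,
    ("xin", "xing"): 0.6, ("xia", "xing"): 0.1,
    ("xing", "tai"): 0.7, ("xing", "tong"): 0.1,
}
FLOOR = 1e-8  # probability assigned to unseen events (assumption)


def complete(segments):
    """Return the highest-scoring sequence of complete pinyin for the segments."""
    # best[s] = (log score, path) of the best partial completion ending in s
    best = {s: (math.log(START.get(s, FLOOR)), [s]) for s in COMPLETIONS[segments[0]]}
    for seg in segments[1:]:
        step = {}
        for cand in COMPLETIONS[seg]:
            # Keep only the best predecessor for each candidate.
            score, path = max(
                (prev_score + math.log(TRANS.get((prev, cand), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[cand] = (score, path + [cand])
        best = step
    return max(best.values())[1]


print(" ".join(complete(["y", "x", "x", "t"])))
```

Keeping one best path per candidate at each step means the number of scored paths grows with the number of candidates per segment rather than with the number of full combinations.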
The invention also relates to a pinyin completion input method based on the HMM, characterized in that the input string is completed according to the completion method, the completion result is converted into Chinese characters, and the Chinese character string is output.
The beneficial effects are that: according to the invention, the statistical probability information in the large-scale training data is obtained by learning, the probability information corresponding to all possible completion results is comprehensively considered by adopting an HMM algorithm in the pinyin completion process, the comprehensive score of each result is calculated by combining the context, and finally the candidate result is screened according to the score. Because the score calculation can be carried out by adopting a corresponding algorithm based on the pre-trained model information in the process of searching the candidate results, the process of traversing all the complement results is avoided, and the matching efficiency is improved. Meanwhile, in the matching process, the score is comprehensively calculated by combining the information such as the occurrence probability of the single word, the co-occurrence probability of the context and the like, so that a completion result can be given under the condition of encountering word segmentation errors of training data. When the input incomplete pinyin string is longer, the completion method based on the HMM can comprehensively consider the co-occurrence probability of the context, so that the completion result is more in line with expectations.
The HMM-based pinyin completion method makes full use of probability information in large-scale training data, such as occurrence probability and co-occurrence probability, and addresses the inaccurate completion and low efficiency that arise in the traditional method from considering only phrase frequency. Because various statistics are taken into account when scoring each candidate completion result, the results better match the user's expectations and are closer to the terminology of the training data than those of the traditional method; with medical texts as training data, the completion accuracy is therefore clearly better than that of the traditional method. In addition, because only the statistics between single-word pinyin are retained, the generated model file is much smaller than one storing phrase frequency information, which saves memory. During completion, the HMM-based method automatically discards lower-scoring partial candidates while calculating candidate scores, so it is more efficient than the traditional method.
Drawings
FIG. 1 is an observation sequence transfer trellis diagram;
fig. 2 is a path diagram.
Detailed Description
Definitions: HMM: Hidden Markov Model; Viterbi: the Viterbi algorithm, a dynamic programming algorithm.
In the embodiment, each technical scheme is a method and a device realized by software.
The main difference between the medical special input method based on the HMM and other common medical input methods is that the training data is subjected to statistical learning by adopting an HMM algorithm, and the pinyin is converted into Chinese characters based on statistical information.
The proposal of the scheme is based on the following background:
With the development of medical informatization and intelligent systems, doctors need to perform more and more text input in their work, for example medical records, doctor's orders and other medical documents, and must enter this text using some input method.
Currently, most doctors use one of two kinds of input method: general-purpose input methods, such as the Sogou input method and the Baidu input method, and lexicon-based medical input methods.
In the working process of doctors, a great amount of proprietary terms, nouns, proprietary knowledge and other contents are required to be input, and the existing general input method can consume a great amount of time in the process of manually selecting candidate entries, so that the input time is long, the efficiency is low, and mistakes are easy to occur. Although the medical input method based on the word stock can solve the input efficiency problem of proper nouns to a certain extent, as the word stock is mostly fixed phrase with limited length, when a doctor inputs longer content at one time, the problem of low efficiency caused by far difference between candidate terms and contexts and need to manually perform a large amount of selection operations is still faced.
At present, the common medical input method is mostly realized by adding a medical special dictionary through a specific method on the basis of a general input method, and the dictionary is generally stored in a local hard disk or a cloud server. The main process is as follows: firstly, acquiring a pinyin string input by a user; then, matching is carried out in a special word stock according to the input pinyin strings; if the phrase content conforming to the input pinyin string is matched in the medical special word stock, returning the phrase content as a candidate result with higher priority; continuing to use a general algorithm to convert the input pinyin strings and the candidate Chinese characters, wherein the result priority is lower than the retrieval result in the medical special word stock; and combining and sorting the two candidate results, and returning to the input method interface for the user to select.
In the common medical input methods above, other operations such as rule verification may be added to the basic pinyin-to-Chinese-character conversion flow, but the basic flow does not change. In the pinyin completion process of a common pinyin input method, the probability of outputting the expected result is very low when a long string is input, which affects doctors' working efficiency.
An HMM is a basic statistical model that introduces a set of hidden states into a standard Markov process, together with probabilistic relationships between the observation states and the hidden states; it describes a Markov process that contains hidden states.
A problem generally needs two main features for an HMM to be applicable:
1. the problem is sequence-based, such as a time sequence or a state sequence;
2. the problem involves two types of data: one type is observable, namely the observation sequence, and the other is unobservable, namely the hidden state sequence, or state sequence for short.
In the medical special input method based on the HMM, the Chinese character result obtained by pinyin conversion is regarded as the hidden state of the HMM, and the pinyin is regarded as the observation state.
In the HMM model used in the present invention, there are mainly the following parameters:
Hidden state set, denoted $Q = \{q_1, q_2, \ldots, q_N\}$, where $N$ is the number of possible hidden states, corresponding in the present invention to the number of all possible Chinese characters;
Observation state set, denoted $V = \{v_1, v_2, \ldots, v_M\}$, where $M$ is the number of possible observation states, corresponding in the present invention to the number of all possible pinyin;
Hidden state sequence of length $T$, denoted $I = \{i_1, i_2, \ldots, i_T\}$, corresponding in the present invention to the Chinese character sequence for the pinyin sequence input by the user;
Observation state sequence of length $T$, denoted $O = \{o_1, o_2, \ldots, o_T\}$, corresponding in the present invention to the pinyin sequence input by the user;
Hidden state transition probability distribution, denoted $A$, describing the probability of transitioning to state $q_j$ at time $t+1$ given state $q_i$ at time $t$; in the present invention it corresponds to the transition probability between adjacent Chinese characters in the medical documents;
Observation state emission probability distribution, denoted $B$, describing the probability of emitting observation state $v_k$ from hidden state $q_i$ at time $t$; in the present invention it corresponds to the emission probability from Chinese characters to pinyin in the medical documents;
Initial state probability, denoted $\pi$, describing the probability of being in a given hidden state at time $t = 1$; it corresponds to the occurrence probability of the first Chinese character among the Chinese characters corresponding to the input pinyin sequence.
Based on these parameters, the pinyin-string-to-Chinese-character conversion of the invention is the process of finding the most likely hidden state sequence for a given observation sequence.
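This search can be stated as an optimization problem. The expression below is the standard HMM decoding objective rather than a formula from the patent; $a_{ij}$ and $b_j(o_t)$ are notation assumed here for the entries of $A$ and $B$, and $\pi_{i_1}$ is the initial probability of the first hidden state.

```latex
I^{*} = \arg\max_{I} P(I \mid O)
      = \arg\max_{I} P(O \mid I)\, P(I)
      = \arg\max_{i_1, \ldots, i_T} \; \pi_{i_1}\, b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(o_t)
```

The second equality holds because $P(O)$ does not depend on $I$, and the third uses the Markov and output-independence assumptions of the HMM.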
Based on the background, the pinyin Chinese character conversion process of the invention is mainly divided into two parts of a training stage and a using stage.
The training phase is as follows:
The training method of the HMM input method model for medical documents comprises the following steps:
acquiring all legal pinyin strings as the observation state set of the HMM model, and all Chinese character results corresponding to the pinyin strings as the hidden state set of the HMM model;
splitting the training medical document content into single characters and counting the occurrence probability of each character as the initial state probability of the HMM model;
converting all Chinese characters in the training medical documents into their corresponding pinyin, and counting the pinyin corresponding to each Chinese character and the probability of each as the observation state emission probability of the HMM model;
counting the probability of each other Chinese character appearing after each Chinese character in the training medical documents as the hidden state transition probability of the HMM model.
In one embodiment, the training method is applied in a specific example as follows:
1. Acquire all legal pinyin strings (the observation state set) and all Chinese character results corresponding to them (the hidden state set).
For example, "ceng": "layer, rubbing cinquefoil ever" gives (in translation) all the Chinese characters corresponding to the pinyin string "ceng".
2. Split the training medical document content into single characters and count the occurrence probability of each character as the initial probability π.
For example, "one": 0.0040433090858546655 and "seven": 6.855110316782822e-06 give the initial probabilities of the characters "one" and "seven".
3. Convert all Chinese characters in the training medical documents into their corresponding pinyin, and count the pinyin corresponding to each Chinese character and the probability of each as the emission probability B.
For example, "and": {"ju": 0.0006393861892583121, "qie": 0.9993606138107417} gives the pinyin strings corresponding to the character "and" and their probabilities.
4. Count the probability of each other Chinese character appearing after each Chinese character in the training medical documents as the hidden state transition probability.
For example, "nose": {"sinus": 0.09741548503759896, "viscosity": 0.006450261318279049, "longitudinal": 2.2052175447107856e-05, } gives the probabilities that the character "nose" is followed by the characters "sinus", "viscosity" and "longitudinal".
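The counting in steps 2 to 4 can be sketched as follows. The three-sentence corpus with per-character pinyin annotations is invented for illustration, and `train` and `normalize` are hypothetical names; a real system would derive the annotations from a large medical corpus.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus: each sentence is a list of (character, pinyin).
CORPUS = [
    [("鼻", "bi"), ("窦", "dou"), ("炎", "yan")],
    [("鼻", "bi"), ("腔", "qiang")],
    [("胰", "yi"), ("腺", "xian"), ("炎", "yan")],
]


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train(corpus):
    char_counts = Counter()              # step 2: character occurrences
    emit_counts = defaultdict(Counter)   # step 3: character -> pinyin counts
    trans_counts = defaultdict(Counter)  # step 4: character -> next character counts
    for sentence in corpus:
        for char, pinyin in sentence:
            char_counts[char] += 1
            emit_counts[char][pinyin] += 1
        for (prev, _), (nxt, _) in zip(sentence, sentence[1:]):
            trans_counts[prev][nxt] += 1
    pi = normalize(char_counts)
    emission = {c: normalize(cnt) for c, cnt in emit_counts.items()}
    transition = {c: normalize(cnt) for c, cnt in trans_counts.items()}
    return pi, emission, transition


pi, emission, transition = train(CORPUS)
print(pi["鼻"], emission["炎"], transition["鼻"])
```

Following the text, the initial probability here is the overall occurrence probability of each character rather than its frequency in first position.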
Through the above process, the basic content of an HMM model is obtained and the model is saved. The HMM input method model comprises the following modules:
an observation state set module, which obtains all legal pinyin strings;
a hidden state set module, which obtains all Chinese character results corresponding to the pinyin strings;
an initial state probability distribution module, which splits the training medical document content into single characters and counts the probability distribution of each character;
an observation state emission probability distribution module, which converts all Chinese characters in the training medical documents into their corresponding pinyin and counts the pinyin corresponding to each Chinese character and the probability distribution of each;
a hidden state transition probability distribution module, which counts the probability distribution of the Chinese characters that appear after each Chinese character in the training medical documents.
The using stage is as follows:
The input method based on medical document content comprises the following steps:
acquiring a pinyin string, and segmenting and completing it;
taking the segmented and completed pinyin string as the observation state sequence of the HMM model and inputting it into the HMM input method model;
outputting Chinese character strings: the HMM input method model maps the observation state sequence to hidden state sequences, searches for the most likely corresponding hidden state sequence, converts the pinyin string into Chinese characters, and returns the several Chinese character strings with the highest probability.
In one embodiment, the application of the usage method in a specific example is as follows:
1. Acquire the pinyin string input by the user, and segment and complete it according to a specific method.
For example, if the user inputs "shuniaoguan", the result after segmentation and completion is "shu niao guan".
2. Using the segmented and completed pinyin string (the observation state sequence) together with the trained model, convert the pinyin into Chinese characters (the hidden state sequence).
In the invention, the conversion from pinyin to Chinese characters based on the trained HMM model is solved mainly with the Viterbi algorithm. The Viterbi algorithm is a general dynamic programming algorithm for finding the shortest path through a sequence. In short, starting from the initial state, each time a state transition is made, the maximum probability over all paths reaching each state at that moment is recorded and used as the basis for the next step; this continues until the end of the sequence, and the whole path is then traced back to give the required shortest path.
As shown in fig. 1, the grid represents the transitions of Chinese characters along the observation sequence (the pinyin sequence). For each intermediate and terminal state in the grid there is a most probable path reaching that state; for example, for the three states at time t=3, the most probable paths reaching them may be as shown in fig. 2. It follows that, during pinyin-to-Chinese-character conversion, each state at the terminal time has a local probability and a corresponding optimal path, so the globally optimal path, that is, the final Chinese character string, can be determined by selecting the state with the highest local probability at that time together with its optimal path. The resulting Chinese character string is returned; usually the several results with the highest probability are returned for the doctor to choose from.
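The procedure described above can be written compactly as a recursion. The following is the standard Viterbi recursion rather than text from the patent; $\delta_t(j)$, $\psi_t(j)$, $a_{ij}$ and $b_j(o_t)$ are notation assumed here for the best-path probability ending in state $q_j$ at time $t$, the corresponding backpointer, and the entries of $A$ and $B$.

```latex
\begin{aligned}
\delta_1(j) &= \pi_j \, b_j(o_1), \\
\delta_t(j) &= \max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(o_t), \qquad t = 2, \ldots, T, \\
\psi_t(j) &= \arg\max_{1 \le i \le N} \bigl[\delta_{t-1}(i)\, a_{ij}\bigr], \\
i_T^{*} &= \arg\max_{1 \le j \le N} \delta_T(j), \qquad
i_t^{*} = \psi_{t+1}\bigl(i_{t+1}^{*}\bigr), \quad t = T-1, \ldots, 1.
\end{aligned}
```

Returning several candidates instead of one amounts to keeping the states with the largest $\delta_T(j)$ at the terminal time and tracing back from each of them.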
The method is the complete training and using process of the medical input method based on the HMM.
According to the invention, through carrying out statistical learning on a large-scale medical document, probability information in the large-scale medical document is obtained, in the process of converting Chinese characters by pinyin, all possible Chinese character results are comprehensively considered by adopting an HMM algorithm, conversion results of longer pinyin strings are perfected by combining information such as transition probability among words, and finally, all possible Chinese character results are scored by combining the existing statistical information, and candidate results are screened and returned according to the scores.
Obtaining the HMM model by statistical learning avoids the problems of compiling a dictionary in methods based on a dedicated medical dictionary and requires essentially no extra labor cost. Because training is carried out on single characters, it also avoids dependence on word segmentation accuracy.
During conversion, the method of the invention uses the probability information in the model to calculate the probability and score of each possible Chinese character result and so obtain the candidate results, which avoids the dependence of lexicon-based methods on the size and coverage of the lexicon.
The HMM-based method also takes context information such as co-occurrence probability into account when calculating the probability and score of a Chinese character combination. For long input strings it can therefore give more reasonable candidates by using the context, which improves the accuracy of long-string input.
The HMM-based medical input method makes full use of the probability information present in large-scale training data and avoids extra labor cost. In pinyin-to-Chinese-character conversion, it avoids the strong dependence on the lexicon and the low accuracy of long-string input that are caused by methods based on a medical lexicon.
Because the model takes into account, during training, statistical information such as the correspondence between pinyin and Chinese characters and the co-occurrence relations between Chinese characters, the candidate results better match doctors' expectations and are closer to the terminology of the training data than those of a general input method with a dedicated lexicon. In experiments, the effect was clearly better than that of a traditional medical input method that uses a dedicated lexicon.
Based on these characteristics, the method improves the accuracy of the Chinese character results and, while saving labor cost, improves doctors' efficiency in entering medical documents.
The technical key point of the invention is that an HMM-based pinyin-to-Chinese-character conversion algorithm is used to implement the dedicated medical input method. In training, the probability information is acquired by statistical learning on large-scale medical documents; in actual use, the known probability information is considered together to calculate and compare the probabilities of the possible Chinese character combinations. Applying the mature HMM algorithm to statistical learning on medical documents replaces the traditional approach of adding a dedicated lexicon to a general input method. This saves labor cost, greatly improves the accuracy of entering medical terms, improves the efficiency of entering long strings in medical documents, and saves doctors' time in entering medical documents, thereby improving their working efficiency.
The above schemes describe in detail the training method, the input method model and the input method of the HMM input method model of the medical document.
In one scheme, an HMM-based pinyin completion method is provided for the step of cutting and completing the pinyin string that is described in the application stage of each embodiment (and that is likewise used when the invention is applied). Its main difference from the completion process of other common pinyin input methods is that statistical learning with an HMM is used to extract information from the training data.
The technical problem arises from the following background:
With the progress of hospital informatization, doctors inevitably need to enter various kinds of text, such as electronic medical records and examination reports, in the course of their work. The input method is the main entry point for interaction between doctors and computers, and its accuracy and applicability have a great influence on doctors' working efficiency.
At present, most doctors use a pinyin input method designed for the general domain. In addition, since most doctors type relatively slowly, they tend to rely on the input method to produce the expected Chinese character result from as little input as possible.
When a user enters partial pinyin, such as pinyin initials or a pinyin prefix, a reasonably complete pinyin input method should infer the complete pinyin according to a predetermined rule or algorithm and give the corresponding candidate Chinese character results.
A conventional pinyin input method performs pinyin completion on the basis of phrase frequency, and when the input is long it has difficulty giving an accurate candidate. For example, when "yxxt" is input, the completion obtained from phrase frequency may be "yi xia xi tong", so the top candidate may be "under system", which is obviously not the expected result. Such results not only reduce the efficiency with which doctors enter text but also increase the probability of entering incorrect information.
In a common pinyin input method, pinyin completion is based on frequency information. The main process is as follows: the pinyin strings in the training data and their frequencies are counted and stored in a suitable data structure such as a Trie; when the input is incomplete pinyin, prefix matching is performed on the stored information, the complete pinyin with the highest frequency among the matches is taken as the completion result, and the corresponding Chinese character candidates are given.
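The frequency-based prefix matching described above can be sketched as follows. This is a minimal illustration only: the class names `TrieNode` and `PinyinTrie`, the method names, and the frequencies are assumptions made for the example and are not taken from any particular input method.

```python
from __future__ import annotations


class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, TrieNode] = {}
        self.freq = 0  # greater than 0 only where a stored pinyin string ends


class PinyinTrie:
    """Hypothetical store of complete pinyin strings and their frequencies."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, pinyin: str, count: int = 1) -> None:
        node = self.root
        for ch in pinyin:
            node = node.children.setdefault(ch, TrieNode())
        node.freq += count

    def complete(self, prefix: str) -> str | None:
        """Return the most frequent stored pinyin that starts with prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return None
            node = node.children[ch]
        best, best_freq = None, 0
        stack = [(node, prefix)]
        while stack:  # walk the whole subtree below the prefix
            cur, text = stack.pop()
            if cur.freq > best_freq:
                best, best_freq = text, cur.freq
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return best


# Made-up frequencies, for illustration only.
trie = PinyinTrie()
for pinyin, count in [("xian", 5), ("xing", 9), ("xi", 2), ("tai", 4)]:
    trie.add(pinyin, count)
```

With these counts, `trie.complete("x")` returns the globally most frequent match regardless of what precedes or follows it, which is the limitation the later paragraphs discuss.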
In a specific implementation, the storage structure and the matching process are optimized according to different strategies; the process is described here in general terms, taking one method as an example. First, the training data are cleaned and segmented into words: if the original content is "pancreas morphology is normal", a possible segmentation result is "pancreas / morphology / normal". Then the pinyin of each phrase in the training data is obtained: if the original phrase is "pancreas", the corresponding pinyin is "yi xian". The frequency of every phrase is counted, and the "phrase + pinyin + frequency" information is stored in a specified data structure. These steps yield a phrase library based on the training data that contains the corresponding pinyin, frequency and other information. When "yxxt" is input, the input method splits it according to a specified pinyin composition rule into "y* x* x* t*", where "*" indicates that one or more arbitrary letters can be matched. Matching "y* x* x* t*" against the phrase library may return several combinations such as "yi xia xi teng" and "yi xian xing tai". The matched phrase pinyin combinations are sorted by frequency, the one with the highest frequency is selected as the completion result, and pinyin-to-Chinese-character conversion is carried out on that result. This is the general flow of pinyin completion in a conventional pinyin input method.
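A rough sketch of this conventional flow is given below, with each segmented input piece treated as the prefix of one syllable. The phrase library `PHRASE_FREQ`, its frequencies, and the function name `complete_by_frequency` are hypothetical; the sketch also lets a piece match itself (zero extra letters), which is an assumption so that fully typed syllables still match.

```python
import re

# Hypothetical phrase library: pinyin of a phrase -> frequency (made-up counts).
PHRASE_FREQ = {
    "yi xia xi teng": 120,
    "yi xian xing tai": 45,
    "zheng chang": 300,
}


def complete_by_frequency(parts, phrase_freq):
    """parts is the segmented input, for example ["y", "x", "x", "t"]."""
    # Each piece must be a prefix of the syllable in the same position.
    pattern = re.compile(" ".join(re.escape(p) + "[a-z]*" for p in parts))
    # Every stored phrase is scanned on every query.
    matches = [(freq, phrase) for phrase, freq in phrase_freq.items()
               if pattern.fullmatch(phrase)]
    # The highest-frequency phrase wins; context plays no role.
    return max(matches)[1] if matches else None
```

With these made-up counts, `complete_by_frequency(["y", "x", "x", "t"], PHRASE_FREQ)` picks "yi xia xi teng" simply because it is more frequent, even if "yi xian xing tai" is what a medical user intends.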
In the pinyin complement process of the common pinyin input method, the following problems mainly exist:
First, because a fuzzy-matching approach is used, every match has to traverse all possible completion results and combinations; when the input is short and the phrase library is large, matching efficiency suffers.
Second, because the method screens candidates by phrase frequency, the candidates depend entirely on the training data; if the training data contain word segmentation errors, or the phrase does not occur in the training data, the input cannot be completed well.
Third, when the incomplete pinyin string is long, it is difficult to match a suitable candidate directly in the phrase library unless the input string receives additional processing, and processing the input string before matching adds extra steps and reduces completion efficiency. Moreover, the selection of candidates depends on the frequency of each phrase and ignores the co-occurrence probability between phrases, which reduces the accuracy of completion for long pinyin strings.
An HMM is a basic statistical model that introduces a set of hidden states into a standard Markov process, together with probabilistic relationships between the observation states and the hidden states; it describes a Markov process that contains hidden states.
Against this background, in the pinyin completion process the incomplete pinyin strings to be completed are treated as observation states and the completed full pinyin results as hidden states, so that pinyin completion becomes the decoding problem of the HMM (given an observation sequence, find the most likely corresponding hidden state sequence).
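The decoding problem can be written compactly as follows. The notation is an assumption introduced here for illustration: $o_1, \ldots, o_T$ are the segmented to-be-completed strings, $s_1, \ldots, s_T$ are candidate complete pinyin strings, $\pi$ is the initial probability, $a$ the transition probability and $b$ the emission probability.

```latex
\hat{s}_{1:T} \;=\; \arg\max_{s_1,\ldots,s_T}\;
  \pi(s_1)\, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)
```

Each factor corresponds to one of the three probability tables obtained in the training stage, so the score of a complete completion result is the product of one initial term, $T-1$ transition terms and $T$ emission terms.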
The pinyin completion process of the invention is divided into two parts: a training stage and a use stage.
The main steps of the training stage are as follows:
In one embodiment, the training method for the pinyin completion model comprises the following steps:
S1, acquiring legal complete single-word pinyin strings;
S2, obtaining the possible to-be-completed pinyin strings corresponding to the legal complete single-word pinyin strings;
S3, obtaining the correspondence between the to-be-completed pinyin strings and the complete pinyin strings;
S4, acquiring the complete pinyin content corresponding to the training data, wherein the pinyin content is separated by single words;
S5, carrying out statistical learning on the training data to obtain the initial probability, emission probability and transition probability.
In one embodiment, the training method is applied in a specific example as follows:
1. Acquiring all legal complete single-word pinyin strings;
All legal complete single-word pinyin strings can be obtained from the pinyin composition rules. For example, "yi" and "xin" are legal complete single-word pinyin strings, while "y" and "bia" are incomplete pinyin strings.
2. Acquiring all possible to-be-completed pinyin strings corresponding to each legal complete single-word pinyin string;
In the invention, the to-be-completed pinyin strings of a single-word pinyin are all of its prefixes that remain after the prefixes that are themselves legal pinyin have been removed, plus the single-word pinyin itself. For example, the prefixes of the single-word pinyin "xian" are "x", "xi" and "xia", of which "xi" and "xia" are legal pinyin strings; the to-be-completed pinyin strings of "xian" are therefore "x" and "xian", that is, the result "xian" may be obtained when the input method receives either "x" or "xian".
3. Acquiring the correspondence between to-be-completed pinyin strings and complete pinyin strings;
Based on the to-be-completed pinyin strings obtained in the previous step, the correspondence between every to-be-completed pinyin string and the complete pinyin strings it can be completed to is compiled. For example, "bia": "bian biao" means that "bian" and "biao" are the possible complete pinyin strings for the to-be-completed string "bia".
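Steps 2 and 3 can be sketched as below. The set `LEGAL` is a deliberately tiny, hypothetical subset of the legal single-word pinyin, and the function names are assumptions for illustration; a real model would enumerate all legal syllables.

```python
from collections import defaultdict


def to_be_completed_forms(syllable, legal):
    """Proper prefixes that are not themselves legal pinyin, plus the syllable itself."""
    prefixes = [syllable[:i] for i in range(1, len(syllable))]
    return [p for p in prefixes if p not in legal] + [syllable]


def build_completion_table(legal):
    """Map each to-be-completed string to the complete pinyin it may stand for."""
    table = defaultdict(set)
    for syllable in legal:
        for form in to_be_completed_forms(syllable, legal):
            table[form].add(syllable)
    return {form: sorted(targets) for form, targets in table.items()}


# Hypothetical subset of the legal single-word pinyin.
LEGAL = {"xi", "xia", "xian", "ba", "bao", "bi", "bian", "biao"}
TABLE = build_completion_table(LEGAL)
```

Under this rule `to_be_completed_forms("xian", LEGAL)` gives "x" and "xian", and `TABLE["bia"]` lists "bian" and "biao", in line with the examples in the text. Note that a legal prefix such as "bi" maps only to itself, so typing "bi" is never expanded to "bian" under this reading of the rule.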
4. Acquiring the complete pinyin content corresponding to the training data (namely the existing medical text);
The complete pinyin content corresponding to the training data is obtained with a Chinese-character-to-pinyin conversion tool, with the pinyin separated by single words. For the text "pancreas morphology is normal", the corresponding pinyin content is "yi xian xing tai zheng chang".
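As a toy illustration of this step, the sketch below uses a hand-written lookup table instead of a real conversion tool. The table `CHAR_TO_PINYIN`, the function name, and the Chinese example sentence (a back-translation of "pancreas morphology is normal") are all assumptions; a production tool would also have to resolve polyphonic characters.

```python
# Hypothetical single-reading lookup table covering only the example sentence.
CHAR_TO_PINYIN = {
    "胰": "yi", "腺": "xian", "形": "xing",
    "态": "tai", "正": "zheng", "常": "chang",
}


def text_to_pinyin(text, table=CHAR_TO_PINYIN):
    """Return one pinyin string per Chinese character, skipping unknown symbols."""
    return [table[ch] for ch in text if ch in table]


sentence = text_to_pinyin("胰腺形态正常。")
```

The result is a list with one complete pinyin per character, which is the per-sentence format assumed as input for the statistical learning step.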
5. Carrying out statistical learning on the training data to obtain the following:
Initial probability: the probability that a complete pinyin A appears in the data. For example, "bing": 0.0032817207121452023 indicates that the probability of occurrence of the pinyin "bing" is 0.0032817207121452023.
Emission probability: the probability that a complete pinyin string is completed from a given to-be-completed pinyin string, among all to-be-completed pinyin strings corresponding to that complete pinyin. For example, "bao": {"b": 0.004975124378109453, "bao": 0.9950248756218906} means that the probability of "bao" being completed from "b" is 0.004975124378109453 and the probability of "bao" being completed from "bao" is 0.9950248756218906.
Transition probability: the probability that complete pinyin A is followed by complete pinyin B. For example, "yi": {"an": 3.47512365946466e-07, "chang": 0.009643713457860983} indicates that the probability of "yi" being followed by the pinyin string "an" is 3.47512365946466e-07, and the probability of "yi" being followed by the pinyin string "chang" is 0.009643713457860983.
This completes the training process, which yields the correspondence table between to-be-completed strings and complete strings (step 3) and the initial probability, emission probability and transition probability (step 5).
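The statistical learning step might look like the sketch below. The text does not say how the emission probabilities are estimated from data that contains only complete pinyin, so the weighting used here is an assumption: the full form gets a weight equal to its corpus count and each abbreviated form gets a weight of 1. The function name `train` and its arguments are likewise hypothetical.

```python
from collections import Counter, defaultdict


def train(sentences, table):
    """sentences: lists of complete pinyin; table: to-be-completed form -> complete pinyin list."""
    unigram, bigram = Counter(), defaultdict(Counter)
    for sent in sentences:
        unigram.update(sent)
        for prev, nxt in zip(sent, sent[1:]):
            bigram[prev][nxt] += 1

    # Initial probability: relative frequency of each complete pinyin.
    total = sum(unigram.values())
    initial = {s: c / total for s, c in unigram.items()}

    # Transition probability: relative frequency of B directly after A.
    transition = {a: {b: c / sum(row.values()) for b, c in row.items()}
                  for a, row in bigram.items()}

    # Emission probability (assumed rule): full form weight = count, abbreviations weight = 1.
    forms = defaultdict(list)
    for form, targets in table.items():
        for s in targets:
            forms[s].append(form)
    emission = {}
    for s, count in unigram.items():
        weights = {f: (count if f == s else 1) for f in forms.get(s, [s])}
        norm = sum(weights.values())
        emission[s] = {f: w / norm for f, w in weights.items()}
    return initial, emission, transition
```

Under this assumed rule, a pinyin that occurs 200 times and has a single abbreviated form receives 1/201 and 200/201, which happens to reproduce the two values shown for "bao"; whether the invention uses this rule is not stated.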
Through the above process, the basic content of a pinyin completion model is obtained and the model is saved. A pinyin completion model comprises:
a pinyin string acquisition module, which holds the legal complete single-word pinyin strings;
a to-be-completed pinyin string module, which holds the possible to-be-completed pinyin strings corresponding to the legal complete single-word pinyin strings;
a correspondence module, which establishes the correspondence between the to-be-completed pinyin strings and the complete pinyin strings;
a separation module, which obtains the complete pinyin content corresponding to the training data, separated by single words;
a statistical learning module, which performs statistical learning on the training data to obtain the initial probability, emission probability and transition probability.
The main steps of the use stage are as follows:
The HMM-based pinyin completion method comprises the following step:
performing pinyin completion on the segmented pinyin strings according to the pinyin completion model:
calculating the probability of occurrence of every possible completion result of each pinyin string, and the probability that each completion result of one pinyin string occurs after each completion result of the preceding pinyin string; calculating the score of each complete completion result; and taking the one with the highest score as the final completion result.
1. Acquiring the user's input string and segmenting it according to the composition rules of pinyin;
If the user inputs "yxxt", the result "y x x t" is obtained after segmentation.
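The text does not spell out the segmentation algorithm. One simple possibility, shown purely as an assumption, is a greedy longest-match split against the set of known complete and to-be-completed strings. The set `FORMS` and the function name `segment` are hypothetical.

```python
def segment(text, forms):
    """Greedy longest-match split of raw input into known pinyin forms."""
    longest = max(map(len, forms))
    parts, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in forms:
                parts.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot segment input at position {i}: {text[i:]!r}")
    return parts


# Hypothetical set of complete and to-be-completed strings.
FORMS = {"y", "yi", "x", "xian", "xing", "t", "tai"}
```

Greedy matching is only a sketch: ambiguous input (for example a string that can be read as one long syllable or two short ones) may need several candidate segmentations to be scored, which this example does not attempt.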
2. Completing the pinyin using the segmented input string together with the trained model;
Using the trained model and the segmented input string, such as "y x x t", the probability of occurrence of every possible completion result of "y", "x", "x" and "t" is calculated, together with the probability that each completion result of "x" occurs after each completion result of "y", and so on; the score of each complete completion result is then calculated, and the one with the highest score is taken as the final completion result.
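HMM decoding of this kind is commonly done with the Viterbi algorithm; the text does not name the algorithm, so the sketch below is one plausible realisation rather than the invention's implementation. All probabilities in the toy model are made up, and the `floor` value used for unseen events is an assumption.

```python
import math


def viterbi(parts, table, initial, emission, transition, floor=1e-12):
    """Return the highest-scoring complete pinyin sequence for the segmented input."""
    def log(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    first = parts[0]
    # best[s] = (log score of the best path ending in s, that path)
    best = {s: (log(initial.get(s, 0.0)) + log(emission[s].get(first, 0.0)), [s])
            for s in table[first]}
    for part in parts[1:]:
        step = {}
        for s in table[part]:
            prev = max(best, key=lambda p: best[p][0] + log(transition.get(p, {}).get(s, 0.0)))
            score = (best[prev][0]
                     + log(transition.get(prev, {}).get(s, 0.0))
                     + log(emission[s].get(part, 0.0)))
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Toy model with made-up numbers.
TABLE = {"y": ["yi"], "x": ["xi", "xia", "xian", "xing"], "t": ["tai", "tong"]}
INITIAL = {"yi": 0.3}
EMISSION = {s: {s[0]: 0.01, s: 0.99}
            for s in ("yi", "xi", "xia", "xian", "xing", "tai", "tong")}
TRANSITION = {"yi": {"xian": 0.6, "xia": 0.3}, "xian": {"xing": 0.5},
              "xia": {"xi": 0.2}, "xing": {"tai": 0.7}, "xi": {"tong": 0.6}}

result = viterbi(["y", "x", "x", "t"], TABLE, INITIAL, EMISSION, TRANSITION)
```

Because only the best path into each candidate is kept at every position, the work grows with the number of candidates per position rather than with the number of complete combinations.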
3. Carrying out the subsequent pinyin-to-Chinese-character process according to the completion result.
The invention learns the statistical probability information in large-scale training data. In the pinyin completion process, an HMM algorithm is used to take into account the probability information of all possible completion results, the overall score of each result is calculated in combination with the context, and the candidate results are finally screened by score.
Because the search for candidates can compute scores with a corresponding algorithm based on the pre-trained model information, traversing all completion results is avoided and matching efficiency is improved. In addition, the score combines information such as the occurrence probability of single words and the co-occurrence probability of the context, so a completion result can still be given when the training data contain word segmentation errors. When the incomplete pinyin string is long, the HMM-based completion method takes the co-occurrence probability of the context into account, so the completion result better matches expectations.
The HMM-based pinyin completion method makes full use of probability information, such as occurrence probability and co-occurrence probability, that is present in large-scale training data, and addresses the problems of inaccurate and inefficient completion that arise when the traditional method considers only phrase frequency. Because various statistical information is considered when scoring each candidate completion result, the results better match the user's expectations and are closer to the terminology of the training data than those of the traditional method; with medical text as training data, completion accuracy is therefore clearly better than that of the traditional method.
In addition, because only the statistical information between single-word pinyin is retained, the generated model file is much smaller than one that stores phrase frequency information, which saves storage space.
In the completion process, the HMM-based completion method provided herein automatically discards partial candidate results with lower scores while calculating candidate scores, so it is more efficient than the traditional method.
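How the low-scoring partial candidates are discarded is not specified. One way to illustrate the idea, offered only as an assumption, is a beam search that keeps a fixed number of partial completions after each input piece. The names `beam_complete`, `toy_step` and `beam_width`, and all numbers, are hypothetical.

```python
import heapq
import math


def beam_complete(parts, table, score_step, beam_width=3):
    """Extend partial completions piece by piece, keeping only the best beam_width."""
    beam = [(0.0, [])]  # (log score, partial path)
    for part in parts:
        expanded = [(score + score_step(path[-1] if path else None, cand, part), path + [cand])
                    for score, path in beam
                    for cand in table[part]]
        # Lower-scoring partial candidates are dropped here.
        beam = heapq.nlargest(beam_width, expanded, key=lambda item: item[0])
    return beam


# Toy scoring function with made-up transition probabilities.
TOY_TRANSITION = {("yi", "xian"): 0.6, ("yi", "xia"): 0.3,
                  ("xian", "xing"): 0.5, ("xia", "xi"): 0.2}


def toy_step(prev, cand, part):
    p = 0.5 if prev is None else TOY_TRANSITION.get((prev, cand), 1e-6)
    return math.log(p)


BEAM_TABLE = {"y": ["yi"], "x": ["xi", "xia", "xian", "xing"]}
beam = beam_complete(["y", "x", "x"], BEAM_TABLE, toy_step, beam_width=2)
```

A narrow beam trades exactness for speed: unlike the per-state maximisation of exact HMM decoding, it can discard a path that would later have become the best one.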
The technical key point of the invention is that an HMM-based completion algorithm is used in the pinyin completion process of a pinyin input method. In training, the probability information is obtained by statistical learning on large-scale medical training data; in actual completion, the known probability information is considered together, the score of each possible candidate completion result is calculated and compared, and the result with the highest score is taken as the completion result. Applying the HMM method to pinyin completion improves completion efficiency, greatly improves the accuracy of the completion result, makes the result conform more closely to the terminology of the training data, and improves the user experience.
In one embodiment, the invention discloses an HMM-based pinyin completion input method. It differs from existing input methods in that the input string is completed according to weights, the completion result is converted from pinyin to Chinese characters, and the Chinese character string is output.
While the invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. The training method for pinyin completion based on the HMM is characterized by comprising the following steps of:
S1, acquiring legal complete single-word pinyin strings;
S2, obtaining the possible to-be-completed pinyin strings corresponding to the legal complete single-word pinyin strings;
S3, obtaining the correspondence between the to-be-completed pinyin strings and the complete pinyin strings;
S4, acquiring the complete pinyin content corresponding to the training data, wherein the pinyin content is separated by single words;
S5, carrying out statistical learning on the training data to obtain the initial probability, emission probability and transition probability;
the step S1 is as follows: obtaining the legal complete single-word pinyin strings according to the pinyin composition rules;
the step S2 is as follows: taking all prefixes of the single-word pinyin, removing those prefixes that are legal pinyin, and adding the single-word pinyin itself, the result being the possible to-be-completed pinyin strings corresponding to the single-word pinyin;
the step S3 is as follows: compiling, from the to-be-completed pinyin strings, the correspondence between every to-be-completed pinyin string and the complete pinyin strings it can be completed to;
the pinyin is regarded as a hidden state in the HMM, and a character result obtained by pinyin conversion is regarded as an observation state;
in the HMM model, the following parameters are included:
a hidden state set, represented by $Q = \{q_1, q_2, \ldots, q_N\}$, where $N$ is the number of possible hidden states, corresponding to the number of states of all possible Chinese characters;
an observation state set, represented by $V = \{v_1, v_2, \ldots, v_M\}$, where $M$ is the number of possible observation states, corresponding to the number of states of all possible pinyins;
a hidden state sequence of length $T$, represented by $I = \{i_1, i_2, \ldots, i_T\}$, corresponding to the Chinese character sequence for the pinyin sequence input by the user;
an observation state sequence of length $T$, represented by $O = \{o_1, o_2, \ldots, o_T\}$, corresponding to the pinyin sequence input by the user;
a hidden state transition probability distribution, denoted by $A$, describing the probability of transitioning to state $q_j$ at time $t+1$ given state $q_i$ at time $t$, corresponding to the transition probability between adjacent Chinese characters in the medical document;
an observation state emission probability distribution, denoted by $B$, describing the probability of emitting observation state $v_k$ from hidden state $q_i$ at time $t$, corresponding to the emission probability from Chinese characters to pinyin in the medical document;
an initial state probability, denoted by $\pi$, describing the probability of being in a given hidden state at time $t = 1$, corresponding to the occurrence probability of the first Chinese character among the Chinese characters corresponding to the input pinyin sequence;
based on the parameters, the related pinyin string-to-Chinese character process is a process of searching the most likely corresponding hidden state sequence for a given observation sequence;
the pinyin Chinese character conversion process is divided into a training stage and a using stage;
the training phase is as follows:
the training method of the HMM input method model of the medical document comprises the following steps:
acquiring all legal pinyin strings as an observation state set of the HMM model, and acquiring all Chinese character results corresponding to the pinyin strings as a hidden state set of the HMM model;
the training medical document content is divided into words according to the single words, and the occurrence probability of each word is counted to be used as the initial state probability of the HMM model;
converting all Chinese characters in the medical document for training into corresponding pinyin, and counting the Chinese characters corresponding to each pinyin and the occurrence probability of each Chinese character as the observation state emission probability of the HMM model;
the probability of other Chinese characters appearing behind each Chinese character in the medical document for training is counted and used as the hidden state transition probability of the HMM model;
the HMM input method model comprises the following steps:
the observation state collection module is used for obtaining all legal pinyin strings,
the hidden state collection module of the HMM model is used for obtaining all Chinese character results corresponding to the pinyin strings,
the initial state probability distribution module divides the training medical document content into words according to single characters and counts the probability distribution of each word,
the observation state emission probability distribution module is used for converting all Chinese characters in the medical document for training into corresponding pinyin, and counting the Chinese characters corresponding to each pinyin and the probability distribution of each occurrence;
the hidden state transition probability distribution module is used for counting probability distribution of other Chinese characters appearing behind each Chinese character in the training medical document;
the using stage is as follows:
the medical document content-based input method comprises the following steps:
acquiring a pinyin string, and cutting and completing the pinyin string;
cutting the completed pinyin strings as an observation state sequence of the HMM model, and inputting the HMM input method model;
the method comprises the steps of outputting Chinese character strings, transferring an observation state sequence into a hidden state sequence by an HMM input method model, searching the most likely corresponding hidden state sequence, converting the pinyin strings into Chinese characters, and returning the first Chinese character strings with the maximum probability.
CN201911265387.2A 2019-12-11 2019-12-11 Pinyin completion training method, completion model, completion method and completion input method based on HMM Active CN111144096B (en)


Publications (2)

Publication Number Publication Date
CN111144096A CN111144096A (en) 2020-05-12
CN111144096B true CN111144096B (en) 2023-09-29

Family

ID=70518009


Country Status (1)

Country Link
CN (1) CN111144096B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant