CN111144096A - HMM-based pinyin completion training method, completion model, completion method and completion input method

Publication number
CN111144096A
Authority
CN
China
Prior art keywords
pinyin
completion
string
probability
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911265387.2A
Other languages
Chinese (zh)
Other versions
CN111144096B (en
Inventor
王兴维
邰从越
刘龙
史黎鑫
刘慧芳
王慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Senyint International Digital Medical System Dalian Co ltd
Original Assignee
Senyint International Digital Medical System Dalian Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Senyint International Digital Medical System Dalian Co ltd
Priority to CN201911265387.2A
Publication of CN111144096A
Application granted
Publication of CN111144096B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An HMM (hidden Markov model) based pinyin completion training method, completion model, completion method and completion input method, belonging to the field of medical information and aiming to improve completion speed and accuracy. The training method comprises the following steps: S1, obtaining the legal complete single-character pinyin strings; S2, obtaining the possible pinyin strings to be completed corresponding to each legal complete single-character pinyin string; S3, obtaining the correspondence between the pinyin strings to be completed and the complete pinyin strings; S4, obtaining the complete pinyin content corresponding to the training data, with the pinyin separated character by character; S5, performing statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability. The effect is that traversing all completion results is avoided, which improves matching efficiency.

Description

HMM-based pinyin completion training method, completion model, completion method and completion input method
Technical Field
The invention belongs to the field of medical information, and relates to a pinyin completion training method, a completion model, a completion method and a completion input method based on an HMM (hidden Markov model).
Background
With the progress of information-based construction of hospitals, doctors inevitably need to input various text contents such as electronic medical records, examination reports and the like in the office process. The input method is used as a main entrance for interaction between a doctor and a computer, and the accuracy and the applicability of the input method have great influence on the working efficiency of the doctor.
At present, most doctors use a pinyin input method designed for the general domain. Because most doctors type relatively slowly, they tend to enter as little as possible to obtain the intended Chinese characters.
A well-designed pinyin input method should therefore, when the user enters only part of the pinyin (such as the first letter or a prefix of each syllable), infer the complete pinyin according to an established rule or algorithm and offer the corresponding Chinese character candidates.
When a common pinyin input method performs pinyin completion, it relies mostly on phrase frequency, and when the input is long it is difficult to give an accurate candidate. For example, when "yxxt" is input, the completion obtained from phrase frequency may be "yi xia xi tong", so the top candidate offered is the Chinese phrase with that reading, which is obviously not the expected result. This reduces doctors' text entry efficiency and increases the probability of entering wrong information.
In a common pinyin input method, pinyin completion is based on frequency information. The main process is as follows: the pinyin strings in the training data and their frequencies are counted and stored in a suitable data structure, such as a Trie; when the input is incomplete pinyin, prefix matching is performed against the stored information, the complete pinyin with the highest frequency among the matches is taken as the completion result, and the corresponding Chinese character candidates are given.
In a specific implementation, the data storage structure and the matching process may be optimized according to different strategies; the process is outlined below using one such method as an example. First, the training data are cleaned and segmented: if the original content is "normal pancreas form", a possible segmentation result is "normal / pancreas / form". Then the pinyin corresponding to each phrase in the training data is obtained: if the phrase is "pancreas", the corresponding pinyin is "yi xian". The frequencies of all phrases are counted, and the phrases, their pinyin and their frequencies are stored in a specified data structure. These steps yield a phrase library based on the training data, containing the corresponding pinyin, frequency and other information. When "yxxt" is input, the input method segments it according to the pinyin composition rules into a pattern in which each of the letters "y", "x", "x" and "t" may be followed by one or more arbitrary letters. This pattern is matched against the phrase library and may match several combinations, such as "yi xia xi tong" and "yi xian xing tai". The matched pinyin combinations are sorted by frequency, the one with the highest frequency is selected as the completion result, and pinyin-to-Chinese-character conversion is performed on that result. This is, in outline, the pinyin completion process of a conventional pinyin input method.
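The conventional frequency-based completion outlined above can be sketched roughly as follows. This is only an illustration under stated assumptions: the phrase library `PHRASE_FREQ` and its counts are made up, the function names are hypothetical, a flat dictionary stands in for the Trie, and the wildcard is written as "zero or more letters" so that one-letter syllables also match.

```python
import re

# Hypothetical phrase library: phrase pinyin -> frequency (made-up counts).
PHRASE_FREQ = {
    "yi xia xi tong": 120,
    "yi xian xing tai": 45,
    "yi xue xi tong": 30,
}

def to_pattern(letters):
    """Turn an input such as 'yxxt' into a regex with one wildcard per letter."""
    parts = [re.escape(ch) + "[a-z]*" for ch in letters]
    return re.compile("^" + " ".join(parts) + "$")

def complete_by_frequency(letters, phrase_freq):
    """Scan the whole library and return the most frequent matching phrase pinyin."""
    pattern = to_pattern(letters)
    matches = [pinyin for pinyin in phrase_freq if pattern.match(pinyin)]
    if not matches:
        return None
    return max(matches, key=phrase_freq.get)
```

Note that every lookup scans all entries and that the winner is decided by phrase frequency alone, which are the two weaknesses the following paragraphs discuss.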
In the pinyin completion process of the common pinyin input method, the following problems mainly exist:
first, because a method similar to fuzzy matching is adopted, all possible completion results and their combinations must be traversed in every match; when the input is short and the phrase library is large, matching efficiency suffers;
second, because the method screens candidates by phrase frequency, the candidates depend entirely on the training data; if the training data contain word segmentation errors, or a phrase does not occur in the training data, the method cannot complete the input well;
third, when the incomplete pinyin string is long, it is difficult to match a suitable candidate in the phrase library directly unless the input string is additionally processed, and such processing adds an extra step that reduces completion efficiency; moreover, because candidate selection depends on the frequency of each phrase and does not consider the co-occurrence probability between phrases, the accuracy of the completion result for long pinyin strings is affected.
Disclosure of Invention
To improve completion speed and accuracy, the invention provides an HMM (Hidden Markov Model) based pinyin completion method. Statistical learning is performed on a large number of medical texts to obtain statistical information such as the pronunciation probability of polyphonic characters in specific contexts and the probability of characters combining into words; pinyin completion is then performed on the basis of this statistical information, which improves completion speed and greatly improves the accuracy of the completion results.
To achieve this purpose, the technical scheme of the invention is as follows: an HMM-based pinyin completion training method comprises the following steps:
S1, obtaining the legal complete single-character pinyin strings;
S2, obtaining the possible pinyin strings to be completed corresponding to each legal complete single-character pinyin string;
S3, obtaining the correspondence between the pinyin strings to be completed and the complete pinyin strings;
S4, obtaining the complete pinyin content corresponding to the training data, with the pinyin separated character by character;
S5, performing statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
Further:
step S1 is: obtaining the legal complete single-character pinyin strings according to the pinyin composition rules;
step S2 is: for each single-character pinyin, taking all of its prefixes, removing those that are themselves legal pinyin, and adding the single-character pinyin itself; the result is the set of possible pinyin strings to be completed corresponding to that single-character pinyin;
step S3 is: collating all pinyin strings to be completed and obtaining, for each of them, its correspondence to the complete pinyin strings into which it can be completed.
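Steps S1 to S3 might be sketched as below. The sketch assumes one reading of S2 (the prefixes of a pinyin that are not themselves legal pinyin, plus the pinyin itself). `LEGAL_PINYIN` is a tiny hypothetical sample rather than the full table that S1 would produce, and the function names are illustrative.

```python
from collections import defaultdict

# S1 (assumed): a small sample of legal single-character pinyin; a real table
# would be generated from the pinyin composition rules.
LEGAL_PINYIN = {"xi", "xia", "xian", "xing", "yi", "tai", "tong"}

def strings_to_complete(pinyin, legal):
    """S2: prefixes of `pinyin` that are not themselves legal, plus `pinyin`."""
    prefixes = [pinyin[:i] for i in range(1, len(pinyin))]
    return [p for p in prefixes if p not in legal] + [pinyin]

def build_completion_map(legal):
    """S3: map each string to be completed to the complete pinyin it may become."""
    mapping = defaultdict(set)
    for full in legal:
        for partial in strings_to_complete(full, legal):
            mapping[partial].add(full)
    return {partial: sorted(fulls) for partial, fulls in mapping.items()}
```

With this sample, "x" maps to every pinyin beginning with "x", while "xi" maps only to itself because it is already a legal pinyin.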
The invention also relates to a pinyin completion model obtained by the training method.
The invention also relates to a pinyin completion model, which comprises:
a pinyin string acquisition module, which holds the legal complete single-character pinyin strings;
a to-be-completed pinyin string module, which holds the possible pinyin strings to be completed corresponding to each legal complete single-character pinyin string;
a correspondence module, used for establishing the correspondence between the pinyin strings to be completed and the complete pinyin strings;
a separation module, used for separating the complete pinyin content corresponding to the training data character by character;
and a statistical learning module, used for performing statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
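The statistical learning performed on the separated pinyin (S4 and S5) could look roughly like the sketch below. The text does not say how the emission probability of the completion model is estimated, so the uniform emission over each pinyin's possible abbreviations is purely an assumption, as are the function names and the choice to count the first syllable of each sentence for the initial probability.

```python
from collections import Counter, defaultdict

def normalize(counter):
    """Convert raw counts into a probability distribution."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def train_completion_hmm(sentences, completion_map):
    """Estimate (initial, transition, emission) tables.

    sentences: lists of complete single-character pinyin (the S4 output).
    completion_map: string to be completed -> list of complete pinyin (S3).
    """
    initial = Counter()
    transition = defaultdict(Counter)
    for syllables in sentences:
        if not syllables:
            continue
        initial[syllables[0]] += 1  # assumption: count sentence-initial syllables
        for prev, nxt in zip(syllables, syllables[1:]):
            transition[prev][nxt] += 1

    # Assumption: a complete pinyin emits each of its abbreviations uniformly.
    forms = defaultdict(list)
    for partial, fulls in completion_map.items():
        for full in fulls:
            forms[full].append(partial)
    emission = {full: {p: 1 / len(ps) for p in ps} for full, ps in forms.items()}

    return (
        normalize(initial),
        {prev: normalize(counts) for prev, counts in transition.items()},
        emission,
    )
```

Only statistics between single-character pinyin are stored, which is consistent with the small model size claimed later in the text.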
The invention also relates to an HMM-based pinyin completion method, which comprises the following steps:
acquiring the user's input string and segmenting it according to the pinyin composition rules;
performing pinyin completion on the segmented input string with the pinyin completion model.
Further, the pinyin completion proceeds as follows: the probabilities of all possible completion results are calculated for each segment of the segmented input string; the score of each complete completion result is calculated from the probability that the completion result of one segment follows the completion result of the preceding segment; and finally the item with the highest score is taken as the final completion result.
The invention also relates to an HMM-based pinyin completion input method, in which the input string is completed according to the above completion method, the completion result is converted from pinyin to Chinese characters, and the Chinese character string is output.
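A minimal sketch of this scoring step is given below, assuming a Viterbi-style search in log space where the hidden states are complete pinyin and the observations are the segments of the input. The function name, the table layout and the `floor` used for unseen probabilities are assumptions, not details given in the text.

```python
import math

def viterbi_complete(segments, candidates, initial, transition, emission, floor=1e-12):
    """Return the highest-scoring sequence of complete pinyin for `segments`.

    segments: non-empty segmented input, e.g. ["y", "x", "x", "t"].
    candidates: string to be completed -> list of complete pinyin.
    Probabilities missing from a table fall back to `floor` (assumed smoothing).
    """
    def logp(p):
        return math.log(max(p, floor))

    # best maps each complete pinyin s to (log score, best path ending in s).
    first = segments[0]
    best = {
        s: (logp(initial.get(s, 0.0)) + logp(emission.get(s, {}).get(first, 0.0)), [s])
        for s in candidates[first]
    }
    for seg in segments[1:]:
        step = {}
        for s in candidates[seg]:
            # Keep only the best predecessor of s rather than every full path.
            score, path = max(
                ((prev_score + logp(transition.get(prev, {}).get(s, 0.0)), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda pair: pair[0],
            )
            step[s] = (score + logp(emission.get(s, {}).get(seg, 0.0)), path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]
```

Because each step keeps one best path per candidate, the work grows with the number of candidates per segment rather than with the number of complete combinations.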
Advantageous effects: the method learns statistical probability information from large-scale training data. During pinyin completion, the HMM algorithm comprehensively considers the probability information of all possible completion results, calculates a comprehensive score for each result in combination with the context, and finally screens the candidates by score. When searching for candidates, the score can be calculated with a corresponding algorithm from the pre-trained model information, which avoids traversing all completion results and improves matching efficiency. Because the score combines information such as the occurrence probability of single characters and the co-occurrence probability of the context, a completion result can still be given when the training data contain word segmentation errors. When the incomplete pinyin string is long, the HMM-based completion method takes the co-occurrence probability of the context into account, so the completion result better matches expectations.
The HMM-based pinyin completion method makes full use of the probability information in large-scale training data, such as occurrence probability and co-occurrence probability, and solves the problems of inaccurate completion and low efficiency that arise when the traditional method considers only phrase frequency. Because various statistical information is considered when scoring each candidate completion, the results better match user expectations and are closer to the wording of the training data than those of the traditional method; when medical texts are used as training data, the completion accuracy is clearly better than that of the traditional method. In addition, the method keeps only the statistical information between single-character pinyin, so the generated model file is much smaller than phrase frequency information, which saves memory. During completion, the method automatically discards the lower-scoring partial candidates while calculating each candidate's score, and is therefore more efficient than the conventional method.
Drawings
FIG. 1 is a grid diagram of observed sequence transitions;
FIG. 2 is a path diagram.
Detailed Description
Definitions: HMM: Hidden Markov Model; Viterbi: the Viterbi algorithm, a dynamic programming algorithm.
In the embodiment, each technical scheme is a method and a device realized by software.
The embodiment provides a medical special input method based on an HMM, which is mainly different from other common medical input methods in that an HMM algorithm is adopted for statistical learning of training data, and pinyin is converted into Chinese characters based on statistical information.
The scheme is proposed against the following background:
With the development of medical informatization and intelligent systems, doctors need to perform more and more text input in their work, for medical documents such as medical records and medical orders, and they need an input method to enter this text.
Most doctors currently use one of two kinds of input method: general input methods, such as the Sogou input method and the Baidu input method, and lexicon-based medical input methods.
Because a large amount of special terms, nouns, special knowledge and other contents need to be input in the working process of a doctor, a large amount of time is consumed in the process of manually selecting candidate entries by using the conventional universal input method, so that the input time is long, the efficiency is low, and errors are easy to make. Although the medical input method based on the word stock can solve the problem of input efficiency of proper nouns to a certain extent, the word stock is mostly fixed phrases with limited length, so that a doctor still has the problem of low efficiency because candidate entries are far different from the context when inputting long contents at one time and a large amount of manual selection operations are needed.
At present, most of common medical input methods are realized by adding a medical special dictionary through a specific method on the basis of a general input method, and the dictionary is generally stored in a local hard disk or a cloud server. The main process is as follows: firstly, obtaining a pinyin string input by a user; then, matching in a special word bank according to the input pinyin string; if the phrase content which accords with the input pinyin string is matched in the medical special word stock, the phrase content is returned as a candidate result with higher priority; continuing to use the general algorithm to convert the input pinyin string and the candidate Chinese characters, wherein the result priority is lower than the retrieval result in the medical special word bank; and combining and sorting the two candidate results, and returning the results to an input method interface for selection by a user.
Some of the above common medical input methods add extra steps, such as rule verification, to the pinyin-to-Chinese-character conversion, but the basic process is unchanged. In the pinyin completion process of a common pinyin input method, the probability of outputting the expected result for long-string input is low, which affects doctors' working efficiency.
The HMM is a basic statistical model. It introduces a set of hidden states into a standard Markov process, together with probabilistic relationships between the observed states and the hidden states, and thus describes a Markov process with hidden states.
A problem suited to an HMM generally has two main features:
1. the problem is sequence-based, such as a time series or a state series;
2. the problem involves two kinds of data: one kind is observable, namely the observation sequence, and the other kind is not observable, namely the hidden state sequence, abbreviated as the state sequence.
In the HMM-based medical input method, consistent with the parameter definitions below, the Chinese characters are regarded as the hidden states of the HMM, and the pinyin entered by the user is regarded as the observed states.
In the HMM model used in the present invention, there are mainly the following parameters:
hidden state set, denoted $Q = \{q_1, q_2, \ldots, q_N\}$, where $N$ is the number of possible hidden states, corresponding in the invention to the number of all possible Chinese characters;
observation state set, denoted $V = \{v_1, v_2, \ldots, v_M\}$, where $M$ is the number of possible observation states, corresponding in the invention to the number of all possible pinyin;
hidden state sequence of length $T$, denoted $I = \{i_1, i_2, \ldots, i_T\}$, representing in the invention the Chinese character sequence corresponding to the pinyin sequence input by the user;
observation state sequence of length $T$, denoted $O = \{o_1, o_2, \ldots, o_T\}$, representing in the invention the pinyin sequence input by the user;
hidden state transition probability distribution, denoted $A$, describing the probability of transitioning from state $q_i$ at time $t$ to state $q_j$ at time $t+1$, corresponding in the invention to the transition probability between adjacent Chinese characters in medical documents;
observation state emission probability distribution, denoted $B$, describing the probability that hidden state $q_i$ at time $t$ emits observation state $v_k$, corresponding in the invention to the emission probability between Chinese characters and pinyin in medical documents;
initial state probability, denoted $\pi$, describing the probability of being in a given hidden state at time $t = 1$, corresponding in the invention to the occurrence probability of the first Chinese character of the Chinese characters corresponding to the input pinyin sequence.
Based on these parameters, converting a pinyin string into Chinese characters is the process of finding the most probable hidden state sequence for the given observation sequence.
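In standard HMM notation this search can be written as follows. The element notation is an assumption added here for illustration: $a_{ij}$ denotes the entry of $A$ for the transition from state $i$ to state $j$, $b_j(o)$ denotes the entry of $B$ for state $j$ emitting observation $o$, and $\pi_j$ denotes the initial probability of state $j$.

```latex
I^{*} = \arg\max_{I} P(I \mid O) = \arg\max_{I} P(I, O),
\qquad
P(I, O) = \pi_{i_1}\, b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t}\, b_{i_t}(o_t)
```

The second equality in the first expression holds because $P(O)$ does not depend on $I$.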
Based on the above background, the pinyin-chinese character conversion process of the present invention is mainly divided into two parts, a training stage and a using stage.
The training phase is as follows:
The training method of the HMM input method model for medical documents comprises the following steps:
acquiring all legal pinyin strings as the observation state set of the HMM model, and acquiring all Chinese characters corresponding to the pinyin strings as the hidden state set of the HMM model;
segmenting the medical document content used for training into single characters, and counting the probability of each character as the initial state probability of the HMM model;
converting all Chinese characters in the medical documents used for training into their pinyin, and counting the correspondence between Chinese characters and pinyin and the respective occurrence probabilities as the observation state emission probability of the HMM model;
counting the probability of each other Chinese character appearing after each Chinese character in the medical documents used for training, as the hidden state transition probability of the HMM model.
In one embodiment, the training method is applied to the following specific example:
1. Acquiring all legal pinyin strings (the observation state set) and all Chinese characters corresponding to them (the hidden state set);
For example, an entry keyed by "ceng" lists all the Chinese characters corresponding to the pinyin string "ceng", such as 层, 曾, 蹭 and 噌.
2. Segmenting the medical document content used for training into single characters, and counting the probability of each character as the initial probability π;
For example, "one": 0.0040433090858546655 and "seven": 6.855110316782822e-06 give the initial probabilities of the characters "one" and "seven".
3. Converting all Chinese characters in the medical documents used for training into their pinyin, and counting the correspondence between Chinese characters and pinyin and the respective occurrence probabilities as the emission probability B;
For example, 'and' (且): {'ju': 0.0006393861892583121, 'qie': 0.9993606138107417} gives the pinyin strings corresponding to the character 'and' and their probabilities.
4. Counting the probability of each other Chinese character appearing after each Chinese character in the medical documents used for training, as the hidden state transition probability;
For example, the entry for 'nose' gives the probabilities that the characters 'sinus', 'sticky' and 'longitudinal' appear after 'nose'; the values listed are 0.09741548503759896, 2.2052175447107856e-05 and 0.006450261318279049.
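The four training steps above amount to counting and normalizing. A minimal sketch follows, assuming the training text has already been converted into (character, pinyin) pairs by some external tool; the function names and the input format are hypothetical. Following step 2, the initial probability here is each character's share of all characters in the corpus.

```python
from collections import Counter, defaultdict

def normalize(counter):
    """Convert raw counts into a probability distribution."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def train_char_hmm(annotated_sentences):
    """annotated_sentences: one list of (character, pinyin) pairs per sentence."""
    char_counts = Counter()            # step 2: per-character occurrence counts
    emission = defaultdict(Counter)    # step 3: character -> pinyin counts
    transition = defaultdict(Counter)  # step 4: character -> next-character counts
    for sentence in annotated_sentences:
        for char, pinyin in sentence:
            char_counts[char] += 1
            emission[char][pinyin] += 1
        for (prev, _), (nxt, _) in zip(sentence, sentence[1:]):
            transition[prev][nxt] += 1
    return (
        normalize(char_counts),
        {char: normalize(counts) for char, counts in emission.items()},
        {char: normalize(counts) for char, counts in transition.items()},
    )
```

The three returned dictionaries have the same shape as the example entries quoted in steps 2 to 4.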
Through the above process, the basic content of an HMM model is obtained and the model is saved. The HMM input method model comprises:
the observation state set module, used for acquiring all legal pinyin strings;
the hidden state set module, used for acquiring all Chinese characters corresponding to the pinyin strings;
the initial state probability distribution module, used for segmenting the medical document content used for training into single characters and counting the probability distribution of each character;
the observation state emission probability distribution module, used for converting all Chinese characters in the medical documents used for training into their pinyin and counting the correspondence between Chinese characters and pinyin and the probability distribution of their occurrence;
the hidden state transition probability distribution module, used for counting the probability distribution of other Chinese characters appearing after each Chinese character in the medical documents used for training.
The use stages are as follows:
The input method based on medical document content comprises the following steps:
obtaining a pinyin string, and segmenting and completing it;
taking the segmented and completed pinyin string as the observation state sequence of the HMM model and inputting it into the HMM input method model;
outputting Chinese character strings: the HMM input method model searches for the most probable hidden state sequence corresponding to the observation state sequence, thereby converting the pinyin string into Chinese characters, and returns the several Chinese character strings with the highest probability.
In one embodiment, the application of the method of use in the specific example is as follows:
1. Obtaining the pinyin string input by the user, and segmenting and completing it according to a specific method;
For example, if the user inputs "houniaoguan", the result after segmentation and completion is "shu niao guan".
2. Using the segmented and completed pinyin string (the observation state sequence) together with the trained model to convert the pinyin into Chinese characters (the hidden state sequence);
In the invention, pinyin-to-Chinese-character conversion based on the trained HMM model is mainly solved with the Viterbi algorithm. The Viterbi algorithm is a general dynamic programming algorithm for finding the optimal path through a sequence (formulated as a shortest-path problem, which here means the most probable path). In brief, starting from the initial state, at each state transition the algorithm records, for each state, the maximum probability over all paths that reach that state at that moment, and continues forward from these maxima until the end; finally the whole path is traced back, which gives the required optimal path.
FIG. 1 shows a lattice representing the transitions of Chinese characters over the observation sequence (the pinyin sequence). For each intermediate and final state in the lattice there is a most probable path reaching that state; for example, each of the three states at time t = 3 has a most probable path reaching it, as may be shown in FIG. 2. On this basis, in the pinyin-to-Chinese-character conversion, each state at the final time has a local probability and a corresponding best path, so the global best path, that is, the final Chinese character string, can be determined by selecting the state with the highest local probability at that time (together with its best path). The resulting Chinese character strings are returned; usually the several results with the highest probability are returned for the doctor to choose from.
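The recursion described here is usually written as below. The symbols $\delta$ and $\psi$ and the element notation $a_{ij}$, $b_j(o_t)$, $\pi_j$ are standard textbook notation assumed for illustration; they do not appear in the original text.

```latex
\begin{aligned}
\delta_1(j) &= \pi_j\, b_j(o_1), \\
\delta_t(j) &= \max_{i} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \qquad
\psi_t(j) = \arg\max_{i} \left[ \delta_{t-1}(i)\, a_{ij} \right], \qquad t = 2, \ldots, T, \\
i_T^{*} &= \arg\max_{j} \delta_T(j), \qquad
i_t^{*} = \psi_{t+1}\!\left(i_{t+1}^{*}\right), \qquad t = T-1, \ldots, 1.
\end{aligned}
```

Here $\delta_t(j)$ is the local probability of the best path ending in state $j$ at time $t$, and $\psi_t(j)$ is the back-pointer used when tracing the path back from the final time.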
The above is the complete training and using process of the HMM-based medical input method proposed by the present invention.
The method obtains probability information by statistical learning on large-scale medical documents. During pinyin-to-Chinese-character conversion it uses the HMM algorithm to consider all possible Chinese character results, improves the conversion of longer pinyin strings by using information such as the transition probability between characters, scores all possible Chinese character results with the available statistical information, and screens and returns the candidates according to their scores.
Obtaining the HMM model by statistical learning avoids the problems of dictionary construction in methods based on a dedicated medical dictionary and requires essentially no extra labor cost; training on single characters also avoids dependence on word segmentation accuracy.
In the conversion process, the method of the invention comprehensively calculates the probability and the score of various possible Chinese character results through the probability information in the model to obtain candidate results, thereby avoiding the dependence on the size and the coverage of the word stock in the method based on the word stock.
Meanwhile, the HMM model-based method adopted by the invention considers information such as the co-occurrence probability of the context when calculating the Chinese character combination probability and the score, so that a relatively more reasonable candidate result can be given by combining the context when long-string input is encountered, and the result accuracy of the long-string input is improved.
The medical special input method based on the HMM fully utilizes probability information existing in large-scale training data, and avoids introducing extra labor cost. In the process of pinyin and Chinese character conversion, the problems of strong dependence on word stock, low accuracy of long-string input results and the like caused by a medical word stock-based method are solved.
In the training process, the model takes various statistical information into account, such as the correspondence between pinyin and Chinese characters and the co-occurrence relationships between Chinese characters. As a result, the candidates better meet doctors' expectations than those of a general input method or a dedicated lexicon, are closer to the wording of the training data, and in experiments the effect was clearly better than that of a traditional medical input method using a dedicated lexicon.
Based on the characteristics, the method provided by the invention not only improves the accuracy of Chinese character results, but also improves the working efficiency of doctors for inputting medical documents while saving labor cost.
The technical key point of the invention is that the HMM-based pinyin Chinese character conversion algorithm is adopted in the implementation process of the medical special input method. In the training process, various probability information is obtained through statistical learning of large-scale medical documents, and in the actual use process, the probability calculation and comparison are carried out on possible Chinese character combinations by comprehensively considering known various probability information. The mature HMM algorithm is applied to the statistical learning of the medical documents, the traditional method of adding a special word bank by a general input method is replaced, the labor cost is saved, meanwhile, the accuracy of medical term input is greatly improved, the input efficiency of long-string medical documents is improved, the time for inputting the medical documents by doctors is saved, and therefore the working efficiency of the doctors is improved.
The above schemes describe in detail the training method of the HMM input method model for medical documents, the input method model, and the input method.
In one scheme, an HMM-based pinyin completion method is provided for the step of segmenting and completing the pinyin string in the use stage of each scheme of the embodiment (it likewise applies to the segmentation and completion steps described in the summary of the invention). Its main difference from the completion process of other common pinyin input methods is that the HMM method is used for statistical learning when extracting information from the training data.
The technical problem is raised against the following background:
With the progress of hospital informatization, doctors inevitably need to enter various kinds of text, such as electronic medical records and examination reports, in the course of their work. The input method is the main point of interaction between doctor and computer, so its accuracy and applicability strongly affect the doctor's working efficiency.
At present, most doctors use a pinyin input method designed for the general domain. Because most doctors type relatively slowly, they tend to want the intended Chinese character result from as little input as possible.
A reasonably complete pinyin input method should, when the user enters only partial pinyin such as the first letter or a prefix of a syllable, infer the complete pinyin according to an established rule or algorithm and give the corresponding Chinese character candidates.
When a common pinyin input method performs pinyin completion, it relies mostly on phrase frequency, and when the input is long it is difficult to give an accurate candidate. For example, when "yxxt" is entered, the completion obtained from phrase frequency may be "yi xia xi tong", so the top candidate is the Chinese phrase for that pinyin, which is clearly not the expected result. This lowers doctors' text entry efficiency and increases the probability of entering wrong information.
In a common pinyin input method, pinyin completion is based on frequency information. The main process is as follows: the pinyin strings in the training data and their frequencies are counted and stored in a suitable data structure, such as a Trie; when the input is incomplete pinyin, prefix matching is performed against the stored information, the complete pinyin with the highest frequency among the matches is taken as the completion result, and the corresponding Chinese character candidates are given.
In a concrete implementation, the data storage structure and the matching process may be optimized according to different strategies; the process is outlined below using one such method as an example. First, the training data is cleaned and segmented; for example, if the original content is "pancreas morphology is normal", a possible segmentation result is "pancreas / morphology / normal". Then the pinyin corresponding to each phrase in the training data is obtained; for example, the phrase "pancreas" yields the pinyin "yi xian". The frequency of every phrase is counted, and the phrases, their pinyin and their frequencies are stored in the chosen data structure. These steps produce a phrase library based on the training data that contains the corresponding pinyin and frequency information. When "yxxt" is entered, the input method splits it according to the pinyin composition rules into the pattern "y- x- x- t-", where "-" indicates that one or more arbitrary letters can be matched. Matching this pattern against the phrase library may yield several combinations, such as "yi xia xi tong" and "yi xian xing tai". The matched phrase pinyin combinations are sorted by frequency, the one with the highest frequency is selected as the completion result, and pinyin-to-Chinese-character conversion is performed on that result. This is, roughly, the pinyin completion process of a conventional pinyin input method.
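The conventional frequency-based completion described above can be illustrated with a minimal Python sketch. The phrase library `PHRASE_FREQ`, its counts, and the function names are hypothetical and chosen only for illustration; a production input method would likely use a Trie rather than a linear scan.

```python
import re

# Hypothetical phrase library: space-separated pinyin -> frequency (made-up counts).
PHRASE_FREQ = {
    "yi xia xi tong": 120,
    "yi xian xing tai": 45,
    "yi xue ying xiang": 30,
}

def build_pattern(segments):
    """Turn segmented input such as ['y', 'x', 'x', 't'] into a regex in which
    every segment must be followed by one or more arbitrary letters."""
    parts = [re.escape(seg) + r"[a-z]+" for seg in segments]
    return re.compile("^" + " ".join(parts) + "$")

def complete_by_frequency(segments, phrase_freq):
    """Scan the whole library and return matching phrases, most frequent first."""
    pattern = build_pattern(segments)
    matches = [(p, f) for p, f in phrase_freq.items() if pattern.match(p)]
    return sorted(matches, key=lambda item: item[1], reverse=True)

ranked = complete_by_frequency(["y", "x", "x", "t"], PHRASE_FREQ)
best = ranked[0][0] if ranked else None
```

With these made-up counts the most frequent match wins regardless of context, which is the behaviour the text criticizes: every lookup scans all phrases, and the ranking ignores co-occurrence between neighbouring syllables.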
The pinyin completion process of a common pinyin input method has the following main problems:
First, because the matching resembles fuzzy matching, every match has to traverse all possible completion results and combinations; when the input is short and the phrase library is large, matching efficiency suffers.
Second, because candidates are screened by phrase frequency, they depend entirely on the training data; if the training data contains word segmentation errors, or a phrase does not occur in the training data, the method cannot complete the input well.
Third, when the incomplete pinyin string is long and is not processed further, it is difficult to match a suitable candidate directly in the phrase library. If the input string is processed before matching, the extra processing step reduces completion efficiency. In addition, because candidate selection depends on the frequency of each individual phrase and ignores the co-occurrence probability between phrases, the accuracy of completion for long pinyin strings suffers.
The HMM is a basic statistical model that introduces a set of hidden states into a standard Markov process, together with probabilistic relationships between the observed states and the hidden states; it thus describes a Markov process with hidden states.
Based on the above background, in the pinyin completion process the incomplete pinyin string to be completed is treated as the observed state and the completed pinyin as the hidden state, which turns pinyin completion into the HMM decoding problem (given an observation sequence, find the most likely corresponding hidden state sequence).
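In conventional HMM notation, the decoding problem just described can be written as follows. The symbols are assumptions introduced here for illustration and do not appear in the original text: $o_1,\dots,o_n$ are the segmented strings to be completed, $q_1,\dots,q_n$ the complete pinyin, $\pi$ the initial probability, $a$ the transition probability and $b$ the emission probability.

```latex
\hat{q}_{1:n}
  = \arg\max_{q_1,\dots,q_n}\;
    \pi(q_1)\, b_{q_1}(o_1)\,
    \prod_{t=2}^{n} a(q_{t-1}, q_t)\, b_{q_t}(o_t)
```

Each factor corresponds to one of the three tables produced by the training phase, so a completion is scored by how likely its first syllable is, how likely each syllable follows its predecessor, and how likely each syllable is to be typed as the given abbreviation.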
The pinyin completion process of the invention is mainly divided into a training stage and a using stage.
The main steps of the training phase are as follows:
In one embodiment, the pinyin completion model training method
comprises the following steps:
S1, obtaining the legal complete single-character pinyin strings;
S2, obtaining the possible pinyin strings to be completed corresponding to the legal complete single-character pinyin strings;
S3, obtaining the correspondence between the pinyin strings to be completed and the complete pinyin strings;
S4, obtaining the complete pinyin content corresponding to the training data, wherein the pinyin content is separated character by character;
S5, performing statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
In one embodiment, the training method is applied in the following specific example:
1. Acquiring all legal complete single-character pinyin strings;
All legal complete single-character pinyin strings can be obtained from the pinyin composition rules. For example, "yi" and "xian" are legal complete single-character pinyin strings, whereas "y" and "bia" are incomplete pinyin strings.
2. Acquiring all possible pinyin strings to be completed for every legal complete single-character pinyin string;
In the invention, the possible strings to be completed for a single-character pinyin are its prefixes, excluding those prefixes that are themselves legal pinyin, plus the single-character pinyin itself. For example, the prefixes of the single-character pinyin "xian" are "x", "xi" and "xia", of which "xi" and "xia" are legal pinyin strings, so the strings to be completed for "xian" are "x" and "xian"; that is, when the input method receives either "x" or "xian", the result "xian" may be obtained.
3. Acquiring the correspondence between the pinyin strings to be completed and the complete pinyin strings;
Based on the strings to be completed obtained in the previous step, all strings to be completed are collected together with the complete pinyin strings each of them may correspond to. For example, an entry that maps "bia" to "bian" and "biao" indicates that the possible complete pinyin strings for the string to be completed "bia" are "bian" and "biao".
4. Acquiring the complete pinyin content corresponding to the training data (namely the existing medical text);
The complete pinyin content corresponding to the training data is obtained with a Chinese-character-to-pinyin conversion tool, and the pinyin content is separated character by character. For example, for the text "pancreas morphology is normal", the corresponding pinyin content is "yi xian xing tai zheng chang".
5. Performing statistical learning on the training data to obtain the following contents:
Initial probability: the probability that the complete pinyin A appears in the data. For example, "bin": 0.0032817207121452023 indicates that the probability of occurrence of the pinyin "bin" is 0.0032817207121452023.
Emission probability: for a given complete pinyin, the probability that it is completed from a particular string to be completed, taken over all strings to be completed that correspond to that complete pinyin. For example, "bao": {"b": 0.004975124378109453, "bao": 0.9950248756218906} means that the probability that "bao" is completed from "b" is 0.004975124378109453 and the probability that "bao" is completed from "bao" is 0.9950248756218906.
Transition probability: the probability that the complete pinyin A is followed by the complete pinyin B. For example, "yi": {"an": 3.47512365946466e-07, "chang": 0.009643713457860983} means that the probability that "yi" is followed by "an" is 3.47512365946466e-07 and the probability that "yi" is followed by "chang" is 0.009643713457860983.
At this point the training process is finished, yielding the correspondence table between strings to be completed and complete strings (step 3), the initial probability (step 5), the emission probability (step 5) and the transition probability (step 5).
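Steps 2 and 3 of the training phase can be sketched in a few lines of Python. The set `LEGAL_PINYIN` is a tiny hypothetical subset used only for illustration (a real table would list every legal syllable), and the function names are assumptions, not names from the original.

```python
from collections import defaultdict

# A tiny hypothetical subset of the legal single-character pinyin set.
LEGAL_PINYIN = {"xi", "xia", "xian", "bi", "bian", "biao"}

def strings_to_complete(pinyin, legal):
    """Step 2: proper prefixes that are not legal pinyin themselves,
    plus the complete pinyin itself."""
    prefixes = [pinyin[:i] for i in range(1, len(pinyin))]
    return [p for p in prefixes if p not in legal] + [pinyin]

def build_correspondence(legal):
    """Step 3: map each string to be completed to the complete pinyin
    strings it may stand for."""
    table = defaultdict(list)
    for full in sorted(legal):
        for partial in strings_to_complete(full, legal):
            table[partial].append(full)
    return dict(table)

table = build_correspondence(LEGAL_PINYIN)
```

On this subset, `strings_to_complete("xian", LEGAL_PINYIN)` gives `["x", "xian"]` and `table["bia"]` gives `["bian", "biao"]`, matching the two examples in the text.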
Through the above process, the basic content of a pinyin completion model is obtained, and the model is saved. The pinyin completion model comprises:
a pinyin string acquisition module, which holds the legal complete single-character pinyin strings;
a to-be-completed pinyin string module, which holds the possible pinyin strings to be completed corresponding to the legal complete single-character pinyin strings;
a correspondence module, which establishes the correspondence between the pinyin strings to be completed and the complete pinyin strings;
a separation module, which separates the complete pinyin content corresponding to the training data character by character;
and a statistical learning module, which performs statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
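The statistical learning module could, under simple assumptions, estimate the initial and transition probabilities by relative frequency, as in the sketch below. The two-sentence `corpus` is made up, no smoothing is applied, and the emission probability is left out because the text does not spell out how its counts are collected.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus, already converted to complete pinyin
# and separated character by character (step 4).
corpus = [
    "yi xian xing tai zheng chang".split(),
    "yi xian wei jian yi chang".split(),
]

def estimate(sentences):
    """Relative-frequency estimates of initial and transition probabilities."""
    unigram = Counter()
    bigram = defaultdict(Counter)
    for sent in sentences:
        unigram.update(sent)
        for prev, nxt in zip(sent, sent[1:]):
            bigram[prev][nxt] += 1
    total = sum(unigram.values())
    # Initial probability: share of all syllable occurrences.
    initial = {p: c / total for p, c in unigram.items()}
    # Transition probability: share of each follower among all followers of prev.
    transition = {
        prev: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for prev, followers in bigram.items()
    }
    return initial, transition

initial, transition = estimate(corpus)
```

The resulting dictionaries have the same shape as the examples in the text: `initial` maps a complete pinyin to a probability, and `transition` maps a complete pinyin to the probabilities of the syllables that follow it.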
The main steps of the use stage are as follows:
The HMM-based pinyin completion method comprises
performing pinyin completion on the segmented pinyin string according to the pinyin completion model:
the probability of every possible completion result is calculated, together with the probability of each completion result of one pinyin string occurring after each completion result of the preceding pinyin string; the score of each complete completion result is then calculated, and the item with the highest score is taken as the final completion result.
1. Acquiring the user's input string and segmenting it according to the pinyin composition rules;
If the user enters "yxxt", segmentation yields "y x x t".
2. Performing pinyin completion on the segmented input string using the trained model;
Using the trained model, the probabilities of all possible completion results for each part of the segmented input are calculated. For "y x x t", for example, this covers the probabilities of all possible completions of "y", "x", "x" and "t", and the probabilities of each completion of one part occurring after each completion of the preceding part (such as each completion of "x" after each completion of "y"). The scores of all complete completion results are calculated from this information, and the item with the highest score is taken as the final completion result.
3. Performing the subsequent pinyin-to-Chinese-character conversion on the completion result;
In this method, statistical probability information is learned from large-scale training data. During pinyin completion, the HMM algorithm is used to consider the probability information of all possible completion results together, the overall score of each result is calculated in light of the context, and the candidates are finally screened by score.
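Step 1 of the use stage, segmenting the raw input, is only described as following the pinyin composition rules. One simple possibility, offered here purely as an assumption, is a greedy longest match against the known strings to be completed; `KNOWN` is a small hypothetical set, and a real input method would need to handle ambiguous splits that a greedy match gets wrong.

```python
def segment(raw, known):
    """Greedy longest-match split of raw input into strings found in `known`."""
    segments, i = [], 0
    while i < len(raw):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(raw), i, -1):
            if raw[i:j] in known:
                segments.append(raw[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment input at position {i}: {raw[i:]!r}")
    return segments

# Hypothetical set of strings to be completed (keys of the correspondence table).
KNOWN = {"y", "x", "t", "yi", "xi", "xia", "xian", "xing", "tai", "tong"}

parts = segment("yxxt", KNOWN)
mixed = segment("yixianxt", KNOWN)
```

Here `parts` is `["y", "x", "x", "t"]`, as in the example of step 1, and `mixed` shows that fully typed syllables and abbreviations can appear in the same input.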
When searching for candidates, scores can be calculated with a suitable algorithm from the pre-trained model information, which avoids traversing all completion results and improves matching efficiency. During matching, the score combines information such as the occurrence probability of single characters and the co-occurrence probability of the context, so a completion result can be given even where the training data contains word segmentation errors. When the incomplete pinyin string is long, the HMM-based completion method can take the co-occurrence probability of the context into account, so the completion result better matches expectations.
The HMM-based pinyin completion method makes full use of the probability information in large-scale training data, such as occurrence probability and co-occurrence probability, and addresses the inaccurate completion and low efficiency that result from considering only phrase frequency in the traditional method. Because various kinds of statistical information are considered when scoring each candidate completion, the result better matches user expectations and is closer to the wording of the training data than with the traditional method; with medical text as training data, the completion accuracy is clearly better than that of the traditional method.
In addition, the method keeps only the statistical information among single-character pinyin strings, so the generated model file is much smaller than one storing phrase frequency information, which saves memory.
During completion, the HMM-based method automatically discards lower-scoring partial candidates while calculating the score of each candidate, and is therefore more efficient than the conventional method.
The technical key point of the invention is that an HMM-based completion algorithm is used in the pinyin completion process of the pinyin input method. During training, various kinds of probability information are obtained through statistical learning on large-scale medical training data; during completion, the known probability information is considered together to calculate and compare the scores of the possible candidate completions, and the result with the highest score is taken as the completion result. Applying the HMM method to pinyin completion improves completion efficiency and the accuracy of the completion result, makes the result better match the wording habits of the training data, and improves the user experience.
In one embodiment, the invention discloses an HMM-based pinyin completion input method, which differs from existing input methods in that the input string is completed according to the above completion method, the completion result is converted from pinyin to Chinese characters, and the Chinese character string is output.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it; any equivalent substitution or change that a person skilled in the art makes to the technical solution and inventive concept of the invention, within the technical scope disclosed by the invention, falls within the scope of protection of the invention.
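The text does not name the scoring algorithm, but the standard way to solve the HMM decoding problem without enumerating every combination is the Viterbi algorithm. The sketch below assumes that choice; all probabilities are made-up toy numbers, `floor` is an assumed stand-in for unseen events, and the function name is hypothetical.

```python
import math

def viterbi_complete(segments, table, initial, emission, transition, floor=1e-12):
    """Return the highest-scoring sequence of complete pinyin for `segments`.

    table:      string to be completed -> candidate complete pinyin
    initial:    complete pinyin -> probability
    emission:   complete pinyin -> {string to be completed: probability}
    transition: complete pinyin -> {next complete pinyin: probability}
    """
    def log_p(value):
        return math.log(value if value > 0 else floor)

    # best[state] = (log score of the best path ending in state, that path)
    best = {}
    for state in table[segments[0]]:
        emit = log_p(emission.get(state, {}).get(segments[0], 0))
        best[state] = (log_p(initial.get(state, 0)) + emit, [state])

    for seg in segments[1:]:
        new_best = {}
        for state in table[seg]:
            emit = log_p(emission.get(state, {}).get(seg, 0))
            # Keep only the best predecessor; lower-scoring paths are dropped.
            prev, (prev_score, prev_path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + log_p(transition.get(kv[0], {}).get(state, 0)),
            )
            step = log_p(transition.get(prev, {}).get(state, 0))
            new_best[state] = (prev_score + step + emit, prev_path + [state])
        best = new_best

    return max(best.values(), key=lambda item: item[0])[1]

# Toy model with made-up numbers, for illustration only.
table = {"y": ["yi", "ying"], "x": ["xia", "xian", "xi", "xing"], "t": ["tong", "tai"]}
initial = {"yi": 0.6, "ying": 0.4}
states = ["yi", "ying", "xia", "xian", "xi", "xing", "tong", "tai"]
emission = {s: {"y": 0.1, "x": 0.1, "t": 0.1} for s in states}
transition = {
    "yi": {"xian": 0.5, "xia": 0.2},
    "xian": {"xing": 0.6},
    "xia": {"xi": 0.3},
    "xing": {"tai": 0.7},
    "xi": {"tong": 0.4},
}

result = viterbi_complete(["y", "x", "x", "t"], table, initial, emission, transition)
```

With these toy numbers the path "yi xian xing tai" scores higher than "yi xia xi tong" because the transition probabilities along it are larger, which illustrates how context can override raw frequency. At each step only one best path per candidate syllable is kept, so the work grows linearly with the input length rather than with the number of full combinations.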

Claims (7)

1. An HMM-based pinyin completion training method, characterized by comprising the following steps:
S1, obtaining the legal complete single-character pinyin strings;
S2, obtaining the possible pinyin strings to be completed corresponding to the legal complete single-character pinyin strings;
S3, obtaining the correspondence between the pinyin strings to be completed and the complete pinyin strings;
S4, obtaining the complete pinyin content corresponding to the training data, wherein the pinyin content is separated character by character;
S5, performing statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
2. The HMM-based pinyin completion training method of claim 1, wherein:
step S1 is: obtaining the legal complete single-character pinyin strings according to the pinyin composition rules;
step S2 is: taking the prefixes of the single-character pinyin that remain after removing those that are themselves legal pinyin, plus the single-character pinyin itself, as the possible pinyin strings to be completed corresponding to the single-character pinyin;
step S3 is: for the pinyin strings to be completed, collecting the pinyin strings to be completed and the correspondence between each of them and the complete pinyin strings it can be completed to.
3. A pinyin completion model obtained by the training method of claim 1 or 2.
4. A pinyin completion model, comprising:
a pinyin string acquisition module, which holds the legal complete single-character pinyin strings;
a to-be-completed pinyin string module, which holds the possible pinyin strings to be completed corresponding to the legal complete single-character pinyin strings;
a correspondence module, which establishes the correspondence between the pinyin strings to be completed and the complete pinyin strings;
a separation module, which separates the complete pinyin content corresponding to the training data character by character;
and a statistical learning module, which performs statistical learning on the training data to obtain the initial probability, the emission probability and the transition probability.
5. An HMM-based pinyin completion method, characterized by comprising the following steps:
acquiring the user's input string and segmenting it according to the pinyin composition rules;
performing pinyin completion on the segmented input string with the pinyin completion model.
6. The HMM-based pinyin completion method of claim 5, wherein, using the segmented input string, the probabilities of occurrence of all possible completion results are calculated, together with the probability of each completion result of one input segment occurring after each completion result of the preceding input segment;
the scores of all complete completion results are calculated from this probability information, and the item with the highest score is taken as the final completion result.
7. An HMM-based pinyin completion input method, characterized in that the input string is completed according to the completion method of claim 5 or 6, the completion result is converted from pinyin to Chinese characters, and a Chinese character string is output.
CN201911265387.2A 2019-12-11 2019-12-11 Pinyin completion training method, completion model, completion method and completion input method based on HMM Active CN111144096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911265387.2A CN111144096B (en) 2019-12-11 2019-12-11 Pinyin completion training method, completion model, completion method and completion input method based on HMM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911265387.2A CN111144096B (en) 2019-12-11 2019-12-11 Pinyin completion training method, completion model, completion method and completion input method based on HMM

Publications (2)

Publication Number Publication Date
CN111144096A true CN111144096A (en) 2020-05-12
CN111144096B CN111144096B (en) 2023-09-29

Family

ID=70518009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911265387.2A Active CN111144096B (en) 2019-12-11 2019-12-11 Pinyin completion training method, completion model, completion method and completion input method based on HMM

Country Status (1)

Country Link
CN (1) CN111144096B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050209844A1 (en) * 2004-03-16 2005-09-22 Google Inc., A Delaware Corporation Systems and methods for translating chinese pinyin to chinese characters
CN101067780A (en) * 2007-06-21 2007-11-07 腾讯科技(深圳)有限公司 Character inputting system and method for intelligent equipment
CN102915122A (en) * 2012-07-19 2013-02-06 上海交通大学 Intelligent mobile platform Pinyin (phonetic transcriptions of Chinese characters) input method based on language models
CN105718070A (en) * 2016-01-16 2016-06-29 上海高欣计算机系统有限公司 Pinyin long sentence continuous type-in input method and Pinyin long sentence continuous type-in input system

Also Published As

Publication number Publication date
CN111144096B (en) 2023-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant