WO2013007210A1 - Text input method, apparatus and system (文字输入方法、装置及系统) - Google Patents


Info

Publication number
WO2013007210A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
word
language model
probability
candidate
Prior art date
Application number
PCT/CN2012/078591
Other languages
English (en)
French (fr)
Inventor
肖镜辉
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201110197062.2A external-priority patent/CN102880611B/zh
Priority claimed from CN201110209014.0A external-priority patent/CN102902362B/zh
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to JP2014519401A priority Critical patent/JP5926378B2/ja
Priority to US14/232,737 priority patent/US9176941B2/en
Priority to EP12811503.7A priority patent/EP2733582A4/en
Publication of WO2013007210A1 publication Critical patent/WO2013007210A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/166: Editing, e.g. inserting or deleting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02: Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023: Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233: Character input methods
    • G06F 3/0237: Character input methods using prediction or retrieval techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/12: Use of codes for handling textual entities
    • G06F 40/126: Character encoding
    • G06F 40/129: Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/274: Converting codes to words; Guess-ahead of partial word inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language

Definitions

  • Embodiments of the present invention relate to the field of text input, and in particular, to a text input method, apparatus, and system.

Background of the Invention
  • Input method software is a common text input system. Its usual operation flow is as follows:
  • The input method software receives the code sequence (such as pinyin or Wubi five-stroke codes) input by the user through the keyboard, then uses the code sequence as a parameter to look up, by means of the common language model, the sequence of candidate sentences corresponding to the code sequence, and calculates for each candidate sentence the probability that it will be selected and committed to the screen (the "upper screen" probability). The candidate sentence sequence is then sorted by upper screen probability and presented to the user.
  • The user only needs to select the desired words in the candidate sentence sequence to complete the input.
  • The traditional text input method generally uses the common language model as the core of the input method.
  • This universal language model is obtained by statistical analysis of a large-scale training corpus.
  • The large-scale training corpus is usually collected automatically from the Internet and represents the general input requirements of most users; that is, the universal language model is built from the word choices most people make when entering text.
  • When users use input method software to enter text, however, they often want to quickly obtain the words they personally and habitually use.
  • Users differ in their choice of words because of different identities, hobbies and text input domains, so the candidate sentences they hope to see ranked first also differ. For example, researchers and bank staff often want the terms of their own domain to be ranked at the front when entering text.
  • In addition, the standard Ngram language model modeling method has obvious shortcomings.
  • The standard Ngram language model is a single model, while in practical applications the user's needs for Chinese input, handwriting recognition, speech recognition and the like are variable and unlimited; for example, users sometimes need to write technical reports and sometimes chat online.
  • Embodiments of the present invention provide a text input method to improve text input speed.
  • the embodiment of the invention further provides a text input device to improve the recognition accuracy of the text input.
  • a text input method includes the following steps:
  • the client obtains the user identifier, and searches for a corresponding user language model from the server according to the user identifier;
  • the client obtains user input, uploads the user input to a server, and the server generates a candidate statement list according to the user input;
  • the server acquires a common language model, and calculates an upper screen probability of the candidate statement in the candidate sentence list according to the user language model and the common language model;
  • the server sorts the candidate statements in the candidate statement list in order of upper screen probability and sends the sorted candidate statement list to the client; the client receives the sorted candidate statement list and outputs it.
  • a text input method includes the following steps:
  • the client obtains the user identifier, and searches for the corresponding user language model according to the user identifier;
  • the client obtains user input and generates a candidate statement list according to the user input; the client itself acquires a common language model and calculates the upper screen probability of the candidate statements in the candidate statement list according to the user language model and the common language model;
  • the client sorts the candidate statements in the candidate sentence list in order of magnitude of the upper screen probability, and outputs the sorted candidate sentence list.
  • a text input system including: a search module, configured to obtain a user identifier, and search for a corresponding user language model according to the user identifier;
  • a candidate statement list generating module configured to acquire user input, and generate a candidate statement list according to the user input
  • a probability calculation module configured to generate an upper screen probability of the candidate statement in the candidate statement list according to the user language model and the common language model;
  • a sorting module configured to sort the candidate statements in the candidate statement list according to the size order of the upper screen probability
  • An output module configured to output a sorted list of candidate statements.
  • a word processing system including a client and a server, wherein:
  • a client configured to obtain a user identifier, search for a corresponding user language model from the server according to the user identifier, obtain user input, upload the user input to the server, receive a list of candidate statements sorted by the server, and output the same;
  • a server configured to generate a candidate statement list according to the user input, obtain a common language model, calculate the upper screen probability of the candidate statements in the candidate statement list according to the user language model and the common language model, sort the candidate statements in the candidate statement list in order of upper screen probability, and deliver the sorted candidate statement list to the client.
  • a word processing device comprising: a universal language model module, a cache module, a cache-based language modeling module, and a hybrid model module, wherein
  • a universal language model module for receiving user input, respectively calculating a standard conditional probability of each word in the user input, and outputting to the hybrid model module;
  • a cache module configured to cache a statement output by the hybrid model module
  • a cache-based language modeling module, configured to calculate, according to a preset cache-based language modeling strategy and based on the user input and the cached statements, the cache conditional probability of each word in the user input, and to output it to the hybrid model module;
  • the hybrid model module is configured to calculate a fusion condition probability according to a standard condition probability of each word and a cache condition probability, obtain a statement probability of each output sentence based on the fusion condition probability, and select an output statement output with the highest probability.
  • The above text input method, device and system combine a user language model and a general language model, and the user language model can be trained from the user's own input, so that the top-ranked candidate statements in the sorted candidate list better match the user's language habits. This enables users to obtain the required candidate statements faster, improves the accuracy of text input, and improves the speed of text input.
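  • As a rough illustration of this combination (the names, probabilities and the unigram simplification below are assumptions for illustration, not the patent's implementation), the following Python fragment ranks candidate sentences by linearly interpolating a per-user language model with a common language model:

```python
# Minimal sketch (assumed names and probabilities, not the patent's implementation):
# rank candidate sentences by a linear interpolation of a per-user language model
# and a common (general) language model, both reduced to unigram probabilities here.
from typing import Dict, List

def interpolated_score(sentence: List[str],
                       user_lm: Dict[str, float],
                       common_lm: Dict[str, float],
                       alpha: float = 0.3) -> float:
    """Product over words of alpha * P_user(w) + (1 - alpha) * P_common(w)."""
    score = 1.0
    for w in sentence:
        p_user = user_lm.get(w, 1e-8)       # small floor probability for unseen words
        p_common = common_lm.get(w, 1e-8)
        score *= alpha * p_user + (1 - alpha) * p_common
    return score

def rank_candidates(candidates: List[List[str]],
                    user_lm: Dict[str, float],
                    common_lm: Dict[str, float]) -> List[List[str]]:
    """Sort candidate sentences by descending interpolated probability (the upper screen order)."""
    return sorted(candidates,
                  key=lambda s: interpolated_score(s, user_lm, common_lm),
                  reverse=True)

# Hypothetical per-user and general unigram probabilities:
user_lm = {"语言": 0.02, "模型": 0.03}
common_lm = {"语言": 0.001, "模型": 0.001, "摸": 0.002, "型": 0.002}
print(rank_candidates([["语言", "模型"], ["语言", "摸", "型"]], user_lm, common_lm)[0])
```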
  • BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings described here are provided for a further understanding of the present invention; for persons of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.
  • FIG. 1 is a schematic flow chart of a text input method in an embodiment
  • FIG. 2 is a schematic flow chart of a text input method in another embodiment
  • FIG. 3 is a schematic flow chart of a text input method in another embodiment
  • FIG. 4 is a schematic structural diagram of a character input system in an embodiment
  • FIG. 5 is a schematic structural diagram of a character input system in another embodiment
  • FIG. 6 is a schematic flowchart of a language modeling method according to an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a language modeling method according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a language modeling apparatus according to an embodiment of the present invention.

Mode for Carrying Out the Invention
  • a text input method includes the following steps:
  • Step S102 Acquire a user identifier, and search for a corresponding user language model according to the user identifier.
  • the user identifier is used to uniquely identify the user, and may be an account registered by the user on the input method software, an identification number assigned to the user, and an IP address, a MAC address, and the like associated with the device used by the user.
  • a user language model corresponding to the user identifier needs to be established, and the user language model is updated according to the term information input by the user each time the user inputs the term. Since the user language model is trained according to the entry information input by the user, it conforms to the user's personal language habits. After training to get the user language model, you can store the user language model locally or upload it to the server for storage.
  • Step S104 Acquire a user input, and generate a candidate statement list according to the user input.
  • The user input can be speech, handwritten characters, optical characters or a character string, etc.; candidate sentences matching the user input can be found from the lexicon by a conventional text input method, and a candidate statement list can be generated.
  • Step S106 Acquire a common language model, and calculate an upper screen probability of the candidate statement in the candidate sentence list according to the user language model and the general language model.
  • the general language model can be a traditional statistical language model obtained by statistical analysis of large-scale training corpora, which can be obtained from a large number of user-entered statements via the Internet.
  • the user language model is corresponding to the user's individual, and the user language model corresponding to different users is different.
  • the generic language model can be stored on the server or on the client.
  • The user language model is trained based on user input. It should be noted that when the input method software is used for the first time, the user language model has not yet been updated, so the upper screen probability of the candidate statement list can be calculated using only the common language model; the principle is the same as a traditional input method using the common language model and will not be repeated here.
  • After the user completes an input, the entry input by the user is recorded, the user language model is updated according to the entry information input by the user, and the user language model is stored in correspondence with the user identifier.
  • The next time text is input, the established user language model and the general language model can be used together to calculate the upper screen probability of the candidate statements in the candidate statement list.
  • the common language model and the user language model are stored together in the local client, and the user language model and the common language model can be directly obtained from the local client, and used to calculate candidate statements in the candidate statement list. The probability of the screen.
  • the client does not need to send any request to the server, and the method is also called "local input method".
  • the common language model and the user language model are stored in a server, and the server obtains a common language model and a user language model for calculating a screen probability of the candidate sentence in the candidate sentence list.
  • In this case the processing of the input method is performed by the server, and this method is also known as the "cloud input method".
  • Step S108 sorting the candidate statements in the candidate sentence list in order of the size of the upper screen probability.
  • The candidate sentences in the candidate statement list are sorted in order of upper screen probability. The higher a candidate sentence is ranked, the better it matches the user's language habits and the more likely it is to be what the user requires, so the user can select the desired candidate sentence more quickly, which improves the accuracy of text input and improves the speed of text input.
  • Step S110 outputting the sorted candidate statement list.
  • In one embodiment, step S110 is: outputting the candidate statement with the highest upper screen probability; this candidate statement is placed at the foremost position of the output list, and the user can quickly select it.
  • In another embodiment, step S110 is: outputting a first candidate statement with the highest upper screen probability processed by the local input method, and outputting a second candidate statement with the highest upper screen probability processed by the cloud input method.
  • The first candidate statement and the second candidate statement are both output in the output list, with the first candidate statement ranked first and the second candidate statement ranked after it. In this way, the user can quickly select between the highest-probability candidate statements obtained by the two input methods.
  • the text input method further includes the step of establishing a user language model corresponding to the user identifier and updating the user language model according to the entry information input by the user each time the user inputs the entry.
  • a user vocabulary corresponding to the user identifier is established, and each time the user inputs the vocabulary, the vocabulary information and the word frequency information input by the user are added to the user vocabulary.
  • When the user language model is updated, the term information and the word frequency information are obtained from the user vocabulary, the terms are segmented into words, and the user language model is updated according to the segmented words and the corresponding word frequencies.
  • the word frequency is the number of times the word appears in the user's lexicon.
  • the user language model and the common language model use the same language model, such as the Ngram language model, but the training sets are different.
  • the training set of the user language model is a set of all word sequences in the user lexicon, corresponding to a certain user;
  • the training set of the universal language model is a set of word sequences input by a large number of users, which can be obtained through the Internet.
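  • A minimal sketch of such a per-user lexicon is given below (the data structures and the identifier are assumptions for illustration; the patent does not prescribe them): word frequencies are recorded for every committed entry, and a unigram user language model is re-estimated from them.

```python
# Minimal sketch (assumed details, not the patent's exact data structures): a per-user
# vocabulary keyed by user identifier, updated with word frequencies on every committed
# entry, from which a unigram user language model is re-estimated.
from collections import defaultdict
from typing import Dict, List

class UserLexicon:
    def __init__(self) -> None:
        # user_id -> {word -> frequency in that user's lexicon}
        self.word_freq: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def add_entry(self, user_id: str, segmented_words: List[str]) -> None:
        """Record the words of one committed entry (already segmented) for this user."""
        for w in segmented_words:
            self.word_freq[user_id][w] += 1

    def user_unigram_model(self, user_id: str) -> Dict[str, float]:
        """Re-estimate P_user(w) = freq(w) / total frequency in the user's lexicon."""
        freqs = self.word_freq[user_id]
        total = sum(freqs.values())
        return {w: c / total for w, c in freqs.items()} if total else {}

lexicon = UserLexicon()
lexicon.add_entry("user_42", ["语言", "模型"])   # "user_42" is a hypothetical identifier
lexicon.add_entry("user_42", ["语言", "习惯"])
print(lexicon.user_unigram_model("user_42"))    # {'语言': 0.5, '模型': 0.25, '习惯': 0.25}
```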
  • The probability calculation formula of the user language model is:

    P_user(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

  • where C(w_{i-n+1} … w_i) denotes the number of times the word sequence w_{i-n+1} … w_i appears in the training set, and C(w_{i-n+1} … w_{i-1}) denotes the number of times the word sequence w_{i-n+1} … w_{i-1} appears in the training set.
  • The training set is the set of all word sequences in the user lexicon.
  • The user language model may employ a lower-order language model, such as the Unigram language model, which occupies less storage space than the Ngram language model and is particularly suitable for use on mobile terminals.
  • In this case the probability calculation formula of the user language model is:

    P_user(S) = ∏_{i=1}^{m} P_user(w_i)

  • where P_user(S) is the probability of a statement S = w_1 w_2 … w_m containing m words.
  • the user language model may also adopt a Bigram language model, which is faster to model than the above two language models, and is particularly suitable for cloud input methods.
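  • For comparison, the standard textbook Bigram forms corresponding to the Unigram formula above are stated here as an aid (the notation is as before and is not quoted from the patent):

```latex
% Standard Bigram forms, stated for comparison with the Unigram formula above
% (notation assumed: S = w_1 w_2 \dots w_m, C(\cdot) = count in the user's training set).
P_{\mathrm{user}}(S) = \prod_{i=1}^{m} P_{\mathrm{user}}(w_i \mid w_{i-1}),
\qquad
P_{\mathrm{user}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}.
```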
  • the step of calculating the upper screen probability of the candidate statement in the candidate sentence list according to the user language model and the common language model is specifically: linearly interpolating the user language model and the common language model to generate a hybrid model according to the hybrid model The upper screen probability of the candidate statement in the candidate list is calculated.
  • The general language model can adopt the traditional Ngram language model; the conditional probability in the user language model is then fused with the conditional probability in the common language model to calculate the fused conditional probability.
  • The calculation formula is:

    P(w_i | w_{i-n+1} … w_{i-1}) = α · P_user(w_i | w_{i-n+1} … w_{i-1}) + (1 - α) · P_general(w_i | w_{i-n+1} … w_{i-1})

  • where α is an interpolation coefficient between 0 and 1.
  • the upper screen probability of the candidate statement in the candidate statement list is the probability that the candidate statement calculated by the hybrid model may be selected by the user. The higher the probability of the upper screen, the higher the ranking of the candidate statements in the candidate statement list, and the user can quickly select the desired statement and improve the text input speed.
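  • A worked example with illustrative numbers (the coefficient and the probabilities are assumed, not taken from the patent):

```latex
% Illustrative numbers only: the interpolation coefficient and probabilities are assumed.
\alpha = 0.3,\quad
P_{\mathrm{user}}(w_i \mid h) = 0.03,\quad
P_{\mathrm{general}}(w_i \mid h) = 0.01
\;\Rightarrow\;
P(w_i \mid h) = 0.3 \times 0.03 + 0.7 \times 0.01 = 0.016 .
```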
  • a text input method including the following steps:
  • Step S202 The client obtains the user identifier, and searches for a corresponding user language model from the server according to the user identifier.
  • the user identifier is used to uniquely identify the user, and may be an account registered by the user on the input method software, an identification number assigned to the user, and an IP address, a MAC address, and the like associated with the device used by the user.
  • After the user is authenticated and logs in to the input method software, the client obtains the user identifier and uploads it to the server, and the server searches for the corresponding user language model.
  • the user language model corresponding to the user identifier is established on the server in advance, and each time the user inputs the term, the server acquires the term information input by the user and updates the user language model according to the term information input by the user. Since the user language model is stored on the server corresponding to the user ID, the user language model on the server can be continuously updated according to the user input, so the user language model on the server is more and more accurate, and the user uses the input method software on different clients. The server delivers the latest user language model to the client, so that the user language model can be synchronized and applied to different terminal devices.
  • Step S204 The client obtains user input, uploads the user input to the server, and the server generates a candidate statement list according to the user input.
  • the user input can be a voice, a handwritten character, an optical character or a string, etc.
  • the client uploads the user input to the server, and the server uses the traditional text input method to find a candidate statement matching the user input from the thesaurus, and generates a candidate statement list.
  • the processing of the text input method is performed by the server. This text input method is also called "cloud input method”.
  • Step S206 The server acquires a common language model, and calculates a screen probability of the candidate statement in the candidate sentence list according to the user language model and the general language model.
  • the general language model can be a traditional statistical language model. It is obtained through statistical analysis of large-scale training corpora. Large-scale training corpus can be obtained from a large number of user-entered sentences through the Internet.
  • the user language model corresponds to the user's individual, and the user language model corresponding to different users is different.
  • The user language model is trained based on user input. It should be noted that when the input method software is used for the first time, the user language model has not yet been updated, so only the common language model is used to calculate the upper screen probability of the candidate statement list; the principle is the same as a traditional input method using the common language model and will not be repeated here.
  • the entry entered by the user is recorded, and the user language model is updated according to the entry information input by the user, and the user language model is stored correspondingly with the user identifier, and the next time the text is input, the user may adopt The established user language model and the common language model are used together to calculate the upper panel probability of the candidate statements in the candidate statement list.
  • the text input method further includes the step of establishing a user language model corresponding to the user identifier on the server and updating the user language model according to the term information input by the user each time the user inputs the term.
  • a user vocabulary corresponding to the user identifier is established on the server, and each time the user inputs the vocabulary, the vocabulary information and the word frequency information input by the user are added to the user vocabulary.
  • When the user language model is updated, the term information and the word frequency information are obtained from the user vocabulary, the terms are segmented into words, and the user language model is updated according to the segmented words and the corresponding word frequencies.
  • the word frequency is the number of times the entry appears in the user's lexicon.
  • the user language model may adopt a Bigram language model, and the modeling method is as described above, and will not be described herein.
  • The step of the server calculating the upper screen probability of the candidate statements in the candidate statement list according to the user language model and the common language model is specifically: the server linearly interpolates the user language model and the common language model to generate a hybrid model, and calculates the upper screen probability of the candidate statements in the candidate statement list according to the hybrid model.
  • the upper screen probability of the candidate statement in the candidate statement list is the probability that the candidate statement calculated by the hybrid model may be selected by the user. The greater the probability of the upper screen, the higher the ranking of the candidate statements in the candidate statement list, and the user can quickly select the desired sentence and improve the text input speed.
  • Step S208: The server sorts the candidate statements in the candidate statement list in order of upper screen probability and delivers the sorted candidate statement list to the client.
  • Step S210 the client receives the sorted candidate statement list and outputs it.
  • The user can select the desired candidate statement from the candidate list, and the selected candidate statement is output from the input method software to different applications, such as text files, notepads, and presentation documents.
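  • As a rough sketch of this cloud round trip (the endpoint shape, field names and unigram scoring below are assumptions for illustration; the patent does not specify a wire format), the client uploads the user identifier and the raw input, and the server returns a ranked candidate list:

```python
# Hedged sketch of the "cloud input method" round trip described above. The request shape,
# field names (user_id, input) and the unigram scoring are assumptions for illustration.
import json
from typing import Dict, List

def score(sentence: List[str], user_lm: Dict[str, float],
          common_lm: Dict[str, float], alpha: float = 0.3) -> float:
    # Interpolated unigram score: alpha * P_user + (1 - alpha) * P_common per word.
    p = 1.0
    for w in sentence:
        p *= alpha * user_lm.get(w, 1e-8) + (1 - alpha) * common_lm.get(w, 1e-8)
    return p

def generate_candidates(raw_input: str) -> List[List[str]]:
    # Placeholder: a real server would look candidates up in its lexicon.
    return [[raw_input], list(raw_input)]

def handle_cloud_request(request_json: str,
                         user_models: Dict[str, Dict[str, float]],
                         common_lm: Dict[str, float]) -> str:
    """Server side: find the user language model by user_id, build and rank candidates,
    and send the sorted candidate list back to the client."""
    req = json.loads(request_json)
    user_lm = user_models.get(req["user_id"], {})          # empty model for first-time users
    ranked = sorted(generate_candidates(req["input"]),
                    key=lambda c: score(c, user_lm, common_lm), reverse=True)
    return json.dumps({"candidates": ranked}, ensure_ascii=False)

# Client side: upload the identifier and the input, then display the returned list.
reply = handle_cloud_request(json.dumps({"user_id": "user_42", "input": "shili"}),
                             {"user_42": {"shili": 0.05}}, {"shili": 0.01})
print(json.loads(reply)["candidates"][0])
```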
  • a text input method is also proposed.
  • FIG. 3 is a schematic flow chart of a text input method in another embodiment. As shown in Figure 3, the following steps are included:
  • Step S202 The client obtains the user identifier, and searches for the corresponding user language model according to the user identifier.
  • the user identifier is used to uniquely identify the user, and may be an account registered by the user on the input method software, an identification number assigned to the user, and an IP address, a MAC address, and the like associated with the device used by the user.
  • the user logs in to the input method software, and the client obtains the user identifier, and searches for the corresponding user language model according to the user identifier.
  • Step S204 The client obtains user input, and generates a candidate statement list according to the user input.
  • User input can be speech, handwriting, optical characters or strings.
  • the common language model and the user language model are stored together in the local client, and the user language model and the common language model can be directly obtained from the local client for calculating the upper screen probability of the candidate statement in the candidate statement list.
  • the client does not need to send any request to the server, and the method is also referred to as a "local input method.”
  • Step S206 The client itself acquires a common language model, and calculates a screen probability of the candidate sentence in the candidate sentence list according to the user language model and the common language model.
  • The general language model can be a traditional statistical language model obtained through statistical analysis of large-scale training corpora; the large-scale training corpus can be obtained from sentences input by a large number of users via the Internet.
  • the user language model is corresponding to the user, and the user language model corresponding to different users is different.
  • The user language model is trained according to user input. It should be noted that, for the first input using the input method software, the user language model has not yet been updated, so only the common language model is used to calculate the upper screen probability of the candidate statement list.
  • the method principle is the same as the traditional input method using the common language model, and will not be described here.
  • the entry entered by the user is recorded, and the user language model is updated according to the entry information input by the user, and the user language model is stored correspondingly with the user identifier, and the next time the text is input, the user may adopt The established user language model and the common language model are used together to calculate the upper panel probability of the candidate statements in the candidate statement list.
  • the user language model may adopt a Bigram language model, and the modeling method is as described above, and will not be described herein.
  • the step of the client calculating the upper screen probability of the candidate statement in the candidate sentence list according to the user language model and the common language model is specifically: the client linearly interpolates the user language model and the common language model to generate a hybrid model. Calculating the upper screen probability of the candidate statement in the candidate statement list according to the hybrid model.
  • the upper screen probability of the candidate statement in the candidate statement list is the probability that the candidate statement calculated by the hybrid model may be selected by the user. The higher the probability of the upper screen, the higher the ranking of the candidate statements in the candidate statement list, and the user can quickly select the desired statement and improve the text input speed.
  • Step S208 The client sorts the candidate statements in the candidate statement list according to the order of the upper screen probability, and outputs the sorted candidate statement list.
  • a text input system includes a lookup module 102, a candidate sentence list generation module 104, a probability calculation module 106, a sorting module 108, and an output module 110, where:
  • the searching module 102 is configured to obtain a user identifier, and search for a corresponding user language model according to the user identifier.
  • the user identifier is used to uniquely identify the user, and may be an account registered by the user on the input method software, an identification number assigned to the user, and an IP address, a MAC address, and the like associated with the device used by the user.
  • the text input system further includes a user language model building module 112 and a user language model updating module 114, wherein:
  • the user language model building module 112 is configured to establish a user language model corresponding to the user identification.
  • the user language model building module 112 can be located at the client or at the server, and the established user language model can be stored on the client or in the server.
  • the user language model update module 114 is configured to update the user language model according to the entry information entered by the user each time the user inputs the entry.
  • The user language model update module 114 can be located at the client or at the server, and the updated user language model can be stored on the client or uploaded by the client to the server for storage. In this way, the user language model on the server can be continuously updated according to user input, so the user language model on the server becomes more and more accurate.
  • The server delivers the latest user language model to the client, so that the user language model can be synchronized and applied to different terminal devices.
  • the candidate sentence list generation module 104 is configured to acquire user input and generate a candidate statement list based on the user input.
  • The user input can be speech, handwritten characters, optical characters or a character string, etc.; candidate sentences matching the user input can be found from the lexicon by a conventional text input method, and a candidate statement list can be generated.
  • the candidate statement list generation module 104 can be located at the server end.
  • the server uses a conventional text input method to find a candidate statement matching the user input from the vocabulary, and generates a candidate statement list.
  • the processing of the text input method is performed by the server.
  • This text input method is also called "cloud input method”.
  • the probability calculation module 106 is configured to obtain a common language model, and calculate a screen probability of a candidate sentence in the candidate sentence list according to the user language model and the general language model.
  • the general language model can be a traditional statistical language model. It is obtained through statistical analysis of large-scale training corpora. Large-scale training corpus can be obtained from a large number of user-entered sentences through the Internet.
  • the user language model corresponds to the user's individual, and the user language model corresponding to different users is different.
  • The user language model is trained according to user input. It should be noted that, for the first input using the input method software, the user language model has not yet been updated, so only the common language model is used to calculate the upper screen probability of the candidate statement list.
  • the method principle is the same as the traditional input method using the common language model, and will not be described here.
  • After the user completes an input, the entry input by the user is recorded, the user language model is updated according to the entry information input by the user, and the user language model is stored in correspondence with the user identifier. The next time text is input, the established user language model and the general language model can be used together to calculate the upper screen probability of the candidate statements in the candidate statement list.
  • the sorting module 108 is configured to sort the candidate statements in the candidate sentence list in order of magnitude of the upper screen probability.
  • The candidate sentences in the candidate statement list are sorted in order of upper screen probability. The higher a candidate sentence is ranked, the better it matches the user's language habits and the more likely it is to be what the user requires, so the user can select the desired candidate sentence more quickly, which improves the accuracy of text input and improves the speed of text input.
  • the output module 110 is configured to output a sorted list of candidate statements.
  • The user can select the desired candidate statement from the candidate list, and the selected candidate statement is output from the input method software to different applications, such as text files, notepads, presentation documents, etc.
  • The user language model updating module 114 is configured to record the term information and the word frequency information input by the user, obtain the term information and the word frequency information, perform word segmentation on the terms, and update the user language model according to the segmented words and the corresponding word frequencies.
  • The word frequency is the number of times the entry appears in the user lexicon.
  • The user language model and the common language model adopt the same type of language model, that is, the Ngram language model, but their training sets are different: the training set of the user language model is the set of all word sequences in the user lexicon, corresponding to a particular user, while the training set of the common language model is a set of word sequences input by a large number of users, which can be obtained through the Internet.
  • The probability calculation formula of the user language model is obtained by maximum likelihood estimation:

    P_user(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

  • where C(w_{i-n+1} … w_i) represents the number of occurrences of the word sequence w_{i-n+1} … w_i in the training set, and C(w_{i-n+1} … w_{i-1}) represents the number of times the word sequence w_{i-n+1} … w_{i-1} appears in the training set.
  • The training set is the set of all word sequences in the user lexicon.
  • The user language model may adopt a lower-order language model, such as the Unigram language model, which occupies less space than the Ngram language model and is particularly suitable for use on mobile terminals.
  • In this case the probability calculation formula of the user language model is:

    P_user(S) = ∏_{i=1}^{m} P_user(w_i)

  • where S = w_1 w_2 … w_m is a statement containing m words.
  • The user language model may also adopt a Bigram language model, which is faster to model than the above two language models and is particularly suitable for use in a cloud input method.
  • In this case the probability calculation formula of the user language model is:

    P_user(S) = ∏_{i=1}^{m} P_user(w_i | w_{i-1})
  • The probability calculation module 106 is further configured to linearly interpolate the user language model and the common language model to generate a hybrid model, and to calculate the upper screen probability of the candidate statements in the candidate statement list according to the hybrid model.
  • The common language model adopts the traditional Ngram language model, and the conditional probability in the user language model is fused with the conditional probability in the common language model to calculate the fused conditional probability.
  • The calculation formula is:

    P(w_i | w_{i-n+1} … w_{i-1}) = α · P_user(w_i | w_{i-n+1} … w_{i-1}) + (1 - α) · P_general(w_i | w_{i-n+1} … w_{i-1})

  • where α is an interpolation coefficient between 0 and 1.
  • the upper screen probability of the candidate statement in the candidate statement list is the probability that the candidate statement calculated by the hybrid model may be selected by the user. The higher the probability of the upper screen, the higher the ranking of the candidate statements in the candidate statement list, and the user can quickly select the desired statement and improve the text input speed.
  • the embodiment of the present invention also proposes a text input method and apparatus, which will be described in detail below.
  • the most commonly used language modeling methods include statistical language model modeling methods and Ngram language model modeling methods.
  • the statistical language model is based on probability theory and mathematical statistics theory, and is used to calculate the probability of a Chinese sentence, so that the probability of outputting the correct statement is greater than the probability of the wrong statement.
  • A Chinese sentence can be decomposed into several words. For a Chinese sentence S = w_1 w_2 … w_m containing m words (m is a natural number), the probability of the sentence (the probability that it is the correct output) can be decomposed into the product of the conditional probabilities of the individual words, namely:

    P(S) = ∏_{i=1}^{m} P(w_i | w_1 w_2 … w_{i-1})

  • where P(w_i | w_1 w_2 … w_{i-1}) is the conditional probability of the word w_i in the Chinese sentence. It can be seen from the above formula that the parameter space of the conditional probability P(w_i | w_1 w_2 … w_{i-1}) increases exponentially as the variable i increases; when i is large, the value of the conditional probability cannot be accurately estimated from the existing training corpus.
  • The training corpus refers to a set of ordered texts organized from large-scale training text according to certain categories by statistical methods; the training corpus can be processed at scale by a computer.
  • For this reason, the conditional probability P(w_i | w_1 w_2 … w_{i-1}) is simplified to different degrees, and the standard Ngram language model modeling method has been proposed.
  • The standard Ngram language model is currently the most commonly used statistical language model. It treats the Chinese sentence as a Markov sequence that satisfies the Markov property. Specifically, the standard Ngram language model makes the following basic assumptions about the conditional probability P(w_i | w_1 w_2 … w_{i-1}) in the statistical language model:
  • Time homogeneity: the conditional probability of the current word has nothing to do with its position in the Chinese sentence; together with the Markov property, the conditional probability of the current word thus depends only on a fixed number of preceding words.
  • The value of the conditional probability P(w_i | w_{i-n+1} … w_{i-1}) is estimated by the method of maximum likelihood estimation.
  • The estimation formula is as follows:

    P(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

  • where C(·) denotes the number of times the word sequence appears in the training corpus.
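  • A minimal Python sketch of this maximum likelihood estimation for the bigram case (n = 2) is given below; the toy corpus is invented purely for illustration:

```python
# Minimal sketch of maximum likelihood estimation for a standard Ngram (here bigram)
# language model, matching the count-ratio formula above. Corpus is an illustrative assumption.
from collections import defaultdict
from typing import Dict, List, Tuple

def train_bigram_mle(corpus: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}), with a <s> start token."""
    bigram_counts: Dict[Tuple[str, str], int] = defaultdict(int)
    history_counts: Dict[str, int] = defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] + sentence
        for prev, cur in zip(words, words[1:]):
            bigram_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = [["这", "是", "事实"], ["这", "是", "实时", "数据"]]
model = train_bigram_mle(corpus)
print(model[("这", "是")])   # 1.0: "是" always follows "这" in this toy corpus
```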
  • However, the standard Ngram language model modeling method also has obvious shortcomings.
  • The standard Ngram language model is a single model, but in practical applications the user's requirements for Chinese input, handwriting recognition, speech recognition and the like are variable and unlimited; for example, users sometimes need to write technical reports and sometimes chat online, and in these two situations the user's Chinese input needs are different. Likewise, users of different ages have different life experiences and speaking habits, which is reflected in Chinese input as differences in the content these users frequently enter. Therefore, a single model cannot meet the different needs of users of different ages, or of the same user in different input scenarios.
  • In addition, the standard Ngram language model itself does not have an automatic learning mechanism.
  • The parameters of the standard Ngram language model are fixed once it is trained, and cannot be learned and intelligently adjusted according to the user's input habits, so the recognition accuracy for user input is relatively low.
  • the character input method and device proposed by the embodiments of the present invention can meet the needs of different users for Chinese input and improve the recognition accuracy.
  • The existing standard Ngram language model used for language modeling is a single model, which cannot meet the different sentence-input needs of different users; and because it has no automatic learning mechanism, it cannot learn from and intelligently adjust to the user's input habits, so the recognition accuracy for user input is low.
  • To address the shortcomings of the standard Ngram language model, a cache-based language modeling method is proposed: the user's current input content is stored in a cache data structure, and the cached content is mathematically analyzed to establish a mathematical model of the user's input. The content of the cache data structure is continuously updated, so the method continuously learns the user's input habits and adapts to the user's input requirements as the user keeps typing, making human-computer interaction more intelligent.
  • As the user continues to input, the established mathematical model of the user's input becomes more and more accurate and better reflects the user's real input requirements, so that in the user's subsequent input the input content can be recognized more accurately by the established model, achieving dynamic learning of and adaptation to the user's input needs.
  • FIG. 6 is a schematic flowchart of a language modeling method according to an embodiment of the present invention. See Figure 6, the process includes:
  • Step 601 Receive user input, and calculate conditional probabilities of each word in the user input according to a pre-established standard Ngram language model
  • The user's input includes: input method (keyboard) input, handwriting recognition input, and voice recognition input, wherein:
  • for keyboard input, the keyboard processing program receives the characters input by the user, performs recognition processing according to the adopted input method to obtain pinyin information, and outputs the pinyin information to the mapper, which maps it to corresponding candidate Chinese characters to form the user input;
  • the preset handwriting recognition program extracts the information of the user's handwriting, acquires the handwriting information, and maps the handwriting information to the corresponding candidate Chinese characters through the mapping process of the mapper to form a user input;
  • the preset audio processing program samples, quantizes, filters, and denoises the input user audio to obtain audio information, and maps the audio information to corresponding candidate Chinese characters through the mapping process of the mapper. User input.
  • mapping processing of the mapper to the pinyin information, the handwriting information, and the audio information may be referred to the related technical documents, and details are not described herein again.
  • conditional probability of each word in the user input is calculated according to the standard Ngram language model, which is the same as the prior art, and will not be described herein.
  • Step 602: Determine that there is pre-cached user input, and calculate the cache conditional probability of each word in the user input according to a preset cache-based language modeling policy, based on the user input and the pre-cached user input.
  • The preset cache-based language modeling strategy formula is:

    P_cache(w_i | w_{i-n+1} … w_{i-1}) = f(t_i) · C_cache(w_{i-n+1} … w_i) / C_cache(w_{i-n+1} … w_{i-1})

  • where P_cache(w_i | w_{i-n+1} … w_{i-1}) is the cache conditional probability of the i-th word w_i;
  • C_cache(w_{i-n+1} … w_i) represents the number of times the word sequence w_{i-n+1} … w_i, that is, the sequence containing the i-th word and the preset constant (n-1) words before it, appears in the cached training corpus;
  • C_cache(w_{i-n+1} … w_{i-1}) represents the number of times the word sequence w_{i-n+1} … w_{i-1}, that is, the sequence containing the preset constant (n-1) words before the i-th word, appears in the cached training corpus;
  • f(t_i) is a function of time.
  • Step 603: Calculate a fusion conditional probability from the conditional probability of each word calculated by the standard Ngram language model and the conditional probability of each word calculated by the cache-based modeling strategy, and obtain the statement probability of each output statement based on the fusion conditional probabilities.
  • The formula for calculating the fusion conditional probability is:

    P(w_i | w_{i-n+1} … w_{i-1}) = λ · P_standard(w_i | w_{i-n+1} … w_{i-1}) + (1 - λ) · P_cache(w_i | w_{i-n+1} … w_{i-1})

  • where P_standard(w_i | w_{i-n+1} … w_{i-1}) is the standard conditional probability of the i-th word calculated based on the standard Ngram language model, P_cache(w_i | w_{i-n+1} … w_{i-1}) is the cache conditional probability, and λ is an interpolation coefficient between 0 and 1.
  • Step 604 Select the output statement with the highest probability to output and cache the output statement.
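  • A rough Python sketch of the loop in steps 601 to 604 is given below (the data structures, the decay function and the interpolation coefficient are assumptions for illustration, not the patent's values): each candidate is scored with the standard model and with counts over recently cached output, the two probabilities are fused by linear interpolation, and the chosen output is cached.

```python
# Hedged sketch of the cache-based flow in steps 601-604. The decay function and the
# coefficients are illustrative assumptions; a unigram cache stands in for the Ngram case.
from collections import deque
from typing import Deque, Dict, List

class CachedInputModel:
    def __init__(self, standard_lm: Dict[str, float],
                 alpha: float = 0.6, cache_size: int = 200) -> None:
        self.standard_lm = standard_lm                     # word -> standard probability
        self.alpha = alpha                                 # interpolation coefficient in (0, 1)
        self.cache: Deque[str] = deque(maxlen=cache_size)  # most recently committed words

    def cache_probability(self, word: str) -> float:
        """Count-based cache probability weighted by recency (newer occurrences count more)."""
        if not self.cache:
            return 0.0
        weighted = sum(1.0 / (age + 1)                     # illustrative time decay f(t) = 1/(t+1)
                       for age, w in enumerate(reversed(self.cache)) if w == word)
        return weighted / len(self.cache)

    def fused_score(self, sentence: List[str]) -> float:
        p = 1.0
        for w in sentence:
            p_std = self.standard_lm.get(w, 1e-8)
            p *= self.alpha * p_std + (1 - self.alpha) * self.cache_probability(w)
        return p

    def choose_and_cache(self, candidates: List[List[str]]) -> List[str]:
        best = max(candidates, key=self.fused_score)
        self.cache.extend(best)                            # step 604: cache the output statement
        return best

model = CachedInputModel({"事实": 0.002, "实时": 0.004})
model.cache.extend(["事实"])                                # the user recently committed "事实"
print(model.choose_and_cache([["事实"], ["实时"]]))          # the cache boosts "事实"
```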
  • FIG. 7 is a schematic flowchart of a language modeling method according to an embodiment of the present invention. See Figure 7, the process includes:
  • Step 701 pre-establishing a standard Ngram language model
  • the standard Ngram language model can be established by referring to the prior art, and the standard Ngram language model is trained using the training corpus according to the maximum likelihood estimation method.
  • Step 702: Receive the user's input content and, according to the pre-established standard Ngram language model, calculate the statement probability of each output statement (i.e. the upper screen probability) for the user's input content.
  • In this step, the user can input content through voice, handwritten characters, optical characters or keyboard keys. When the user starts inputting content, the input is mapped to candidate text by the mapping process of the mapper, and the mapped candidate characters are then processed according to the standard Ngram language model, that is, the kernel computing process for the input content is performed. Calculating the probabilities of the various possible output statements according to the standard Ngram language model is the same as in the prior art and will not be described here.
  • Step 703: Select the output statement with the highest probability for output.
  • The output statement with the highest probability is selected as the output to the user, that is, the output statement with the highest probability is taken as the recognized Chinese sentence; a Chinese sentence may include one or more words.
  • Step 704 correcting the output statement, and outputting the corrected output statement to a preset buffer for buffering
  • The user can check whether the output statement matches his or her own input requirement, and correct it if it does not. For example, suppose the user intends the input sentence to be "this is a fact", but the output statement with the largest probability calculated according to the standard Ngram language model, that is, the input method's recognition result for the user's input, is "this is real time" (the two are near-homophones in Chinese), which does not match the input the user expects.
  • In this case, the user corrects "real time" to "fact" by choosing among the candidates of the input method, and the corrected statement is output to a preset buffer for caching.
  • The content cached in the buffer area is therefore content that has been confirmed by the user.
  • Step 705 Using a cached statement as a training corpus to establish a cache-based language model; in this step, the cache-based language model is based on the stored content in the cache area.
  • the content in the buffer is based on the user's most recent input and can be thought of as a user-specific, small-scale training corpus.
  • Like the standard Ngram language model, the cache-based language model calculates the conditional probabilities between words to describe the statistical characteristics of the user's current input.
  • The probability of a Chinese sentence can be calculated by the following formula:

    P_cache(S) = ∏_{i=1}^{m} P_cache(w_i | w_{i-n+1} … w_{i-1})

  • where P_cache(S) represents the probability value of the Chinese sentence counted according to the cached content in the buffer area;
  • m is the number of words contained in the Chinese sentence;
  • w_i is the i-th word contained in the Chinese sentence;
  • P_cache(w_i | w_{i-n+1} … w_{i-1}) is the conditional probability of the word w_i in the Chinese sentence; n is a preset constant.
  • The user's input has the characteristic of "short-term stability", where "short-term" refers to the time dimension: the user's current input content is related only to what the user has input in the most recent period of time, and has little to do with what the user input long before that.
  • That is to say, the content currently input by the user is generally stable, and the user's current input content is related to the current input topic; after a period of time, once the topic of the user's input has shifted, the user's input content no longer has much to do with the user's previous topic.
  • the words currently entered by the user are most closely related to the words that have recently entered the cache, and are less relevant to words that have entered the cache before a longer time.
  • Therefore, the conditional probability of the current word in the buffer area is not only related to the context words of the current word, but also related to the time at which the current word entered the buffer area.
  • Accordingly, the statement probability calculation formula of the cache-based language model is revised so that the conditional probability of each word also takes a time variable parameter into account.
  • In the revised formula, the statement probability considers the time variable parameter; that is, the conditional probability of the current word is related not only to the context words w_{i-n+1} … w_{i-1}, but also to the time at which the word last entered the buffer.
  • In other words, the conditional probability of each word in the cache-based language model is related not only to the context of the word, but also to the time the word last entered the buffer.
  • The maximum likelihood estimation method of the standard Ngram language model considers only the context words and does not take time information into account, so it cannot be used directly to train the parameters required by the cache-based language model.
  • In order to estimate the conditional probability of words in the cache-based language model, the maximum likelihood estimation method is therefore extended with time information, and the conditional probability is calculated with the formula given above for step 602, in which the count ratio from the cached training corpus is multiplied by the time function:

    P_cache(w_i | w_{i-n+1} … w_{i-1}) = f(t_i) · C_cache(w_{i-n+1} … w_i) / C_cache(w_{i-n+1} … w_{i-1})

  • The time function f(t_i) describes the influence of the time factor on the conditional probability of the statement; it decreases as t_i increases, and its influence is adjusted by a preset constant s.
  • t_i is the time variable parameter, that is, the time interval between the time point at which the word w_i entered the buffer and the time point of the current user input statement.
  • The value of the time variable parameter t_i can be taken as the position of the word in the cache queue. For example, for a word that entered the buffer first, if the word is arranged at the head of the queue with position number 1, the time variable parameter t_i corresponding to that word in the formula takes the value 1.
  • s is a preset constant used to adjust the weight of the time variable parameter information in the conditional probability estimation. It can be seen from the above formula that the earlier the word w_i entered the buffer area, the longer the time interval from the current user input statement and the larger the value of the time variable parameter t_i, so the smaller the value of the time function f(t_i) and the smaller the value of the conditional probability P_cache(w_i | w_{i-n+1} … w_{i-1}); conversely, the later the word entered the buffer, the shorter the time interval from the current user input, the smaller the value of t_i, the larger the value of f(t_i), and the larger the conditional probability.
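  • The text above characterizes f(t_i) only by its behavior (decreasing in t_i, weighted by the constant s). As one assumed example, not the patent's definition, an exponential decay exhibits exactly this behavior:

```latex
% One possible decreasing time function (an assumed example, not the patent's definition):
f(t_i) = e^{-t_i / s}, \qquad s > 0,
\qquad
P_{\mathrm{cache}}(w_i \mid w_{i-n+1} \dots w_{i-1})
  = f(t_i)\,
    \frac{C_{\mathrm{cache}}(w_{i-n+1} \dots w_i)}{C_{\mathrm{cache}}(w_{i-n+1} \dots w_{i-1})}.
```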
  • Step 706 Receive input content of the user, and calculate a statement probability of each output statement according to a pre-established standard Ngram language model and a cache-based language model for the input content of the user;
  • the standard Ngram language model and the newly created cache-based language model form a hybrid model, and the mixed model processes the user input and comprehensively produces the processing result.
  • The linear interpolation method is adopted: the conditional probability P_cache(w_i | w_{i-n+1} … w_{i-1}) in the cache-based language model is merged with the conditional probability P_standard(w_i | w_{i-n+1} … w_{i-1}) in the standard Ngram language model to calculate the fused conditional probability, as in the fusion formula given above.
  • the cache-based language model is continuously established based on the user's current input. On the one hand, it reflects the user's current input scene information, and on the other hand, it reflects the user's own input habits.
  • the standard Ngram language model combined with a cache-based language model, effectively learns and adapts to user input scenarios and input habits.
  • Step 707 selecting an output statement output with the highest probability
  • Step 708 updating the cached statement in the cache according to the output statement.
  • The language modeling method of the embodiments of the present invention can be applied not only to Chinese input methods, but also to input methods for other Asian languages such as Japanese, Korean, and Cambodian; the language modeling method for those languages is similar to that for Chinese and will not be described here.
  • FIG. 8 is a schematic structural diagram of a language modeling apparatus according to an embodiment of the present invention.
  • The device includes: a standard Ngram language model module, a cache module, a cache-based language modeling module, and a hybrid model module, wherein:
  • a standard Ngram language model module for receiving user input, respectively calculating a standard conditional probability of each word in the user input, and outputting to the hybrid model module;
  • The formula used by the standard Ngram language model module for calculating the conditional probability of a word is:

    P_standard(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

  • where C(w_{i-n+1} … w_i) indicates the number of times the word sequence appears in the training corpus of the standard Ngram language model;
  • w_i is the i-th word contained in the Chinese sentence.
  • a cache module configured to cache a statement output by the hybrid model module
  • a cache-based language modeling module configured to calculate a conditional probability of each word in the user input according to a preset cache-based language modeling strategy according to a user input and a cached statement of the cache module, and output the result to the hybrid model module;
  • the formula used by the cache-based language modeling module to calculate the conditional probability of a word is:
  • Pcache(wi | wi-n+1 wi-n+2 ... wi-1) = f(ti) × ccache(wi-n+1 ... wi) / ccache(wi-n+1 ... wi-1)
  • where ccache(wi-n+1 ... wi) indicates the number of times the word sequence wi-n+1 ... wi appears in the cached training corpus;
  • wi is the i-th word contained in the Chinese sentence;
  • n is a preset constant;
  • f(ti) is a time function.
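Combining the cached counts with the time function gives the cache-based estimate; a sketch under the same assumptions as the earlier snippets, including the assumed reciprocal decay:

```python
def cache_conditional_probability(sentence_words, i, cache_counts,
                                  t_i, n=3, s=10.0):
    """P_cache(w_i | history) = f(t_i) * c_cache(history + w_i) / c_cache(history).

    cache_counts holds word-sequence frequencies taken from the cached
    user input (a Counter keyed by word tuples); t_i is the time-variable
    parameter of w_i; f is the assumed reciprocal decay s / (s + t).
    """
    history = tuple(sentence_words[max(0, i - n + 1):i])
    full = history + (sentence_words[i],)
    if cache_counts[history] == 0:
        return 0.0  # empty cache or unseen history
    ratio = cache_counts[full] / cache_counts[history]
    return (s / (s + t_i)) * ratio
```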
  • a hybrid model module configured to calculate the fused conditional probability of each word from its standard conditional probability and cache conditional probability, obtain the sentence probability of each output sentence based on the fused conditional probabilities, and select and output the output sentence with the highest probability.
  • the formula for calculating the fused conditional probability is:
  • Pmixture(wi | wi-n+1 wi-n+2 ... wi-1) = a × P(wi | wi-n+1 wi-n+2 ... wi-1) + (1 - a) × Pcache(wi | wi-n+1 wi-n+2 ... wi-1), where a is an interpolation coefficient, a constant between 0 and 1;
  • the formula for the sentence probability of an output sentence is:
  • P(S) = ∏ (from i = 1 to m) Pmixture(wi | wi-n+1 wi-n+2 ... wi-1), where
  • m is the number of words contained in the Chinese sentence.
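The candidate with the largest product of fused word probabilities is output; the sketch below works in log space purely as an implementation convenience, which is not stated in the text.

```python
import math

def sentence_log_probability(fused_word_probs):
    """Log of P(S) = product over i of P_mixture(w_i | history)."""
    return sum(math.log(max(p, 1e-12)) for p in fused_word_probs)

def select_output(candidates):
    """candidates: list of (sentence, [fused word probabilities]).

    Returns the candidate whose sentence probability is largest.
    """
    return max(candidates, key=lambda c: sentence_log_probability(c[1]))
```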
  • the standard Ngram language model module includes: a first word sequence frequency counting unit, a second word sequence frequency counting unit, and a standard conditional probability calculation unit (not shown), wherein
  • a first word sequence frequency counting unit is configured to obtain the number of times a word sequence consisting of the i-th word and a preset constant number of words preceding the i-th word appears in the training corpus of the standard Ngram language model, and to output it to the standard conditional probability calculation unit;
  • a second word sequence frequency counting unit is configured to obtain the number of times a word sequence consisting of the preset constant number of words preceding the i-th word appears in the training corpus of the standard Ngram language model, and to output it to the standard conditional probability calculation unit;
  • the standard conditional probability calculation unit is configured to calculate the ratio of the former count to the latter count, and to use the calculated ratio as the standard conditional probability of the i-th word in the user input.
  • the cache-based language modeling module includes: a third word sequence frequency counting unit, a fourth word sequence frequency counting unit, a time function value acquiring unit, and a cache conditional probability calculation unit (not shown), wherein
  • a third word sequence frequency counting unit is configured to obtain the number of times a word sequence consisting of the i-th word and a preset constant number of words preceding the i-th word appears in the cached training corpus, and to output it to the cache conditional probability calculation unit;
  • a fourth word sequence frequency counting unit is configured to obtain the number of times a word sequence consisting of the preset constant number of words preceding the i-th word appears in the cached training corpus, and to output it to the cache conditional probability calculation unit;
  • a time function value acquiring unit is configured to obtain the time function value of the i-th word and to output it to the cache conditional probability calculation unit;
  • the cache conditional probability calculation unit is configured to calculate the ratio of the former count to the latter count, and to multiply the calculated ratio by the time function value of the i-th word to obtain the cache conditional probability of the i-th word in the user input.
  • the hybrid model module includes: an interpolation coefficient storage unit, a first product unit, a second product unit, a fusion conditional probability calculation unit, a sentence probability calculation unit, and an output sentence selection unit (not shown), wherein
  • an interpolation coefficient storage unit is configured to store an interpolation coefficient preset between 0 and 1;
  • a first product unit is configured to calculate, according to the interpolation coefficient stored by the interpolation coefficient storage unit, the product of the interpolation coefficient and the standard conditional probability of the i-th word, and to output it to the fusion conditional probability calculation unit;
  • a second product unit is configured to calculate the product of the difference between 1 and the interpolation coefficient and the cache conditional probability of the i-th word, and to output it to the fusion conditional probability calculation unit;
  • the fusion conditional probability calculation unit is configured to add the received products related to the i-th word and to use the sum as the fused conditional probability of the i-th word;
  • the sentence probability calculation unit is configured to multiply in sequence the fused conditional probabilities of the words obtained by the fusion conditional probability calculation unit to obtain the sentence probability of the output sentence;
  • the output sentence selection unit is configured to select the maximum sentence probability calculated by the sentence probability calculation unit and to output the output sentence corresponding to the maximum sentence probability.
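The division of labour among these units can be mirrored in a small class; the sketch below is an organisational illustration only, with unit roles noted in comments and all names assumed.

```python
class HybridModelModule:
    """Mirrors the units described above (names are illustrative)."""

    def __init__(self, a=0.7):
        self.a = a  # interpolation coefficient storage unit

    def fuse(self, p_standard, p_cache):
        # first product unit + second product unit + fusion unit
        return self.a * p_standard + (1.0 - self.a) * p_cache

    def sentence_probability(self, word_probs):
        # sentence probability calculation unit: product over all words
        prob = 1.0
        for p in word_probs:
            prob *= p
        return prob

    def select(self, candidates):
        # output sentence selection unit: highest sentence probability wins
        # candidates: list of (sentence, [fused word probabilities])
        return max(candidates,
                   key=lambda c: self.sentence_probability(c[1]))
```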
  • the language modeling method and language modeling apparatus of the embodiment of the present invention cache the user input, so that the cached user input is related to the user's input history and input scenario; the language model established on the basis of the cache
  • therefore has a self-learning capability, which improves the intelligence of the language model;
  • in addition, by learning and adapting to each user's input habits, the human-computer interaction software can adapt to different user groups and application scenarios. Specifically, the following beneficial technical effects are obtained:
  • first, the invention improves the performance of the language model, can meet the needs of different users for Chinese input, and improves prediction accuracy; it can be applied to fields such as speech recognition, handwritten character recognition, Chinese keyboard input methods, and optical character recognition, improving the accuracy of the corresponding systems;
  • second, on this basis an information retrieval system based on the language model can be established, improving the performance of the information retrieval system, for example, its accuracy and recall rate.
  • the embodiments described above are specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a number of variations and improvements may be made by those skilled in the art without departing from the concept of the invention, and all of these fall within the protection scope of the invention. Therefore, the protection scope of this patent should be determined by the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A text input method comprises the following steps: obtaining a user identifier, and looking up the corresponding user language model according to the user identifier; obtaining user input, and generating a candidate sentence list according to the user input; obtaining a universal language model, and calculating the on-screen probability of the candidate sentences in the candidate sentence list according to the user language model and the universal language model; sorting the candidate sentences in the candidate sentence list in order of their on-screen probability; and outputting the sorted candidate sentence list. The above text input method can improve both the accuracy and the speed of text input. A text input system and apparatus are also provided.
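As a rough illustration of the claimed flow, the sketch below strings the abstract's steps together; the model interfaces (candidates, score) and the sentence-level interpolation used to combine the user language model with the universal language model are simplifying assumptions for illustration, since the description combines the models at the level of per-word conditional probabilities.

```python
def rank_candidates(user_id, user_input, user_models, universal_model, a=0.5):
    """Return candidate sentences sorted by on-screen probability.

    user_models:     mapping from user identifier to that user's language model
    universal_model: the shared general-purpose language model
    Both models are assumed to expose candidates() and score() methods.
    """
    user_model = user_models.get(user_id)                 # look up by user identifier
    candidates = universal_model.candidates(user_input)   # candidate sentence list
    scored = []
    for sentence in candidates:
        p_general = universal_model.score(sentence)
        p_user = user_model.score(sentence) if user_model else 0.0
        scored.append((sentence, a * p_general + (1 - a) * p_user))
    scored.sort(key=lambda x: x[1], reverse=True)         # sort by on-screen probability
    return scored
```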

Description

文字输入方法、 装置及系统
技术领域
本发明实施方式涉及文字输入领域, 特别涉及一种文字输入方法、 装置及系统。 发明背景
输入法软件是一种常见的文字输入系统, 通常的操作流程为: 输入 法软件接收用户通过键盘输入的代码序列 (如拼音或五笔等 ), 然后将 代码序列作为参数, 利用通用语言模型找出与代码序列对应的候选语句 序列, 并计算出每个候选语句在候选语句序列中的上屏概率, 然后根据 上屏概率的大小将候选语句序列排序, 最后将候选语句序列展现给用 户。 用户只需要在候选语句序列中选出想要的词语即可完成输入。
传统的文字输入方法, 一般采用通用语言模型构建输入法的核心, 这种通用语言模型是通过对大规模训练语料统计分析后得到的, 大规模 训练语料通常从互联网上自动获取, 代表了大多数用户的一般性输入需 求, 即这种通用语言模型根据大多数人输入文字时的具有普遍性的选词 习惯建立。 而用户在使用输入法软件输入文字时, 往往希望能够快速获 取自己常用以及习惯性使用的文字, 每个用户在进行选词时, 由于身 份不一样, 兴趣爱好和文字输入的领域不一样, 所希望排序靠前的候选 语句序列也不一样。 例如, 科研工作者和银行职员在输入文字时, 往往 希望自己领域的专业术语排在最前面。 再例如, 东北人和四川人在输入 文字时, 也往往希望自己的方言词汇能排在候选语句序列的前列。 而传 统的这种仅采用通用语言模型的文字输入方法并不能满足不同用户的 输入需求, 使得输入的准确率不高, 从而影响用户输入文字的速度。 而且, 在现有技术中, 标准 Ngram语言模型建模方法存在明显的缺 点, 一方面, 标准 Ngram语言模型是单一的模型, 而实际应用中, 用户 的汉语输入、 手写识别、语音识别等需求是多变的、 也是无限的, 例如, 用户有时需要撰写技术报告, 有时在网上聊天, 在这两种情境下, 用户 的汉语输入需求是不同的; 再例如, 不同年龄段的用户, 由于生活经历 的不同, 说话习惯存在很大不同, 反映在汉语输入上, 就是这些人群经 常输入的内容差别很大。 因而, 单一模型无法满足不同年龄段的用户、 以及同一用户在不同输入场景下对汉语输入的不同需求, 不同的输入需 求采用同一模型, 使得对用户不同需求的输入, 影响了识别的准确性; 另一方面, 标准 Ngram语言模型本身没有自动学习的机制, 标准 Ngram 语言模型中的参数一经训练便被确定下来, 无法根据用户的输入习惯进 行学习和智能调整, 使得对用户输入的识别准确率较低。 发明内容
本发明实施方式提供一种文字输入方法, 以提高文字输入速度。 本发明实施方式还提供一种文字输入装置, 以提高文字输入的识别 准确率。
本发明实施方式还提供一种文字输入系统, 以提高文字输入速度。 一种文字输入方法, 包括以下步骤:
获取用户标识, 根据用户标识查找对应的用户语言模型; 获取用户 输入, 根据所述用户输入生成候选语句列表; 获取通用语言模型, 根据 所述用户语言模型和通用语言模型计算所述候选语句列表中的候选语 句的上屏概率; 或
根据通用语言模型, 分别计算用户输入中各词语的标准条件概率; 按照预先设置的基于緩存的语言建模策略, 根据所述用户输入以及预先 緩存的用户输入, 分别计算所述用户输入中各词语的緩存条件概率; 根 据各词语的标准条件概率以及緩存条件概率计算融合条件概率, 基于融 合条件概率获取各候选语句的上屏概率;
按照所述上屏概率的大小顺序对所述候选语句列表中的候选语句进 行排序;
输出排序后的候选语句列表。
一种文字输入方法, 包括以下步骤:
客户端获取用户标识, 根据用户标识从服务器查找对应的用户语言 模型;
所述客户端获取用户输入, 将所述用户输入上传到服务器, 所述服 务器根据所述用户输入生成候选语句列表;
所述服务器获取通用语言模型, 根据所述用户语言模型和通用语言 模型计算所述候选语句列表中的候选语句的上屏概率;
所述服务器按照所述上屏概率的大小顺序对所述候选语句列表中的 候选语句进行排序, 将排序后的候选语句列表下发到所述客户端; 所述客户端接收所述排序后的候选语句列表并输出。
一种文字输入方法, 包括以下步骤:
客户端获取用户标识, 根据用户标识在自身查找对应的用户语言模 型;
所述客户端获取用户输入, 并根据用户输入生成候选语句列表; 所述客户端自身获取通用语言模型, 根据所述用户语言模型和通用 语言模型计算所述候选语句列表中的候选语句的上屏概率;
所述客户端按照所述上屏概率的大小顺序对所述候选语句列表中的 候选语句进行排序, 并输出排序后的候选语句列表。
一种文字输入系统, 包括: 查找模块, 用于获取用户标识, 根据用户标识查找对应的用户语言 模型;
候选语句列表生成模块, 用于获取用户输入, 根据所述用户输入生 成候选语句列表;
概率计算模块, 用于根据所述用户语言模型和通用语言模型生成所 述候选语句列表中的候选语句的上屏概率;
排序模块, 用于按照所述上屏概率的大小顺序对所述候选语句列表 中的候选语句进行排序;
输出模块, 用于输出排序后的候选语句列表。
一种文字处理系统, 包括客户端和服务器, 其中:
客户端, 用于获取用户标识, 根据用户标识从服务器查找对应的用 户语言模型; 获取用户输入, 将所述用户输入上传到服务器; 接收由服 务器排序后的候选语句列表并输出;
服务器, 用于根据所述用户输入生成候选语句列表, 获取通用语言 模型, 根据所述用户语言模型和通用语言模型计算所述候选语句列表中 的候选语句的上屏概率, 按照所述上屏概率的大小顺序对所述候选语句 列表中的候选语句进行排序, 并将排序后的候选语句列表下发到所述客 户端。
一种文字处理装置, 其特征在于, 该装置包括: 通用语言模型模块、 緩存模块、 基于緩存的语言建模模块以及混合模型模块, 其中,
通用语言模型模块, 用于接收用户的输入, 分别计算用户输入中各 词语的标准条件概率, 输出至混合模型模块;
緩存模块, 用于緩存混合模型模块输出的语句;
基于緩存的语言建模模块, 用于按照预先设置的基于緩存的语言建 模策略, 根据用户的输入以及緩存模块緩存的语句, 分别计算用户输入 中各词语的緩存条件概率, 输出至混合模型模块;
混合模型模块, 用于根据各词语的标准条件概率以及緩存条件概率 计算融合条件概率, 基于融合条件概率获取各输出语句的语句概率, 选 择概率最大的输出语句输出。
上述文字输入方法、 装置及系统, 结合了用户语言模型和通用语言 模型, 由于用户语言模型可根据用户输入进行训练得到, 使得排序后的 候选语句列表中排序靠前的候选语句更符合用户的语言习惯, 使得用户 能够更快的获取到所需要的候选语句, 提高了文字输入的准确率, 也提 高了文字输入速度。 附图简要说明 将对本发明实施方式或现有技术描述中所需要使用的附图作筒单地介 绍, 显而易见地, 下面描述中的附图仅仅是本发明的一些实施方式, 对 于本领域普通技术人员来讲, 在不付出创造性劳动性的前提下, 还可以 根据这些附图获得其他的附图。
图 1为一个实施方式中文字输入方法的流程示意图;
图 2为另一个实施方式中文字输入方法的流程示意图;
图 3为另一个实施方式中文字输入方法的流程示意图;
图 4为一个实施方式中文字输入系统的结构示意图;
图 5为另一个实施方式中文字输入系统的结构示意图;
图 6为本发明实施例的语言建模方法流程示意图。
图 7为本发明实施例的语言建模方法具体流程示意图;
图 8为本发明实施例的语言建模装置结构示意图。 实施本发明的方式
在一个实施方式中, 如图 1所示, 一种文字输入方法, 包括以下步 骤:
步骤 S102, 获取用户标识, 根据所述用户标识查找对应的用户语言 模型。
用户标识用于唯一标识用户, 可以是用户在输入法软件上注册的帐 号、 为用户分配的标识号码、 以及与用户所使用的设备关联的 IP地址、 MAC地址等。
在一个实施方式中,在步骤 S102之前需建立与用户标识对应的用户 语言模型, 在每次用户输入词条后则根据用户输入的词条信息更新用户 语言模型。 由于用户语言模型是根据用户输入的词条信息进行训练得到 的, 符合用户个人的语言习惯。 训练得到用户语言模型后, 可以将用户 语言模型存储在本地, 也可以上传到服务器中存储。
步骤 S104, 获取用户输入, 根据用户输入生成候选语句列表。
用户输入可以是语音、 手写体、 光学字符或字符串等, 可采用传统 的文字输入方法从词库中找到与用户输入匹配的候选语句, 生成候选语 句列表。
步骤 S106, 获取通用语言模型, 根据用户语言模型和通用语言模型 计算候选语句列表中的候选语句的上屏概率。
通用语言模型可以是传统的统计语言模型, 通过对大规模训练语料 进行统计分析得到, 所述大规模训练语料可通过互联网从大量用户输入 的语句中获取。 用户语言模型是与用户个人对应的, 不同的用户所对应 的用户语言模型不同。 通用语言模型可以存储在服务器, 也可以存储在 客户端。
用户语言模型根据用户输入进行训练得到, 应当说明的是, 对于使 用输入法软件进行首次输入时, 由于用户语言模型未更新, 则可以仅采 用通用语言模型计算候选语句列表的候选语句的上屏概率, 其方法原理 与传统的采用通用语言模型的输入方法相同, 在此则不再赘述。
在用户每次使用输入法软件输入文字后, 记录用户输入的词条, 根 据用户输入的词条信息更新语言模型, 用户语言模型与用户标识进行对 应存储。 在下一次输入文字时, 则可采用所建立的用户语言模型和通用 语言模型一起用于计算候选语句列表中的候选语句的上屏概率。
在一个实施方式中, 通用语言模型和用户语言模型一起存储在本地 客户端中, 则可直接从本地客户端中获取到用户语言模型和通用语言模 型, 用于计算候选语句列表中的候选语句的上屏概率。 该实施方式中, 客户端不需要向服务器发送任何请求, 该方法也称为 "本地输入法"。
在另一个实施方式中,通用语言模型和用户语言模型存储在服务器, 服务器获取通用语言模型和用户语言模型, 用于计算候选语句列表中的 候选语句的上屏概率, 该实施方式中, 输入法的处理过程都交由服务器 来执行, 也称为 "云输入法"。
步骤 S108 ,按照上屏概率的大小顺序对候选语句列表中的候选语句 进行排序。
本实施方式中, 按照上屏概率从大到小的顺序对候选语句列表中的 候选语句进行排序, 排序越靠前的候选语句就越符合用户的语言习惯, 更可能为用户所需求, 因此用户可以更加快速的选择到所需要的候选语 句, 提高了文字输入的准确率, 也提高了文字输入的速度。
步骤 S110, 输出排序后的候选语句列表。
用户可以从优选词列表中选择所需候选语句, 所选择的候选语句从 输入法软件输出到不同应用程序中, 如文本文件、 记事本、 演示文档中 等。 在一个实施方式中, 步骤 S110的具体过程为:输出上屏概率最大的 候选语句, 该上屏概率最大的候选语句位于输出列表的最前位置, 用户 可以快速选择到该上屏概率最大的候选语句。
在另一个实施方式中, 步骤 S110的具体过程为: 输出采用本地输入 法处理得到的上屏概率最大的第一候选语句, 以及输出采用云输入法处 理得到的上屏概率最大的第二候选语句, 在输出列表中输出第一候选语 句和第二优选语句, 并且第一候选语句的排序最靠前, 第二候选语句排 序在第一候选语句后面。 这样, 用户可以快速选择两种输入法得到的上 屏概率最大的候选语句。
在一个实施方式中, 上述文字输入方法还包括建立与用户标识对应 的用户语言模型并在每次用户输入词条后根据用户输入的词条信息更 新用户语言模型的步骤。 具体的, 建立与用户标识对应的用户词库, 在 每次用户输入词条后, 将用户输入的词条信息和词频信息加入到用户词 库中。 更新用户语言模型时, 从用户词库中获取词条信息和词频信息, 对词条进行分词,根据原有词条的词频,对分词后的词条进行词频整理, 根据分词后的词条和整理后的词频更新用户语言模型。 其中, 词频为词 条在用户词库中出现的次数。
在一个实施方式中, 用户语言模型与通用语言模型采用相同的语言 模型, 比如可以采用 Ngram语言模型建模, 但训练集合是不相同的。 用 户语言模型的训练集合是用户词库中的所有词语序列集合, 与某一个用 户对应; 通用语言模型的训练集合是大量用户输入的词语序列集合, 可 通过互联网获取。
其中, 用户语言模型的概率计算公式为:
P USer(S =Yl 1 其中, s ^为包含 m个词语的语句 S =WiW2... Wm的概率; 语句 S 由词语序歹 ll ww2...wm组成, 其中, 为语句 S中的词语, 语句 S由 个词语组成, 例如"你今天吃饭了么"可分解为" /你 /今天 /吃饭 /了 /么";
D 1 wi-n+i-w^可采用最大似然方法进行概率统计, 计算公式为:
Figure imgf000010_0001
其中, 表示词语序列 «+1·Ί^·在训练集合中出现的次 数, 表示词语序列 Wn "在训练集合中出现的次数。 训练 集合是用户词库中是所有词语序列集合。
在一个优选的实施方式中, 用户语言模型采用更低阶的语言模型, 例如 Unigram语言模型,其相对于 Ngram语言模型所占用的存储空间更 小, 特别适用于在移动终端上使用。 本实施方式中, 用户语言模型的概 率计算公式为:
Figure imgf000010_0002
其中, P user (S)为包含 m个词语的语句
Figure imgf000010_0003
.. Wm的概率。
在另一个优选的实施方式中,用户语言模型还可采用 Bigram语言模 型, 该语言模型相对于上述两种语言模型, 其建模的速度更快, 特别适 用于云输入法中。
本实施方式中, 用于语言模型的概率计算公式为: i=l
其中, Us、为包含 m个词语的语句 S=ww2...w„的概率; P- ) 表示语句 S被分词为两个词语 ¼^和 Ww, P^ ^i-i)的计算公式为
p(w. \Wi i)= 其中, 表示语句 S在训练集合中出现的次数, C(W'- 表示词 语 WW在训练集合中出现的次数。
在一个实施方式中, 根据用户语言模型和通用语言模型计算候选语 句列表中的候选语句的上屏概率的步骤具体为: 对用户语言模型和通用 语言模型进行线性插值, 生成混合模型, 根据混合模型计算候选语句列 表中的候选语句的上屏概率。
本实施方式中, 通用语言模型可以采用传统的 Ngram语言模型, 则 将用户语言模型中的条件概率与通用语言模型中的条件概率进行融合, 计算融合后的条件概率, 其计算公式为:
P ) = a x P (wi I Wi_n+1 · ..w^ ) + (l- a) x Puser {w{ I w{_n+l · ..w^ ) 其中, υ ^· ·-»+1 ···^ )表示融合后的条件概率, Ρ · —„+1·Ί 表 示通用语言模型的条件概率, I W7'-wW'-i )表示用户语言模型的条件 概率, "为插值系数, 取值在 0到 1之间。
根据融合后的条件概率, 生成的混合模型为:
其中, pO为包含 m个词语的语句 S =WiW2. . . WOT的概率
候选语句列表中的候选语句的上屏概率为混合模型计算得到的候选 语句可能被用户选择的概率。 上屏概率越大, 则候选语句在候选语句列 表中排序越靠前, 用户则能够快速选择到所需要的语句, 提高了文字输 入速度。
在一个实施方式中, 如图 2所示, 提出了一种文字输入方法, 包括 以下步骤:
步骤 S202, 客户端获取用户标识, 根据用户标识从服务器上查找对 应的用户语言模型。 用户标识用于唯一标识用户, 可以是用户在输入法软件上注册的帐 号、 为用户分配的标识号码、 以及与用户所使用的设备关联的 IP地址、 MAC 地址等。 用户进行身份验证后登录到输入法软件, 客户端获取到 用户标识, 将用户标识上传到服务器, 由服务器查找对应的用户语言模 型。
在一个实施方式中, 事先在服务器上建立与用户标识对应的用户语 言模型, 每次用户输入词条后, 服务器获取用户输入的词条信息并根据 用户输入的词条信息来更新用户语言模型。 由于用户语言模型对应用户 标识在服务器上存储, 服务器上的用户语言模型可以根据用户输入进行 不断更新, 因此服务器上的用户语言模型越来越精确, 用户在不同的客 户端上使用输入法软件时, 服务器将最新的用户语言模型下发到客户 端, 因此能够实现用户语言模型的同步, 适用于不同的终端设备。
步骤 S204, 客户端获取用户输入, 将用户输入上传到服务器, 服务 器根据用户输入生成候选语句列表。
用户输入可以是语音、 手写体、 光学字符或字符串等, 客户端将用 户输入上传到服务器, 由服务器采用传统的文字输入方法从词库中找到 与用户输入匹配的候选语句, 生成候选语句列表。 将文字输入方法的处 理交由服务器来执行, 这种文字输入法也称为"云输入法"。
步骤 S206, 服务器获取通用语言模型, 根据用户语言模型和通用语 言模型计算候选语句列表中的候选语句的上屏概率。
通用语言模型可以是传统的统计语言模型, 通过对大规模训练语料 进行统计分析得到, 大规模训练语料可通过互联网从大量用户输入的语 句中获取。 用户语言模型是与用户个人对应的, 不同的用户所对应的用 户语言模型不同。
用户语言模型根据用户输入进行训练得到, 应当说明的是, 对于使 用输入法软件进行首次输入时, 由于用户语言模型未更新, 则仅采用通 用语言模型计算候选语句列表的候选语句的上屏概率, 其方法原理与传 统的采用通用语言模型的输入方法相同, 在此则不再赘述。
在用户每次使用输入法软件输入文字后, 记录用户输入的词条, 根 据用户输入的词条信息更新用户语言模型, 用户语言模型与用户标识进 行对应存储, 在下一次输入文字时, 则可采用所建立的用户语言模型和 通用语言模型一起用于计算候选语句列表中的候选语句的上屏概率。
在一个实施方式中, 上述文字输入方法还包括在服务器上建立与用 户标识对应的用户语言模型并在每次用户输入词条后根据用户输入的 词条信息更新用户语言模型的步骤。 具体的, 在服务器上建立与用户标 识对应的用户词库, 在每次用户输入词条后, 将用户输入的词条信息和 词频信息加入到用户词库中。 更新用户语言模型时, 从用户词库中获取 词条信息和词频信息, 对词条进行分词, 根据原有词条的词频, 对分词 后的词条进行词频整理, 根据分词后的词条和整理后的词频更新用户语 言模型。 其中, 词频为词条在用户词库中出现的次数。
在一个实施方式中,用户语言模型可采用 Bigram语言模型, 其建模 方法如上所述, 在此则不再赘述。
在一个实施方式中, 服务器根据用户语言模型和通用语言模型计算 候选语句列表中的候选语句的上屏概率的步骤具体为: 服务器对用户语 言模型和通用语言模型进行线性插值, 生成混合模型, 根据混合模型计 算候选语句列表中的候选语句的上屏概率。
候选语句列表中的候选语句的上屏概率为混合模型计算得到的候选 语句可能被用户选择的概率。 上屏概率越大, 则候选语句在候选语句列 表中排序越靠前, 用户则能够快速选择到所需要的语句, 提高了文字输 入速度。 步骤 S208,服务器按照上屏概率的大小顺序对候选语句列表中的候 选语句进行排序, 将排序后的候选语句列表下发到客户端。
步骤 S210, 客户端接收排序后的候选语句列表并输出。 用户可以从 优选词列表中选择所需候选语句, 所选择的候选语句从输入法软件输出 到不同应用程序中, 如文本文件、 记事本、 演示文档中等。
在一个实施方式中, 还提出了一种文字输入方法。
图 3为另一个实施方式中文字输入方法的流程示意图。如图 3所示, 包括以下步骤:
步骤 S202: 客户端获取用户标识, 根据用户标识在自身查找对应的 用户语言模型。
用户标识用于唯一标识用户, 可以是用户在输入法软件上注册的帐 号、 为用户分配的标识号码、 以及与用户所使用的设备关联的 IP地址、 MAC地址等。 用户进行身份验证后登录到输入法软件, 客户端获取到 用户标识, 根据用户标识在自身查找对应的用户语言模型。
步骤 S204: 客户端获取用户输入, 并根据用户输入生成候选语句列 表。
用户输入可以是语音、 手写体、 光学字符或字符串等。 通用语言模 型和用户语言模型一起存储在本地客户端中, 则可直接从本地客户端中 获取到用户语言模型和通用语言模型, 用于计算候选语句列表中的候选 语句的上屏概率。该实施方式中,客户端不需要向服务器发送任何请求, 该方法也称为 "本地输入法"。
步骤 S206: 客户端自身获取通用语言模型, 根据所述用户语言模型 和通用语言模型计算所述候选语句列表中的候选语句的上屏概率。
通用语言模型可以是传统的统计语言模型, 通过对大规模训练语料 进行统计分析得到, 大规模训练语料可通过互联网从大量用户输入的语 句中获取。 用户语言模型是与用户个人对应的, 不同的用户所对应的用 户语言模型不同。
用户语言模型根据用户输入进行训练得到, 应当说明的是, 对于使 用输入法软件进行首次输入时, 由于用户语言模型未更新, 则仅采用通 用语言模型计算候选语句列表的候选语句的上屏概率, 其方法原理与传 统的采用通用语言模型的输入方法相同, 在此则不再赘述。
在用户每次使用输入法软件输入文字后, 记录用户输入的词条, 根 据用户输入的词条信息更新用户语言模型, 用户语言模型与用户标识进 行对应存储, 在下一次输入文字时, 则可采用所建立的用户语言模型和 通用语言模型一起用于计算候选语句列表中的候选语句的上屏概率。
在一个实施方式中,用户语言模型可采用 Bigram语言模型, 其建模 方法如上所述, 在此则不再赘述。
在一个实施方式中, 客户端根据用户语言模型和通用语言模型计算 候选语句列表中的候选语句的上屏概率的步骤具体为: 客户端对用户语 言模型和通用语言模型进行线性插值, 生成混合模型, 根据混合模型计 算候选语句列表中的候选语句的上屏概率。
候选语句列表中的候选语句的上屏概率为混合模型计算得到的候选 语句可能被用户选择的概率。 上屏概率越大, 则候选语句在候选语句列 表中排序越靠前, 用户则能够快速选择到所需要的语句, 提高了文字输 入速度。
步骤 S208: 客户端按照所述上屏概率的大小顺序对所述候选语句列 表中的候选语句进行排序, 并输出排序后的候选语句列表。
在一个实施方式中, 如图 4所示, 一种文字输入系统, 包括查找模 块 102、 候选语句列表生成模块 104、 概率计算模块 106、 排序模块 108 和输出模块 110, 其中: 查找模块 102用于获取用户标识, ^据所述用户标识查找对应的用 户语言模型。
用户标识用于唯一标识用户, 可以是用户在输入法软件上注册的帐 号、 为用户分配的标识号码、 以及与用户所使用的设备关联的 IP地址、 MAC地址等。
在一个实施方式中, 如图 5所示, 上述文字输入系统还包括用户语 言模型建立模块 112和用户语言模型更新模块 114, 其中:
用户语言模型建立模块 112用于建立与用户标识对应的用户语言模 型。
用户语言模型建立模块 112可位于客户端也可位于服务器, 所建立 的用户语言模型可存储在客户端, 也可存储在服务器。
用户语言模型更新模块 114用于在每次用户输入词条后根据用户输 入的词条信息更新用户语言模型。
用户语言模型更新模块 114可位于客户端也可位于服务器, 更新后 的用户语言模型可存储在客户端, 也可由客户端上传到服务器进行存 储。 这样, 服务器上的用户语言模型可以根据用户输入进行不断更新, 因此服务器上的用户语言模型越来越精确, 用户在不同的客户端上使用 输入法软件时, 服务器将最新的用户语言模型下发到客户端, 因此能够 实现用户语言模型的同步, 适用于不同的终端设备。
候选语句列表生成模块 104用于获取用户输入, 根据用户输入生成 候选语句列表。
用户输入可以是语音、 手写体、 光学字符或字符串等, 可采用传统 的文字输入方法从词库中找到与用户输入匹配的候选语句, 生成候选语 句列表。
在一个实施方式中, 候选语句列表生成模块 104可位于服务器端, 由服务器采用传统的文字输入方法从词库中找到与用户输入匹配的候 选语句,生成候选语句列表。将文字输入方法的处理交由服务器来执行, 这种文字输入法也称为"云输入法"。
概率计算模块 106用于获取通用语言模型, 根据用户语言模型和通 用语言模型计算候选语句列表中的候选语句的上屏概率。
通用语言模型可以是传统的统计语言模型, 通过对大规模训练语料 进行统计分析得到, 大规模训练语料可通过互联网从大量用户输入的语 句中获取。 用户语言模型是与用户个人对应的, 不同的用户所对应的用 户语言模型不同。
用户语言模型根据用户输入进行训练得到, 应当说明的是, 对于使 用输入法软件进行首次输入时, 由于用户语言模型未更新, 则仅采用通 用语言模型计算候选语句列表的候选语句的上屏概率, 其方法原理与传 统的采用通用语言模型的输入方法相同, 在此则不再赘述。
在用户每次使用输入法软件输入文字后, 记录用户输入的词条, 根 据用户输入的词条信息更新语言模型, 用户语言模型与用户标识进行对 应存储, 在下一次输入文字时, 则可采用所建立的用户语言模型和通用 语言模型一起用于计算候选语句列表中的候选语句的上屏概率。
排序模块 108用于按照上屏概率的大小顺序对候选语句列表中的候 选语句进行排序。
本实施方式中, 按照上屏概率从大到小的顺序对候选语句列表中的 候选语句进行排序, 排序越靠前的候选语句就越符合用户的语言习惯, 更可能为用户所需求, 因此用户可以更加快速的选择到所需要的候选语 句, 提高了文字输入的准确率, 也提高了文字输入的速度。
输出模块 110用于输出排序后的候选语句列表。
用户可以从优选词列表中选择所需候选语句, 所选择的候选语句从 输入法软件输出到不同应用程序中, 如文本文件、 记事本、 演示文档中 等。
在一个实施方式中, 用户语言模型更新模块 114用于记录用户输入 的词条信息和词频信息, 获取所述词条信息和词频信息, 对词条进行分 词, 根据所述词频信息对分词后的词条进行词频整理, 根据分词后的词 条和整理后的词频更新用户语言模型。 其中, 词频为词条在用户词库中 出现的次数。
在一个实施方式中, 用户语言模型与通用语言模型采用相同的语言 模型, 即采用 Ngram语言模型建模, 但训练集合是不相同的, 用户语言 模型的训练集合是用户词库中的所有词语序列集合, 与某一个用户对 应, 通用语言模型的的训练集合是大量用户输入的词语序列集合, 可通 过互联网获取。
其中, 用户语言模型的概率计算公式为:
Figure imgf000018_0001
其中, p r (^)为包含 m个词语的语句 S= w,W2... wm的概率; 语句 S 由词语序歹 ll S=w w2...wm组成, 其中, 为语句 S中的词语,语句 S由 个词语组成, 例如"你今天吃饭了么"可分解为" /你 /今天 /吃饭 /了 /么";
Ρ · -„+1^;- 可采用最大似然方法进行概率统计, 计算公式为:
Figure imgf000018_0002
其中, c -w - )表示词语序列 Wn^- 在训练集合中出现的次 数, 表示词语序列 Wn "在训练集合中出现的次数。 训练 集合是用户词库中是所有词语序列集合。
在一个优选的实施方式中, 用户语言模型采用更低阶的语言模型, 例如 Unigram语言模型,其相对于 Ngram语言模型所占用的存储空间更 小, 特别适用于在移动终端上使用。 本实施方式中, 用户语言模型的概 率计算公式为:
'■=1
其中, p user (s)为包含 m个词语的语句
Figure imgf000019_0001
在另一个优选的实施方式中,用户语言模型还可采用 Bigram语言模 型, 该语言模型相对于上述两种语言模型, 其建模的速度更快, 特别适 用于云输入法中。 本实施方式中, 用于语言模型的概率计算公式为:
Figure imgf000019_0002
其中, p usJ.s、为包含 m个词语的语句 =ν ,·ν 2...ν„的概率; 1 表示语句 S被分词为两个词语 ¼^和 ww, ρ^ \^-ι)的计算公式为: p(w. \Wi i)= 其中, "^1^表示语句 S在训练集合中出现的次数, c(w'- 表示词 语 ww在训练集合中出现的次数。
在一个实施方式中, 上屏概率生成模块 106用于对用户语言模型和 通用语言模型进行线性插值, 生成混合模型, 根据混合模型计算候选语 句列表中的候选语句的上屏概率。
本实施方式中, 通用语言模型采用传统的 Ngram语言模型, 则将用 户语言模型中的条件概率与通用语言模型中的条件概率进行融合, 计算 融合后的条件概率, 其计算公式为:
P ) = axP(wi I Wi_n+1 · ..w^ ) + (l-a)x Puser {w{ I w{_n+l · ..w^ ) 其中, U w-w ")表示融合后的条件概率, Ρ · —„+1··Ί 表 示通用语言模型的条件概率, 表示用户语言模型的条件 概率, "为插值系数, 取值在 0到 1之间。
根据融合后的条件概率, 生成的混合模型为:
Figure imgf000020_0001
其中, ρΟ为包含 m个词语的语句 S =WiW2. . . Wm的概率
候选语句列表中的候选语句的上屏概率为混合模型计算得到的候选 语句可能被用户选择的概率。 上屏概率越大, 则候选语句在候选语句列 表中排序越靠前, 用户则能够快速选择到所需要的语句, 提高了文字输 入速度。
本发明实施方式还提出了一种文字输入方法和装置, 下面进行详细 描述。
目前最常用的语言建模方法包括统计语言模型建模方法以及 Ngram 语言模型建模方法, 下面进行筒要说明。
统计语言模型以概率论和数理统计理论为基础, 用来计算汉语语句 的概率, 使得输出的正确语句的概率大于错误语句的概率。 例如, 对于 汉语输入的汉语语句 "说明此处汉语语句的概率", 在统计语言模型中, 该汉语语句可以分解为若干个词语, 如: 说明 \此处..., 对于一个包含 ( m为自然数)个词的汉语语句
Figure imgf000020_0002
, 根据 Bayes理论, 该汉语 语句概率(输出正确的概率)可以分解为包含多个词语的条件概率的乘 积, 即:
P(S) = P(wlw2 ... wm ) = U P(.wi I wiw2 · · · wi-i ) 式中, 为汉语语句中包含的第 个词语;
piw w^ .. ^ )为词语 W在该汉语语句 … 中的奈件概率。 由上述公式可见, 条件概率 P /W^…"; 的参数空间随着变量''的 增加呈指数级增长, 当变量 较大时, 以现有训练语料的规模, 还无法 准确地估计出概率 ^1^ 1^…1^)的值, 训练语料是指采用统计的方法 从大规模训练文本中, 按照一定的类别进行组织形成的有序文本集合, 训练语料可以由计算机执行规模处理。 因而, 目前实用化的语言模型建 模方法中, 均对条件概率 Ρ^'^ι1^··1^)进行了不同程度的筒化, 提出 了标准 Ngram语言模型建模方法。
标准 Ngram语言模型是目前最常用的统计语言模型。它将汉语语句 看作是一个马尔科夫序列, 满足马尔科夫属性。 具体来讲, 标准 Ngram 语言模型对统计语言模型中的条件概率 ^ 1^…1^)作如下基本假 设:
( 1 )有限历史假设: 当前输入语句中词语的条件概率仅仅与它前 "-1个词相关, 而与整个汉语语句无关, 其中, "为预先设置的自然数;
( 2 )时齐性 4叚设: 当前词语的条件概率与它在汉语语句句子中出现 的位置无关。
基于上述两个假设,标准 Ngram语言模型的语句概率计算公式可以 简化为:
Figure imgf000021_0001
I w;_„+1u._„†2... )
i=l
可见, 基于上述两个假设, 标准统计语言模型中的条件概率
/H /VV …!^ 被简化成了标准 Ngram 语言模型中的条件概率
Ρ ^Λ^···^) , 新概率的计算公式中, 与当前词语相关的历史词 语的个数固定为常数 " - 1 ,而不是标准统计语言模型中的变数 - 1。这样, 整体降低了语言模型参数空间的大小, 使得在现有训练语料的基础上, 能够正确地估计出 Ngram概率的值, 从而使得标准 Ngram语言模型可 以实用化。
在标准 Ngram语言模型中, 条件概率 7 ^^'^…^)的值采用 最大似然估计的方法进行估计, 估计公式如下:
P I · · · ) =― Γ 式中, (语句中的一部分词
Figure imgf000022_0001
语)在标准 Ngram语言模型的训练语料中出现的次数。
然而, 标准 Ngram语言模型建模方法也存在明显的缺点, 一方面, 标准 Ngram语言模型是单一的模型, 而实际应用中, 用户的汉语输入、 手写识别、 语音识别等需求是多变的、 也是无限的, 例如, 用户有时需 要撰写技术报告, 有时在网上聊天, 在这两种情境下, 用户的汉语输入 需求是不同的; 再例如, 不同年龄段的用户, 由于生活经历的不同, 说 话习惯存在很大不同, 反映在汉语输入上, 就是这些人群经常输入的内 容差别很大。 因而, 单一模型无法满足不同年龄段的用户、 以及同一用 户在不同输入场景下对汉语输入的不同需求, 不同的输入需求采用同一 模型, 使得对用户不同需求的输入, 影响了识别的准确性; 另一方面, 标准 Ngram语言模型本身没有自动学习的机制, 标准 Ngram语言模型 中的参数一经训练便被确定下来, 无法根据用户的输入习惯进行学习和 智能调整, 使得对用户输入的识别准确率较低。
本发明实施方式提出的文字输入方法和装置, 能够满足不同用户对 汉语输入的需求、 提高识别准确率。
现有的用于语言建模的标准 Ngram语言模型, 是单一的模型, 无法 满足不同用户对语句输入的不同需求, 且由于自身没有自动学习机制, 无法根据用户的输入习惯进行学习和智能调整, 使得对用户输入的识别 准确率较低。 以下以用户输入为汉语为例进行说明。
实际应用中, 通过统计分析发现, 用户当前输入的内容(语句 )具 有短时稳定性的特点, 即用户在一段时间内的输入, 一般围绕着同一个 话题进行或展开的。 因此, 用户当前的输入内容, 在接下来的输入中, 存在较大的可能性再次出现、 或者出现类似的输入内容。 也就是说, 无 论以何种输入方式, 例如, 语音、 手写或键盘输入, 用户在一段时间内 的输入是围绕着同一个话题进行的, 其当前输入的话题或内容具有 "短 时稳定性"。
因而, 基于上述统计分析, 本发明实施例中, 针对标准 Ngram语言 模型的缺点, 提出基于緩存的语言建模方法, 通过利用緩存数据结构, 存储用户当前的输入内容, 并对緩存的内容进行数学分析, 从而建立用 户输入的数学模型, 并随着用户的不断输入, 通过不断更新緩存数据结 构中的内容, 实时学习用户的输入习惯、 适应用户的输入需求, 从而使 人机交互变得更加智能, 使建立的用户输入数学模型也越来越精确, 越 来越符合用户的真实输入需求, 从而在用户接下来的输入过程中, 利用 建立的数学模型对用户的输入内容做出更准确的识别, 实现动态学习和 适应用户的输入需求。
图 6为本发明实施例的语言建模方法流程示意图。 参见图 6, 该流 程包括:
步骤 601 , 接收用户的输入, 根据预先建立的标准 Ngram语言模型 分别计算用户输入中各词语的条件概率;
本步骤中, 用户的输入包括: 输入法输入、 手写识别输入以及语音 识别输入等。 其中,
对于输入法输入, 键盘处理程序接收用户输入的字符, 根据采用的 输入法进行识别处理, 得到拼音信息, 输出至映射器, 经过映射器的映 射处理, 将拼音信息映射为相应的候选汉字, 形成用户输入;
对于手写识别输入,预置的笔迹识别程序提取用户手写笔迹的信息, 获取笔迹信息, 经过映射器的映射处理, 将笔迹信息映射为相应的候选 汉字, 形成用户输入;
对于语音识别输入, 预置的音频处理程序对输入的用户音频进行采 样、 量化、 滤波及去噪等处理, 获取音频信息, 经过映射器的映射处理, 将音频信息映射为相应的候选汉字, 形成用户输入。
上述示例中, 映射器对于拼音信息、 笔迹信息以及音频信息的映射 处理, 具体可参见相关技术文献, 在此不再赘述。
根据标准 Ngram语言模型分别计算用户输入中各词语的条件概率, 与现有技术相同, 在此不再赘述。
步骤 602, 确定预先緩存有用户输入, 根据用户的输入以及预先緩 存的用户输入, 按照预先设置的基于緩存的语言建模策略分别计算用户 输入中各词语的条件概率;
本步骤中, 如果用户的输入为首次输入, 则预先緩存的用户输入为 空, 计算各输出语句的语句概率与现有技术相同。
预先设置的基于緩存的语言建模策略公式为:
Figure imgf000024_0001
式中,
Pcacke I , )为第 个词语 ^的緩存条件概率; c(w^ · · · w^ )表示词语序列 · · · w^在緩存的训练语料中出现的 次数, 即包含第''个词语及该第''个词语之前预设常数( " )个词语的词 语序列在緩存的训练语料中出现的次数;
C(W^ · · · W^ )表示词语序列 · · · W^在緩存的训练语料中出 现的次数, 即包含该第 个词语之前预设常数个词语的词语序列在緩存 的训练语料中出现的次数;
/( '·)为时间函数。
关于该公式, 后续再进行详细描述。
步骤 603 , 根据基于标准 Ngram语言模型计算得到的各词语的条件 概率以及基于緩存的建模策略计算得到的各词语的条件概率计算融合 条件概率, 基于融合条件概率获取各输出语句的语句概率;
本步骤中, 融合条件概率的计算公式为:
Pmbture (Wi 1 Wi-„+lWi-„+2 · · · Wi-l ) = I W^W^ · .. ) + (1 - «)¾c¾e (W, I U,.—„+2... 式中, "为插值系数, 是一个常数, 取值在 0和 1之间, 可以根据 实际需要确定;
Ρ · /^+1 +2 ···^)为基于标准 Ngmm语言模型计算得到的第 个 词语 的标准条件概率。
输出语句的语句概率 (即为上屏概率) 的计算公式为:
P(S) = ]J Pmixture {w{ I „+2 . . . )
i=l
步骤 604 , 选择概率最大的输出语句输出并緩存该输出语句。
图 7为本发明实施例的语言建模方法具体流程示意图。 参见图 7, 该流程包括:
步骤 701 , 预先建立标准 Ngram语言模型;
本步骤中, 建立标准 Ngram语言模型可以参照现有技术, 根据最大 似然估计的方法利用训练语料训练出标准 Ngram语言模型。
此时,用户还没有进行输入,预先设置的緩存区中緩存的内容为空。
步骤 702, 接收用户的输入内容, 根据预先建立的标准 Ngram语言 模型对用户的输入内容计算各输出语句的语句概率(即上屏概率); 本步骤中, 用户可以通过语音、 手写字符、 光学字符或键盘按键输 入内容, 当用户开始输入内容时, 通过映射器的映射处理, 映射为候选 文字, 再根据标准 Ngram语言模型对映射的候选文字进行处理, 即进行 输入内容的内核计算过程,根据标准 Ngram语言模型计算各种可能的输 出语句的概率, 与现有技术相同, 在此不再赘述。
步骤 703 , 选择概率最大的输出语句输出;
本步骤中,根据标准 Ngram语言模型计算得到的各种可能的输出语 句的概率, 从中选择概率最大的输出语句作为用户输出, 即将概率最大 的输出语句作为识别出的汉语语句, 一条汉语语句可以包括一个或多个 词语。
步骤 704, 对输出语句进行修正, 并将修正的输出语句输出至预先 设置的緩存区进行緩存;
本步骤中, 用户可以查验输出语句是否与自身的输入需求相匹配, 如果不匹配, 则进行修正, 例如, 用户期望的输入语句为 "这是事实", 根据标准 Ngram语言模型计算得到的概率最大的输出语句,即输入法对 用户输入语句的识别结果为 "这是实时", 则与用户期望的输入需求不 相匹配, 此时用户需要根据输入法的候选将 "实时" 修正为 "事实", 并输出至预先设置的緩存区进行緩存。
实际应用中, 緩存区緩存的内容可以以用户为标识。
步骤 705 , 以緩存区的语句为训练语料, 建立基于緩存的语言模型; 本步骤中, 基于緩存的语言模型是建立在緩存区中存储内容的基础 之上。 緩存区中的内容是根据用户的最近输入得到的, 可以看作是一个 用户特定的、 小规模的训练语料库。
一方面, 基于緩存的语言模型同标准 Ngram语言模型一样, 通过计 算词语和词语之间的条件概率, 用以描述用户当前输入的统计特征, 汉 语语句的概率可以用如下公式计算:
Figure imgf000027_0001
式中, 表示根据緩存区中緩存的内容统计出的汉语语句的概 率值;
m为汉语语句包含的词语个数;
w'为汉语语句中包含的第 i个词语;
P^ w^w^ .. ^ )为词语 W在该汉语语句中的条件概率; "为预先设置的常数。
另一方面, 由前述的统计分析可知, 用户的输入具有 "短时稳定性" 特点, 其中, "短时" 为表征时间的维度, 即用户当前的输入内容仅仅 与该用户最近一段时间的输入内容相关, 而与该用户 4艮久以前的输入内 容无关。 也就是说, 用户当前输入的内容通常比较稳定, 用户当前的输 入内容同当前的输入话题相关, 经过一段时间, 当用户输入的话题转移 之后, 用户的输入内容与该用户从前的话题关联性不大。 因而, 对于基 于緩存的语言模型来说, 用户当前输入的词语, 与最近进入緩存区中的 词语关系最密切, 而与较长时间前进入緩存区的词语的关联度较低。
与标准 Ngram语言模型不同的是, 緩存区中当前词语的条件概率, 不仅与该当前词语的上下文词语相关, 而且与该当前词语进入緩存区的 时间相关。 因而, 考虑时间因素, 则基于緩存的语言模型中, 可以将语 句概率计算公式修正为:
Figure imgf000027_0002
可见, 与前述的条件概率 相比, 修正后的公式中的语句概率 ^ 考虑了 时间变量参数 , 即当前词语 出现的条件概率不仅与上下文词语 - W W I^相关, 而且与 上一次进入緩存区的时间相关。
由于基于緩存的语言模型中, 每个词语的条件概率不仅与该词语的 上下文相关,而且与该词语上一次进入緩存区的时间相关。而标准 Ngram 语言模型中的最大似然估计方法, 只考虑了上下文相关的词汇, 没有考 虑到时间信息, 因而, 不能够直接用于训练基于緩存的语言模型所需的 参数。 为了估计基于緩存的语言模型中词语的条件概率, 通过改进最大 似然估计方法, 在其中加入时间信息, 采用如下公式来计算条件概率
-„+2 · · · , ^ )的值:
Figure imgf000028_0001
与最大似然估计方法不同的是, 上式中, 考虑了时间函数 用 以描述时间因素对语句条件概率的影响。 定义时间函数 如下:
, , d 式中, ^ '为时间变量参数, 即词语 ^进入緩存区中的时间点与当前 用户输入语句的时间点之间的时间间隔。
实际应用中, 如果緩存区的底层数据结构采用队列来实现, 则时间 变量参数^ '的取值可以为词语 在緩存队列中的位置。 例如, 对于首次 进入緩存区中的词语, 如果该词语 排列在队列首端, 假设位置序号为
1 , 则公式中的该词语 对应的时间变量参数 ^ '的取值为 1。
s为预先设置的常数, 用以调节时间变量参数信息在条件概率估计 时的权重。 由上述公式可知, 如果词语1^进入緩存区的时间点越早, 则与当前 用户输入语句的时间间隔越长, 则时间变量参数^ '的取值越大, 使得时 间函数 /( 的取值越小, 从而使得条件概率 ^^ /1^"1^^21^, )的 取值也就越小; 反之, 词语 进入緩存区越晚, 则与当前用户输入的时 间间隔越短, 则时间变量参数 的取值越小, 时间函数 /( 的取值越大, 从而使得条件概率 Ρ:» +2 · · Ί )的取值越大。
步骤 706, 接收用户的输入内容, 根据预先建立的标准 Ngram语言 模型以及基于緩存的语言模型对用户的输入内容分别计算各输出语句 的语句概率;
本步骤中, 在用户接下来的输入过程中, 由标准 Ngram语言模型和 新建立的基于緩存的语言模型共同组成混合模型, 由混合模型对用户的 输入进行处理, 并综合产生处理结果。
本发明实施例中, 采用线性插值的方法, 将基于緩存的语言模型中 的条件概率
Figure imgf000029_0001
^^^-^…^ 与标准 Ngram语言模型中的条件概 率 P -„+ -„+2— W- 相融合, 计算得出 融合后的条件概率
+2 · · · ) , 公式: ¾口下:
-„+2 · · · -„+2 · · · 0- -„+2 · · · 式中, "为插值系数, 是一个常数, 取值在 0和 1之间, 用于调节 基于緩存的语言模型中的条件概率和标准 Ngram语言模型中的条件概 率在最终混合模型中概率的比重。
依据上述混合模型, 一个包含 个词语的汉语语句 5 = · ·^的概 率可以由以下公式计算得出: P(S) = \ pmixture (wt I „+1w;—„+2 . . . ) 举例来说,如果用户在前输入了 "肖镜辉^ 讯员工",经标准 Ngram 语言模型识别后, 緩存区中緩存了 "肖"、 "镜"、 "辉" 三个单字词以及 词语 "是"、 "腾讯员工", 当用户再输入 "肖镜辉写了一篇专利" 时, 基于緩存的语言模型中緩存区存储的 "肖"、 "镜"、 "辉" 三个单字词就 对当前的输入语句发生作用:如果没有緩存区存储的 "肖"、 "镜"、 "辉" 三个单字词, 在用户新输入时, "肖镜辉" 被转换错误的概率相对就较 高, 而根据緩存区緩存的信息, "肖镜辉" 被正确转换的概率就较高, 因而, 使得输入的 "肖镜辉写了一篇专利" 被输入法正确转换出来的概 率就较大。
从上述过程中可以看到, 基于緩存的语言模型是根据用户的当前输 入不断建立起来的, 一方面反映了用户当前的输入场景信息, 另一方面 也反映了用户本身的输入习惯。标准 Ngram语言模型结合基于緩存的语 言模型, 能够有效地对用户的输入场景和输入习惯进行学习和自适应。
步骤 707, 选择概率最大的输出语句输出;
步骤 708, 根据输出语句更新緩存区中緩存的语句。
实验表明, 同标准 Ngram语言模型相比, 本发明实施例的基于緩存 的语言模型建模方法, 对用户输入的识别具有更高的准确率, 并且, 在 此基础之上构建的汉语输入软件具有更高的智能性。
所应说明的是, 本发明实施例的语言建模方法, 不仅可应用于汉语 输入法, 也可应用于日语、 韩语、 柬埔寨等其它亚洲语言的输入法, 其 语言建模方法与汉语语言建模方法相类似, 在此不再赘述。
图 8为本发明实施例的语言建模装置结构示意图。 参见图 8, 该装 置包括: 标准 Ngram语言模型模块、 緩存模块、 基于緩存的语言建模模 块以及混合模型模块, 其中,
标准 Ngram语言模型模块, 用于接收用户的输入, 分别计算用户输 入中各词语的标准条件概率, 输出至混合模型模块;
本发明实施例中,标准 Ngram语言模型模块计算词语条件概率的公 式为:
P · · · ) -― Γ 式中, C^—„+^„ 表示词语序列 在标准 Ngram语言模 型的训练语料中出现的次数;
w'为汉语语句中包含的第 i个词语;
"为预先设置的常数。
緩存模块, 用于緩存混合模型模块输出的语句;
基于緩存的语言建模模块, 用于按照预先设置的基于緩存的语言建 模策略, 根据用户的输入以及緩存模块緩存的语句, 分别计算用户输入 中各词语的条件概率, 输出至混合模型模块;
本发明实施例中, 基于緩存的语言建模模块计算词语条件概率的公 式为:
—„+2…^ , = f (ti )x „+1 . ,― 式中, π · · ^^表示词语序列 … 在緩存的训练语料中 出现的次数;
w'为汉语语句中包含的第 i个词语;
"为预先设置的常数;
/( '·)为时间函数。
混合模型模块, 用于根据各词语的标准条件概率以及緩存条件概率 计算融合条件概率, 基于融合条件概率获取各输出语句的语句概率, 选 择概率最大的输出语句输出。
本发明实施例中, 融合条件概率的计算公式为:
Pmbture (Wi 1 wi-„+iwi-„+2 · · · Wi-i ) = axpiw i I w^w^ · .. ) + (1 - a)xPcache (w, I u ,.—„+2 . . . 式中, "为插值系数, 是一个常数, 取值在 0和 1之间。
输出语句的语句概率计算公式为:
m
^)=11 Pm ure (^- I ^ + +2 · · · ^-1 )
i=l
式中, m为汉语语句包含的词语个数。
其中,
标准 Ngram语言模型模块包括: 第一词语序列频次计数单元、 第二 词语序列频次计数单元以及标准条件概率计算单元(图中未示出), 其 中,
第一词语序列频次计数单元, 用于获取包含该第 i个词语及该第 i个 词语之前预设常数个词语的词语序列在标准 Ngram语言模型的训练语 料中出现的次数^ , 输出至标准条件概率计算单元;
第二词语序列频次计数单元,用于获取包含该第 个词语之前预设常 数个词语的词语序列在标准 Ngram语言模型的训练语料中出现的次数 k," , 输出至标准条件概率计算单元;
标准条件概率计算单元, 用于计算次数 ^与次数^ -1的比值, 将计 算得到的比值作为所述用户输入中第 个词语的标准条件概率。
基于緩存的语言建模模块包括: 第三词语序列频次计数单元、 第四 词语序列频次计数单元、 时间函数值获取单元以及緩存条件概率计算单 元(图中未示出), 其中, 第三词语序列频次计数单元, 用于获取包含该第 i个词语及该第 i个 词语之前预设常数个词语的词语序列在緩存的训练语料中出现的次数 输出至緩存条件概率计算单元;
第四词语序列频次计数单元,用于获取包含该第 '·个词语之前预设常 数个词语的词语序列在緩存的训练语料中出现的次数 ,输出至緩存条 件概率计算单元;
时间函数值获取单元,用于获取该第 个词语的时间函数值,输出至 緩存条件概率计算单元;
緩存条件概率计算单元, 用于计算次数 与次数 的比值, 将计算 得到的比值与该第 个词语的时间函数值相乘, 得到所述用户输入中第 个词语的緩存条件概率。
混合模型模块包括: 插值系数存储单元、 第一乘积单元、 第二乘积 单元、 融合条件概率计算单元、 语句概率计算单元以及输出语句选择单 元(图中未示出 ), 其中,
插值系数存储单元, 用于存储预先设置在 0至 1之间的插值系数; 第一乘积单元, 用于根据插值系数存储单元存储的插值系数, 计算 该插值系数与第 个词语的标准条件概率的乘积, 输出至融合条件概率 计算单元;
第二乘积单元,用于计算 1与该插值系数的差与第 个词语的緩存条 件概率的乘积, 输出至融合条件概率计算单元;
融合条件概率计算单元,用于将接收的与第''个词语相关的乘积进行 相加, 作为第 i个词语的融合条件概率;
语句概率计算单元, 用于将融合条件概率计算单元获取的各词语的 融合条件概率依次相乘得到输出语句的语句概率; 输出语句选择单元, 用于选择语句概率计算单元计算得到的最大语 句概率, 将该最大语句概率对应的输出语句输出。
由上述可见, 本发明实施例的语言建模方法及语言建模装置, 通过 对用户输入进行緩存, 使得緩存的用户输入与用户输入的历史信息以及 用户输入场景相关, 这样, 基于緩存建立的语言建模模型一方面具有自 学习的功能, 从而提高了语言模型的智能性; 另一方面, 通过对每个用 户的输入习惯进行学习和适应, 也使得人机交互软件能够适应不同用户 群体和应用场景。 具体来说, 具有如下有益技术效果:
一、 本发明提高了语言模型的性能, 能够满足不同用户对汉语输入 的需求、提高预测准确率, 进而可以应用到语音识别、手写体字符识别、 汉语键盘输入法、 光学字符识别等领域, 提高相关系统的准确率;
二、 在本发明的基础上可以建立基于语言模型的信息检索系统, 提 高信息检索系统的性能, 例如, 准确率、 召回率等。 体和详细, 但并不能因此而理解为对本发明专利范围的限制。 应当指出 的是,对于本领域的普通技术人员来说,在不脱离本发明构思的前提下, 还可以做出若干变形和改进, 这些都属于本发明的保护范围。 因此, 本 发明专利的保护范围应以所附权利要求为准。

Claims

权利要求书
1、 一种文字输入方法, 其特征在于, 包括以下步骤:
获取用户标识, 根据用户标识查找对应的用户语言模型; 获取用户 输入, 根据所述用户输入生成候选语句列表; 获取通用语言模型, 根据 所述用户语言模型和通用语言模型计算所述候选语句列表中的候选语 句的上屏概率; 或
根据通用语言模型, 分别计算用户输入中各词语的标准条件概率; 按照预先设置的基于緩存的语言建模策略, 根据所述用户输入以及预先 緩存的用户输入, 分别计算所述用户输入中各词语的緩存条件概率; 根 据各词语的标准条件概率以及緩存条件概率计算融合条件概率, 基于融 合条件概率获取各候选语句的上屏概率;
按照所述上屏概率的大小顺序对所述候选语句列表中的候选语句进 行排序;
输出排序后的候选语句列表。
2、根据权利要求 1所述的文字输入方法, 其特征在于, 所述方法还 包括建立与用户标识对应的用户语言模型并在每次用户输入词条后根 据用户输入的词条信息更新用户语言模型的步骤。
3、根据权利要求 2所述的文字输入方法, 其特征在于, 所述更新用 户语言模型的步骤具体为:
记录用户输入的词条信息和词频信息;
获取所述词条信息和词频信息, 对词条进行分词, 根据所述词频信 息对分词后的词条进行词频整理;
根据分词后的词条和整理后的词频更新所述用户语言模型。
4、根据权利要求 1-3中任意一项所述的文字输入方法,其特征在于, 所述根据用户语言模型和通用语言模型计算所述候选语句列表中的候 选语句的上屏概率的步骤为:
对所述用户语言模型和通用语言模型进行线性插值,生成混合模型, 根据所述混合模型计算所述候选语句列表中的候选语句的上屏概率。
5、根据权利要求 1-3中任意一项所述的文字输入方法,其特征在于, 所述用户标识为用户在输入法软件上注册的帐号、 为用户分配的标识号 码、 与用户所使用的设备关联的 IP地址或 MAC地址。
6、根据权利要求 1所述的文字输入方法, 其特征在于, 该方法中选 择上屏概率最大的输出语句输出并緩存该输出语句;
所述根据通用语言模型, 分别计算用户输入中各词语的标准条件概 率为: 根据预先建立的标准 Ngram语言模型, 分别计算用户输入中各词 语的标准条件概率, 具体包括:
计算用户输入中第 i个词语的緩存条件概率包括:
获取包含该第 i个词语及该第 i个词语之前预设常数个词语的词语序 列在緩存的训练语料中出现的次数
获取包含该第 个词语之前预设常数个词语的词语序列在緩存的训 练语料中出现的次数 ;
获取该第 i个词语的时间函数值;
计算次数 与次数 k"的比值, 将计算得到的比值与该第 i个词语的 时间函数值相乘, 得到所述用户输入中第 i个词语的緩存条件概率。
7、根据权利要求 6所述的文字输入方法, 其特征在于, 将预先设置 的常数与第 个词语进入緩存区中的时间点与当前用户输入语句的时间 点之间的时间间隔进行相比得到所述时间函数值。
8、 根据权利要求 7所述的文字输入方法, 其特征在于, 计算用户 输入中第 个词语的标准条件概率包括: 获取包含该第 i个词语及该第 i个词语之前预设常数个词语的词语序 列在标准 Ngram语言模型的训练语料中出现的次数
获取包含该第 i个词语之前预设常数个词语的词语序列在标准
Ngram语言模型的训练语料中出现的次数^ -i;
计算次数^与次数 的比值, 将计算得到的比值作为所述用户输 入中第 个词语的标准条件概率。
9、 根据权利要求 8所述的文字输入方法, 其特征在于, 计算第 个 词语的融合条件概率包括:
Al、 确定取值在 0至 1之间的插值系数;
A2、 计算该插值系数与第 ^个词语的标准条件概率的乘积;
A3、计算 1与该插值系数的差与第 ^个词语的緩存条件概率的乘积; A4、 计算步骤 A2、 A3得到的乘积的和, 作为第 ^个词语的融合条 件概率。
10、 根据权利要求 9所述的文字输入方法, 其特征在于, 计算输出 语句的上屏概率包括:
分别获取语句包含的各词语的融合条件概率;
将获取的各词语的融合条件概率依次相乘得到输出语句的上屏概 率。
11、 根据权利要求 10所述的文字输入方法, 其特征在于, 所述緩 存的用户输入采用队列的数据结构, 所述第 个词语的时间间隔的取值 为第 个词语在緩存队列中的位置。
12、 如权利要求 6-11中任一项所述的文字输入方法, 其特征在于, 在所述选择上屏概率最大的输出语句输出后, 緩存该输出语句前, 进一 步包括: 对输出语句进行修正。
13、如权利要求 12所述的文字输入方法, 其特征在于, 所述用户输 入包括: 输入法输入、 手写识别输入以及语音识别输入。
14、 如权利要求 7所述的文字输入方法, 其特征在于, 在所述预先 緩存的用户输入为空时, 所述用户输入中各词语的緩存条件概率等于该 词语的标准条件概率。
15、 一种文字输入方法, 其特征在于, 包括以下步骤:
客户端获取用户标识, 根据用户标识从服务器查找对应的用户语言 模型;
所述客户端获取用户输入, 将所述用户输入上传到服务器, 所述服 务器根据所述用户输入生成候选语句列表;
所述服务器获取通用语言模型, 根据所述用户语言模型和通用语言 模型计算所述候选语句列表中的候选语句的上屏概率;
所述服务器按照所述上屏概率的大小顺序对所述候选语句列表中的 候选语句进行排序, 将排序后的候选语句列表下发到所述客户端; 所述客户端接收所述排序后的候选语句列表并输出。
16、根据权利要求 15所述的文字输入方法, 其特征在于, 所述方法 还包括在服务器上建立与用户标识对应的用户语言模型并在每次用户 输入词条后根据用户输入的词条信息更新用户语言模型的步骤。
17、根据权利要求 16所述的文字输入方法, 其特征在于, 所述更新 用户语言模型的步骤具体为:
记录用户输入的词条信息和词频信息;
获取所述词条信息和词频信息, 对词条进行分词, 根据所述词频信 息对分词后的词条进行词频整理;
根据分词后的词条和整理后的词频更新所述用户语言模型。
18、 根据权利要求 15-17中任意一项所述的文字输入方法, 其特征 在于, 所述服务器根据所述用户语言模型和通用语言模型计算所述候选 语句列表中的候选语句的上屏概率的步骤为:
对所述用户语言模型和通用语言模型进行线性插值,生成混合模型, 根据所述混合模型计算所述候选语句列表中的候选语句的上屏概率。
19、 一种文字输入方法, 其特征在于, 包括以下步骤:
客户端获取用户标识, 根据用户标识在自身查找对应的用户语言模 型;
所述客户端获取用户输入, 并根据用户输入生成候选语句列表; 所述客户端自身获取通用语言模型, 根据所述用户语言模型和通用 语言模型计算所述候选语句列表中的候选语句的上屏概率;
所述客户端按照所述上屏概率的大小顺序对所述候选语句列表中的 候选语句进行排序, 并输出排序后的候选语句列表。
20、根据权利要求 19所述的文字输入方法, 其特征在于, 所述方法 还包括在客户端上建立与用户标识对应的用户语言模型并在每次用户 输入词条后根据用户输入的词条信息更新用户语言模型的步骤。
21、根据权利要求 20所述的文字输入方法, 其特征在于, 所述更新 用户语言模型的步骤具体为:
记录用户输入的词条信息和词频信息;
获取所述词条信息和词频信息, 对词条进行分词, 根据所述词频信 息对分词后的词条进行词频整理;
根据分词后的词条和整理后的词频更新所述用户语言模型。
22、根据权利要求 19所述的文字输入方法, 其特征在于, 所述客户 端根据所述用户语言模型和通用语言模型计算所述候选语句列表中的 候选语句的上屏概率的步骤为:
客户端对所述用户语言模型和通用语言模型进行线性插值, 生成混 合模型, 根据所述混合模型计算所述候选语句列表中的候选语句的上屏 概率。
23、 一种文字输入系统, 其特征在于, 包括:
查找模块, 用于获取用户标识, ^据用户标识查找对应的用户语言 模型;
候选语句列表生成模块, 用于获取用户输入, 根据所述用户输入生 成候选语句列表;
概率计算模块, 用于根据所述用户语言模型和通用语言模型生成所 述候选语句列表中的候选语句的上屏概率;
排序模块, 用于按照所述上屏概率的大小顺序对所述候选语句列表 中的候选语句进行排序;
输出模块, 用于输出排序后的候选语句列表。
24、根据权利要求 23所述的文字输入系统, 其特征在于, 所述系统 还包括:
用户语言模型建立模块,用于建立与用户标识对应的用户语言模型; 用户语言模型更新模块, 用于在每次用户输入词条后根据用户输入 的词条信息更新用户语言模型。
25、根据权利要求 14所述的文字输入系统, 其特征在于, 所述用户 语言模型更新模块用于记录用户输入的词条信息和词频信息, 获取所述 词条信息和词频信息, 对词条进行分词, 根据所述词频信息对分词后的 词条进行词频整理, 根据分词后的词条和整理后的词频更新所述用户语 言模型。
26、 根据权利要求 23-25 中任意一项所述的文字输入系统, 其特征 在于, 所述上屏概率生成模块用于对所述用户语言模型和通用语言模型 进行线性插值, 生成混合模型, 根据所述混合模型计算所述候选语句列 表中的候选语句的上屏概率。
27、 一种文字处理系统, 其特征在于, 包括客户端和服务器, 其中: 客户端, 用于获取用户标识, 根据用户标识从服务器查找对应的用 户语言模型; 获取用户输入, 将所述用户输入上传到服务器; 接收由服 务器排序后的候选语句列表并输出;
服务器, 用于根据所述用户输入生成候选语句列表, 获取通用语言 模型, 根据所述用户语言模型和通用语言模型计算所述候选语句列表中 的候选语句的上屏概率, 按照所述上屏概率的大小顺序对所述候选语句 列表中的候选语句进行排序, 并将排序后的候选语句列表下发到所述客 户端。
28、 根据权利要求 27所述的文字处理系统, 其特征在于, 服务器, 用于对所述用户语言模型和通用语言模型进行线性插值, 生成混合模型, 根据所述混合模型计算所述候选语句列表中的候选语句 的上屏概率。
29、根据权利要求 27所述的文字处理系统, 其特征在于, 进一步包 括文字处理单元, 其中:
客户端, 用于从服务器排序后的候选语句列表中选择候选语句, 并 将所述候选语句输出到文字处理单元;
所述文字处理单元, 用于对所述候选语句进行文字处理。
30、 根据权利要求 29所述的文字处理系统, 其特征在于, 所述文字处理单元为: 文本文件处理单元、 记事本处理单元、 即时 通讯工具或演示文档处理单元。
31、 一种文字处理装置, 其特征在于, 该装置包括: 通用语言模型 模块、 緩存模块、 基于緩存的语言建模模块以及混合模型模块, 其中, 通用语言模型模块, 用于接收用户的输入, 分别计算用户输入中各 词语的标准条件概率, 输出至混合模型模块;
緩存模块, 用于緩存混合模型模块输出的语句;
基于緩存的语言建模模块, 用于按照预先设置的基于緩存的语言建 模策略, 根据用户的输入以及緩存模块緩存的语句, 分别计算用户输入 中各词语的緩存条件概率, 输出至混合模型模块;
混合模型模块, 用于根据各词语的标准条件概率以及緩存条件概率 计算融合条件概率, 基于融合条件概率获取各输出语句的语句概率, 选 择概率最大的输出语句输出。
32、如权利要求 31所述的装置, 其特征在于, 所述通用语言模型为 标准 Ngram语言模型模块, 并且包括: 第一词语序列频次计数单元、 第 二词语序列频次计数单元以及标准条件概率计算单元, 其中,
第一词语序列频次计数单元, 用于获取包含该第 个词语及该第 个 词语之前预设常数个词语的词语序列在标准 Ngram语言模型的训练语 料中出现的次数^ , 输出至标准条件概率计算单元;
第二词语序列频次计数单元,用于获取包含该第 个词语之前预设常 数个词语的词语序列在标准 Ngram语言模型的训练语料中出现的次数 k," , 输出至标准条件概率计算单元;
标准条件概率计算单元, 用于计算次数 ^与次数^ -1的比值, 将计 算得到的比值作为所述用户输入中第 个词语的标准条件概率。
33、如权利要求 32所述的装置, 其特征在于, 所述基于緩存的语言 建模模块包括: 第三词语序列频次计数单元、 第四词语序列频次计数单 元、 时间函数值获取单元以及緩存条件概率计算单元, 其中,
第三词语序列频次计数单元, 用于获取包含该第 个词语及该第 个 词语之前预设常数个词语的词语序列在緩存的训练语料中出现的次数 s 输出至緩存条件概率计算单元;
第四词语序列频次计数单元,用于获取包含该第 个词语之前预设常 数个词语的词语序列在緩存的训练语料中出现的次数 ^1 ,输出至緩存条 件概率计算单元;
时间函数值获取单元,用于获取该第' '个词语的时间函数值,输出至 緩存条件概率计算单元;
緩存条件概率计算单元, 用于计算次数 与次数 ^1的比值, 将计算 得到的比值与该第 个词语的时间函数值相乘, 得到所述用户输入中第 个词语的緩存条件概率。
34、如权利要求 33所述的装置, 其特征在于, 所述混合模型模块包 括: 插值系数存储单元、 第一乘积单元、 第二乘积单元、 融合条件概率 计算单元、 语句概率计算单元以及输出语句选择单元, 其中,
插值系数存储单元, 用于存储预先设置在 0至 1之间的插值系数; 第一乘积单元, 用于根据插值系数存储单元存储的插值系数, 计算 该插值系数与第 个词语的标准条件概率的乘积, 输出至融合条件概率 计算单元;
第二乘积单元,用于计算 1与该插值系数的差与第 个词语的緩存条 件概率的乘积, 输出至融合条件概率计算单元;
融合条件概率计算单元,用于将接收的与第''个词语相关的乘积进行 相加, 作为第 i个词语的融合条件概率;
语句概率计算单元, 用于将融合条件概率计算单元获取的各词语的 融合条件概率依次相乘得到输出语句的语句概率;
输出语句选择单元, 用于选择语句概率计算单元计算得到的最大语 句概率, 将该最大语句概率对应的输出语句输出。
PCT/CN2012/078591 2011-07-14 2012-07-13 文字输入方法、装置及系统 WO2013007210A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2014519401A JP5926378B2 (ja) 2011-07-14 2012-07-13 テキスト入力方法、装置、およびシステム
US14/232,737 US9176941B2 (en) 2011-07-14 2012-07-13 Text inputting method, apparatus and system based on a cache-based language model and a universal language model
EP12811503.7A EP2733582A4 (en) 2011-07-14 2012-07-13 METHOD, DEVICE AND SYSTEM FOR CHARACTER ENTRY

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201110197062.2A CN102880611B (zh) 2011-07-14 2011-07-14 一种语言建模方法及语言建模装置
CN201110197062.2 2011-07-14
CN201110209014.0 2011-07-25
CN201110209014.0A CN102902362B (zh) 2011-07-25 2011-07-25 文字输入方法及系统

Publications (1)

Publication Number Publication Date
WO2013007210A1 true WO2013007210A1 (zh) 2013-01-17

Family

ID=47505527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/078591 WO2013007210A1 (zh) 2011-07-14 2012-07-13 文字输入方法、装置及系统

Country Status (4)

Country Link
US (1) US9176941B2 (zh)
EP (1) EP2733582A4 (zh)
JP (1) JP5926378B2 (zh)
WO (1) WO2013007210A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413745A (zh) * 2019-06-21 2019-11-05 阿里巴巴集团控股有限公司 选择代表文本的方法、确定标准问题的方法及装置
CN110807316A (zh) * 2019-10-30 2020-02-18 安阳师范学院 一种汉语选词填空方法

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150002506A1 (en) * 2013-06-28 2015-01-01 Here Global B.V. Method and apparatus for providing augmented reality display spaces
CN103870553B (zh) * 2014-03-03 2018-07-10 百度在线网络技术(北京)有限公司 一种输入资源推送方法及系统
CN104281649B (zh) * 2014-09-09 2017-04-19 北京搜狗科技发展有限公司 一种输入方法、装置及电子设备
CN107305575B (zh) 2016-04-25 2021-01-26 北京京东尚科信息技术有限公司 人机智能问答系统的断句识别方法和装置
CN107688398B (zh) * 2016-08-03 2019-09-17 中国科学院计算技术研究所 确定候选输入的方法和装置及输入提示方法和装置
US10102199B2 (en) 2017-02-24 2018-10-16 Microsoft Technology Licensing, Llc Corpus specific natural language query completion assistant
CN108573706B (zh) * 2017-03-10 2021-06-08 北京搜狗科技发展有限公司 一种语音识别方法、装置及设备
CN107193973B (zh) * 2017-05-25 2021-07-20 百度在线网络技术(北京)有限公司 语义解析信息的领域识别方法及装置、设备及可读介质
US10943583B1 (en) * 2017-07-20 2021-03-09 Amazon Technologies, Inc. Creation of language models for speech recognition
KR102157390B1 (ko) * 2017-12-01 2020-09-18 한국전자통신연구원 언어모델에 기반한 한국어 생략 성분 복원 방법
CN110245331A (zh) * 2018-03-09 2019-09-17 中兴通讯股份有限公司 一种语句转换方法、装置、服务器及计算机存储介质
CN108920560B (zh) * 2018-06-20 2022-10-04 腾讯科技(深圳)有限公司 生成方法、训练方法、装置、计算机可读介质及电子设备
US11205045B2 (en) * 2018-07-06 2021-12-21 International Business Machines Corporation Context-based autocompletion suggestion
US11074317B2 (en) 2018-11-07 2021-07-27 Samsung Electronics Co., Ltd. System and method for cached convolution calculation
CN110866499B (zh) * 2019-11-15 2022-12-13 爱驰汽车有限公司 手写文本识别方法、系统、设备及介质
CN113589949A (zh) * 2020-04-30 2021-11-02 北京搜狗科技发展有限公司 一种输入方法、装置和电子设备
CN116306612A (zh) * 2021-12-21 2023-06-23 华为技术有限公司 一种词句生成方法及相关设备
US11546323B1 (en) 2022-08-17 2023-01-03 strongDM, Inc. Credential management for distributed services
US11736531B1 (en) 2022-08-31 2023-08-22 strongDM, Inc. Managing and monitoring endpoint activity in secured networks
US11765159B1 (en) 2022-09-28 2023-09-19 strongDM, Inc. Connection revocation in overlay networks
US11916885B1 (en) 2023-01-09 2024-02-27 strongDM, Inc. Tunnelling with support for dynamic naming resolution
US11765207B1 (en) * 2023-03-17 2023-09-19 strongDM, Inc. Declaring network policies using natural language

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1447264A (zh) * 2003-04-18 2003-10-08 清华大学 基于语义构词约束的汉语二字词抽取方法
CN101131706A (zh) * 2007-09-28 2008-02-27 北京金山软件有限公司 一种查询修正方法及系统
CN101206673A (zh) * 2007-12-25 2008-06-25 北京科文书业信息技术有限公司 网络搜索过程中关键词的智能纠错系统及方法

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6418431B1 (en) * 1998-03-30 2002-07-09 Microsoft Corporation Information retrieval and speech recognition based on language models
US7165019B1 (en) * 1999-11-05 2007-01-16 Microsoft Corporation Language input architecture for converting one text form to another text form with modeless entry
US7395205B2 (en) * 2001-02-13 2008-07-01 International Business Machines Corporation Dynamic language model mixtures with history-based buckets
US7103534B2 (en) * 2001-03-31 2006-09-05 Microsoft Corporation Machine learning contextual approach to word determination for text input via reduced keypad keys
US7111248B2 (en) * 2002-01-15 2006-09-19 Openwave Systems Inc. Alphanumeric information input method
JP2006107353A (ja) * 2004-10-08 2006-04-20 Sony Corp 情報処理装置および方法、記録媒体、並びにプログラム
US7917355B2 (en) * 2007-08-23 2011-03-29 Google Inc. Word detection
JP2009075582A (ja) * 2007-08-29 2009-04-09 Advanced Media Inc 端末装置、言語モデル作成装置、および分散型音声認識システム
JP5475795B2 (ja) * 2008-11-05 2014-04-16 グーグル・インコーポレーテッド カスタム言語モデル
US20100131447A1 (en) * 2008-11-26 2010-05-27 Nokia Corporation Method, Apparatus and Computer Program Product for Providing an Adaptive Word Completion Mechanism
US8706643B1 (en) * 2009-01-13 2014-04-22 Amazon Technologies, Inc. Generating and suggesting phrases
JP5054711B2 (ja) * 2009-01-29 2012-10-24 日本放送協会 音声認識装置および音声認識プログラム
GB0905457D0 (en) * 2009-03-30 2009-05-13 Touchtype Ltd System and method for inputting text into electronic devices
US8260615B1 (en) * 2011-04-25 2012-09-04 Google Inc. Cross-lingual initialization of language models
US20120297294A1 (en) * 2011-05-17 2012-11-22 Microsoft Corporation Network search for writing assistance

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1447264A (zh) * 2003-04-18 2003-10-08 清华大学 基于语义构词约束的汉语二字词抽取方法
CN101131706A (zh) * 2007-09-28 2008-02-27 北京金山软件有限公司 一种查询修正方法及系统
CN101206673A (zh) * 2007-12-25 2008-06-25 北京科文书业信息技术有限公司 网络搜索过程中关键词的智能纠错系统及方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2733582A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413745A (zh) * 2019-06-21 2019-11-05 阿里巴巴集团控股有限公司 选择代表文本的方法、确定标准问题的方法及装置
CN110807316A (zh) * 2019-10-30 2020-02-18 安阳师范学院 一种汉语选词填空方法
CN110807316B (zh) * 2019-10-30 2023-08-15 安阳师范学院 一种汉语选词填空方法

Also Published As

Publication number Publication date
EP2733582A4 (en) 2015-01-14
JP2014521158A (ja) 2014-08-25
US20140136970A1 (en) 2014-05-15
JP5926378B2 (ja) 2016-05-25
US9176941B2 (en) 2015-11-03
EP2733582A1 (en) 2014-05-21

Similar Documents

Publication Publication Date Title
WO2013007210A1 (zh) 文字输入方法、装置及系统
US11557289B2 (en) Language models using domain-specific model components
US10719507B2 (en) System and method for natural language processing
US11200506B2 (en) Chatbot integrating derived user intent
WO2020107878A1 (zh) 文本摘要生成方法、装置、计算机设备及存储介质
US20190370398A1 (en) Method and apparatus for searching historical data
US8275618B2 (en) Mobile dictation correction user interface
US20200211537A1 (en) Scalable dynamic class language modeling
US20060149551A1 (en) Mobile dictation correction user interface
CN111753060A (zh) 信息检索方法、装置、设备及计算机可读存储介质
JP7200405B2 (ja) 音声認識のためのコンテキストバイアス
US20140108003A1 (en) Multiple device intelligent language model synchronization
WO2018024166A1 (zh) 确定候选输入的方法、输入提示方法和电子设备
KR20180064504A (ko) 개인화된 엔티티 발음 학습
CN111739514B (zh) 一种语音识别方法、装置、设备及介质
EP3736807B1 (en) Apparatus for media entity pronunciation using deep learning
EP3491541A1 (en) Conversation oriented machine-user interaction
WO2016144988A1 (en) Token-level interpolation for class-based language models
WO2021051514A1 (zh) 一种语音识别方法、装置、计算机设备及非易失性存储介质
US10860588B2 (en) Method and computer device for determining an intent associated with a query for generating an intent-specific response
US11043215B2 (en) Method and system for generating textual representation of user spoken utterance
WO2023108994A1 (zh) 一种语句生成方法及电子设备、存储介质
JP2023503717A (ja) エンド・ツー・エンド音声認識における固有名詞認識
KR101677859B1 (ko) 지식 베이스를 이용하는 시스템 응답 생성 방법 및 이를 수행하는 장치
US11170765B2 (en) Contextual multi-channel speech to text

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12811503

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014519401

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 14232737

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2012811503

Country of ref document: EP