CN110377916B - Word prediction method, word prediction device, computer equipment and storage medium - Google Patents

Word prediction method, word prediction device, computer equipment and storage medium

Info

Publication number
CN110377916B
CN110377916B (application CN201910740458.3A)
Authority
CN
China
Prior art keywords
word
predicted
frequency
possibility
current
Prior art date
Legal status
Active
Application number
CN201910740458.3A
Other languages
Chinese (zh)
Other versions
CN110377916A (en)
Inventor
黄羿衡
苏丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910740458.3A priority Critical patent/CN110377916B/en
Publication of CN110377916A publication Critical patent/CN110377916A/en
Application granted granted Critical
Publication of CN110377916B publication Critical patent/CN110377916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/253 - Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a word prediction method, a word prediction device, computer equipment and a storage medium. In the method, the computer equipment acquires a current word used for prediction and first context information of the word sequence before the current word; based on the current word and the first context information, it determines the probabilities that the word to be predicted after the current word respectively belongs to a plurality of different fields; for each field, it determines a first possibility that each word in a word list belongs to the word to be predicted, based on the current word and the first context information; and it determines a second possibility that each word in the word list belongs to the word to be predicted, according to the probabilities that the word to be predicted belongs to the plurality of different fields and the first possibilities, corresponding to each field, that each word in the word list belongs to the word to be predicted. The scheme improves the accuracy of predicting the probability of the next word after a given word, and thereby improves the accuracy of predicting the probability of a sentence.

Description

Word prediction method, word prediction device, computer equipment and storage medium
The present application is a divisional application of the patent application filed on August 17, 2018, with application number 201810942238.4, entitled "word prediction method, device, computer equipment and storage medium".
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a word prediction method, an apparatus, a computer device, and a storage medium.
Background
Language models are widely used in many fields such as speech recognition and machine translation. The role of a language model is to calculate the probability that a sentence occurs, so as to select, from a plurality of candidate sentences, the sentence that best fits human language. For example, in a speech recognition scenario, an input speech may be recognized as a plurality of candidate sentences, some of which contain wrong words or grammar and do not conform to human language; in this case, a language model is used to output, for each candidate sentence, the probability that it is reasonable.
In the process of determining the occurrence probability of a sentence to be predicted with the language model, a current word used for prediction needs to be determined in the sentence to be predicted, and the probability that each word in the word list of the language model belongs to the next word after the current word (i.e., the word to be predicted) needs to be determined. However, the accuracy with which the language model predicts the probability that each word in the word list belongs to the word to be predicted after the current word is generally low, so the accuracy with which the language model determines the sentence occurrence probability is also low.
Disclosure of Invention
In view of this, the present application provides a word prediction method, device, computer device and storage medium, so as to improve the accuracy of predicting the probability of occurrence of the next word after a certain word.
To achieve the above object, in one aspect, the present application provides a word prediction method, including:
acquiring a current word for prediction and first context information of a word sequence before the current word;
based on the current word and the first context information, determining the probabilities that the word to be predicted after the current word respectively belongs to a plurality of different fields;
for each field, determining a first possibility that each word in a word list respectively belongs to the word to be predicted based on the current word and first context information, wherein the first possibility is the possibility that the word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to the field; the word list is a set which is constructed in advance and contains a plurality of words;
and determining second possibility that each word in the word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the word list corresponding to each field respectively belongs to the word to be predicted.
In a possible implementation manner, the word list is a high-frequency word list, the high-frequency word list is composed of a plurality of words with a higher frequency of use in a total word list, the total word list is a pre-constructed set containing a plurality of words, and the total number of words in the total word list is greater than the total number of words in the high-frequency word list;
the method further comprising:
determining a third possibility that each word in a low-frequency word list respectively belongs to the word to be predicted based on the current word and the first context information, wherein the low-frequency word list is composed of the words in the total word list that do not belong to the high-frequency word list;
and constructing the possibility that each word in the total word list respectively belongs to the word to be predicted according to the second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted and the third possibility that each word in the low-frequency word list respectively belongs to the word to be predicted.
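The application does not give code for this step; the following is a minimal sketch of how the second likelihoods over the high-frequency word list and the third likelihoods over the low-frequency word list could be assembled into one set of likelihoods over the total word list. The function name, the dictionary layout and the example values are illustrative assumptions.

```python
# Hypothetical sketch (names, layout and values are assumptions, not from the patent):
# merge the second likelihoods (high-frequency words) and the third likelihoods
# (low-frequency words) into likelihoods over the total word list.
def combine_likelihoods(second_likelihoods, third_likelihoods):
    """second_likelihoods: word -> likelihood for high-frequency words;
    third_likelihoods: word -> likelihood for low-frequency words."""
    total = dict(second_likelihoods)   # high-frequency part of the total word list
    total.update(third_likelihoods)    # remaining (low-frequency) part
    return total

high_frequency = {"television": 1.54, "ball": 1.04}
low_frequency = {"harvester": 0.31}
print(combine_likelihoods(high_frequency, low_frequency))
```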
In another aspect, the present application further provides a word prediction apparatus, including:
an input acquisition unit configured to acquire a current word used for prediction and first context information that a word sequence before the current word has;
the domain prediction unit is used for determining the probabilities that the word to be predicted after the current word respectively belongs to a plurality of different domains based on the current word and the first context information;
a first prediction unit, configured to determine, for each of the fields, a first possibility that each word in a word list belongs to the word to be predicted based on the current word and first context information, where the first possibility is a possibility that a word in the word list belongs to the word to be predicted when the word to be predicted belongs to the field; the word list is a set which is constructed in advance and contains a plurality of words;
and the second prediction unit is used for determining second possibility that each word in the word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the word list corresponding to each field respectively belongs to the word to be predicted.
In yet another aspect, the present application further provides a computer device, including:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is for storing a program for at least:
acquiring a current word for prediction and first context information of a word sequence before the current word;
based on the current word and the first context information, determining the probabilities that the word to be predicted after the current word respectively belongs to a plurality of different fields;
for each field, determining a first possibility that each word in a word list respectively belongs to the word to be predicted based on the current word and first context information, wherein the first possibility is the possibility that the word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to the field; the word list is a set which is constructed in advance and contains a plurality of words;
and determining second possibility that each word in the word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the word list corresponding to each field respectively belongs to the word to be predicted.
In yet another aspect, the present application further provides a storage medium having stored therein computer-executable instructions, which when loaded and executed by a processor, implement the word prediction method as described in any one of the above.
As can be seen, in the embodiment of the present application, after the current word used for prediction is obtained, the probabilities that the word to be predicted after the current word (the word next to the current word) belongs to a plurality of different fields are analyzed according to the current word and the context information of the word sequence before the current word, and the possibility that each word in the word list belongs to the word to be predicted is determined for the case in which the word to be predicted belongs to each field. Because the field of the word to be predicted affects the possibility that each word in the word list belongs to the word to be predicted, the possibility that each word in the word list belongs to the word to be predicted is determined comprehensively by combining the probability that the word to be predicted belongs to each field with the possibility that each word in the word list belongs to the word to be predicted when the word to be predicted belongs to the different fields. This improves the accuracy of predicting that each word in the word list is the next word after the current word, and therefore improves the accuracy of predicting the probability of the sentence to which the current word belongs.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings based on the provided drawings without creative effort.
FIG. 1 is a schematic diagram showing the construction of a word prediction system in the present application;
FIG. 2 is a flow chart illustrating a word prediction method in an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a word prediction method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating components of a language model for implementing word prediction in an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating the training of a language model according to the present application;
FIG. 6 is a schematic flow chart diagram illustrating a word prediction method according to the present application;
FIG. 7 is a schematic flow chart diagram illustrating a word prediction method according to the present application;
FIG. 8 is a schematic diagram illustrating the composition of yet another language model to which the present application is applicable;
FIG. 9 is a schematic diagram of an application scenario in which the word prediction method of the present application is applicable;
FIG. 10 is a schematic diagram showing a configuration of a word prediction apparatus according to the present application;
FIG. 11 is a schematic view showing still another constitution of the word prediction apparatus of the present application;
FIG. 12 shows a schematic diagram of a computer device to which the present application is applicable.
Detailed Description
The scheme of the embodiment of the application is suitable for predicting, for a current word in a sentence, the possibility that each word in the word list serves as the next word after the current word and can form a sentence with the current word, so that the accuracy of predicting the probability of the next word after the current word is improved, and the accuracy of predicting the occurrence probability of a sentence consisting of the current word and the next word is further improved.
The inventor of the application found through research that a word may belong to one or more different fields; for example, a word A may be a word from the industrial field, a word from the agricultural field, a word from the scientific field, and so on. Correspondingly, the next word after the current word may also belong to one or more fields, and when the field to which the next word belongs differs, the probability distribution of each word in the word list belonging to the word to be predicted also differs. The field to which the next word belongs is not considered in the prediction process of existing language models, so the accuracy of the predicted probability distribution of each word in the word list belonging to the next word is inevitably low.
In order to improve the prediction accuracy, in the process of predicting the next word after the current word, the inventor considers the fields to which the next word may belong, predicts the possibility that each word in the word list belongs to the next word for each of these fields, and from these comprehensively determines the probability distribution that each word in the word list belongs to the next word, so that the accuracy of the probability distribution finally obtained is high.
In order to facilitate understanding of the scheme of the present application, a description is first given of a scenario used in the scheme of the present application. For example, referring to FIG. 1, there is shown a schematic diagram of one component architecture of a word prediction system used in accordance with aspects of the present application.
As can be seen from fig. 1, the word prediction system may include: a computer device 101 and a statistics server 102.
The scheme provided by the embodiment of the application relates to technologies such as speech recognition within the speech technology of artificial intelligence, and to technologies such as machine translation within natural language processing in artificial intelligence.
Artificial intelligence, speech technology and natural language processing are explained below.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of technologies, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Key technologies of Speech Technology include automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The following describes techniques such as speech recognition in the speech technology and machine translation in natural language processing, with reference to specific embodiments.
The computer device 101 may acquire a sentence to be predicted that is determined based on a speech recognition technology, a machine translation technology, or an input method technology; determine, from the sentence to be predicted, the current word to be analyzed; and, based on the current word, predict the possibility that each word in the word list serves as the word to be predicted, i.e., the next word after the current word in the sentence to be predicted.
For example, taking the field of speech recognition as an example, after a speech signal input by a user is converted into a plurality of candidate sentence texts by speech recognition, in order to determine the probability that a candidate sentence text is a correct sentence, i.e., the probability that it conforms to human language, each word in the candidate sentence text needs to be taken in turn as the current word, and the probability that each word in the word list belongs to the next word after that current word in the candidate sentence text needs to be predicted. The probability that the candidate sentence text is a correct sentence is then determined comprehensively from the words that make up the candidate sentence text and the predicted probability of the next word after each word. For example, for the candidate sentence text "very happy": if the predicted probability that "high" is the next word after "very" is 0.5, and the probability that "happy" is the next word after "high" (which may also be understood as the next word after "very high") is 0.9, then the occurrence probability of the candidate sentence may be 0.5 × 0.9 = 0.45.
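A small sketch of this sentence-scoring step is given below. It assumes, as in the example above, that the sentence probability is approximated by multiplying the predicted next-word probabilities; the function name is illustrative and not taken from the application.

```python
# Illustrative sketch (an assumption about how the per-word predictions are combined,
# not the patent's exact procedure): the occurrence probability of a candidate
# sentence is the product of the predicted probabilities of each next word.
def sentence_probability(next_word_probs):
    """next_word_probs: predicted probability of each word given the words before it."""
    p = 1.0
    for prob in next_word_probs:
        p *= prob
    return p

print(sentence_probability([0.5, 0.9]))  # 0.45, as in the example above
```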
For the field of machine translation, its application and process of predicting the next word is similar to the field of speech recognition.
As another example, in the input method field, it may be necessary to analyze and rank a plurality of candidate words that the user may need to input next, based on the words in the sentence the user has already entered (which may be a single word). In this case, the last word in the sentence is taken as the current word, and the possibility that each word in the word list can form a sentence with the current word is predicted, so that the input method can select the candidate words to be displayed, and their order, according to the prediction result.
As can be seen in FIG. 1, a language model may be deployed in the computer device 101. The sentence to be analyzed, determined by speech recognition, machine translation or an input method, can be input into the language model in the computer device, so that the probability of the next word after each word in the sentence to be analyzed is analyzed based on the language model, and a candidate sentence based on the sentence to be analyzed is further determined, or the probability of the sentence to be analyzed (i.e., the probability that it conforms to human language) is analyzed.
The language model may be a neural network-based language model, or may be other types of language models, which is not limited herein.
The data statistics server 102 can collect a plurality of words used daily by different users and send them to the computer device, so that the computer device can determine the composition of the word list; or it can generate a word list based on the collected words and feed the word list back to the computer device.
It will be appreciated that the computer device may be a server in a speech recognition, machine translation or input method system, or may be a stand-alone device with data processing capabilities.
With the above overview in mind, a word prediction method according to an embodiment of the present application is described below. For example, referring to fig. 2, which shows a flowchart of an embodiment of a word prediction method of the present application, the method of this embodiment may be applied to the computer device of the present application and may include:
s201, obtaining a current word for prediction and first context information of a word sequence before the current word.
The word currently used for predicting the occurrence probability of the next word is referred to as the current word in the embodiments of the present application. Considering that the next word after the current word is possibly any word in the word list, the application needs to predict the possibility that each word in the word list belongs to the word after the current word and can form a sentence with the current word, so the application refers to the next word after the current word as the word to be predicted.
The current word may be a word used for prediction in a sentence to be predicted. Wherein, the current word may be composed of a character string, for example, a Chinese character; or a plurality of character strings, for example, a phrase consisting of a plurality of chinese characters. Correspondingly, the word to be predicted may also be formed by one or more character strings.
It will be appreciated that the manner in which the current word is obtained may vary in different application scenarios.
For example, in one possible implementation, the word currently used for prediction may be determined from the sentence to be predicted according to the sequence of the words in the sentence to be predicted. In this case, the sentence to be predicted is the sentence for which the occurrence probability needs to be predicted, the sentence to be predicted is composed of a plurality of words, and each word in the sentence to be predicted needs to be sequentially used as the current word. For example, the sentence to be predicted may be a candidate sentence obtained through speech recognition or machine translation, the current word may be a word at any position in the candidate sentence, and the current word is a word at different positions in the candidate sentence at different times.
In yet another possible implementation, the last word in the sentence to be predicted is obtained as the current word for prediction. In this case, the language model may need to predict the likelihood of a candidate sentence composed of the sentence to be predicted and the next word after the current word. For example, the sentence to be predicted may be the text currently entered through an input method; at the current time it may not be a complete sentence (it may be only a single word, or an incomplete sentence composed of several words). In order to predict which words may be the next word after the last word of the sentence to be predicted, and how likely each of them is, the last word in the sentence to be predicted is taken as the current word.
It will be appreciated that it is possible to directly predict which words are to be predicted (next words) after the current word and the respective likelihoods of these words based on the current word in the sentence to be predicted. However, the accuracy may be low. In order to ensure the prediction accuracy, context information corresponding to the word sequence used for prediction before the current time is also combined in the prediction process in the embodiment of the application.
The word sequence before the current word may be a word sequence formed by one or more words before the current word in the sentence to be predicted; the word sequence may also be null. For example, if the current word is the first word to be predicted, such as the first word of the sentence to be predicted or the sentence to be predicted includes only one word currently, the word sequence before the current word is empty, in which case the word sequence has empty context information.
The context information represents the semantic relationship between words; the context information of a word sequence is therefore the semantic relationship between the words in the word sequence. For ease of distinction, the context information of the word sequence before the current word is referred to as the first context information in the embodiments of the present application.
Alternatively, the current word may be represented by a word vector, and the first context information may also be represented by a vector.
Alternatively, the context information may be derived based on semantic understanding techniques in natural language processing. Semantic understanding techniques include, but are not limited to: lexical analysis, syntactic analysis, semantic analysis, pragmatic analysis, contextual reasoning, emotional analysis, and the like.
S202, based on the current word and the first context information, determining the probabilities that the word to be predicted after the current word respectively belongs to a plurality of different fields.
The inventor of the present application found through research that the word to be predicted after the current word may belong to one or more fields, and the field to which the word to be predicted belongs influences the possibility that each word in the word list belongs to the word to be predicted. Therefore, in the embodiment of the present application, the probabilities that the word to be predicted belongs to a plurality of fields are determined based on the current word and the first context information of the word sequence before the current word.
It is understood that the tendency degree of the word to be predicted after the current word belonging to each field can be analyzed based on the current word and the first context information, and the tendency degree can be reflected by probability.
For example, in one possible case, the degree of association between different semantic relationships and different fields may be analyzed in advance, so that the degree of association between the word to be predicted and each field can be obtained based on the semantic relationship represented by the current word and the first context information.
In yet another possible case, the domain distribution model may be trained in advance, and the domain distribution model may be trained by using a plurality of sentence samples. Then, according to the current word and the first context information, and by using the domain distribution model, the probability that the word to be predicted of the current word belongs to a plurality of different domains can be predicted.
The domain distribution model may be set as needed, for example, the domain distribution model may be a recurrent neural network model, such as a Long Short-Term Memory (LSTM) model.
In this case, training the domain distribution model using the plurality of sentence samples may consist of training a preset network model with the plurality of sentence samples, the trained network model serving as the domain distribution model. For example, the order of the words in each sentence sample is fixed; based on the order of the words in the sentence sample and the labeled field of each word in the sentence sample, the network model can be trained until the degree of difference between the field output by the network model for each word and the actual label meets the requirement.
Optionally, in order to improve the accuracy of predicting the probabilities that the word to be predicted belongs to different domains, the domain distribution model may be a model included in the language model, so that the domain distribution model can be trained together with the language model in the process of training the language model with a plurality of sentence samples; this is described in the following content.
Alternatively, the domain distribution model and the language model may be based on machine learning training in artificial intelligence.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and formula learning.
The domain distribution model and the language model as mentioned above may be obtained by training a network model, which may be an artificial neural network model, with a plurality of sentence samples.
Artificial neural network models include, but are not limited to: a convolutional neural network model, a cyclic neural network model, and a deep neural network model.
S203, aiming at each field, determining first possibility that each word in the word list respectively belongs to the word to be predicted based on the current word and the first context information.
The word list is a set constructed in advance that comprises a plurality of words. A word list is sometimes also referred to as a vocabulary, a corpus, and the like. The words in the word list may be words that users are likely to use in daily life, obtained through big data analysis and the like. For example, the word list may include 200,000 words that users may use daily.
Wherein, for each domain, the first possibility is a possibility that a word in the word list belongs to the word to be predicted in the case that the word to be predicted belongs to the domain. For a domain, the first likelihood of a word in the vocabulary may also be considered as the likelihood that the word belongs to the word to be predicted if the word belongs to the domain.
The first likelihood may be represented in various manners such as a numerical value and a rank, for example, the first likelihood may be a numerical value, and the greater the value of the first likelihood of a word is, the higher the likelihood degree that the word belongs to the word to be predicted in the word list is.
It is understood that, for each field, since there are a plurality of words in the word list, each of these words has a first likelihood, so that each field has a corresponding first likelihood distribution. The first likelihood distribution corresponding to a field is the distribution of the first likelihoods that each word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to that field.
The first likelihood distribution includes first likelihoods that the words in the vocabulary are respectively the words to be predicted, for example, the first likelihood distribution may be a vector, each element in the vector points to a word in the vocabulary, and the specific value of the element is the first likelihood that the word pointed to by the element belongs to the words to be predicted.
Unlike current language models, which do not consider fields and directly predict the probability distribution that each word in the word list belongs to the word to be predicted, the present application predicts, for each of a plurality of different fields, the possibility that each word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to that field.
In order for the language model to predict, for multiple fields, the first possibility that each word in the word list belongs to the word to be predicted, a plurality of prediction functions can be set in the language model; in the process of training the language model with a plurality of sentence samples, different prediction functions come to correspond to different fields. Each prediction function predicts, based on the current word and the first context information, a first possibility that each word in the word list belongs to the word to be predicted; but because the fields corresponding to the prediction functions differ, the first likelihoods predicted by the prediction functions are not the same, i.e., different prediction functions produce different first likelihood distributions.
For example, if the language model is a neural-network-based language model, the prediction function may be the logits function in the output layer of the language model. The word vector of the current word and the vector of the first context information are converted, through the logits function, into a vector whose dimension equals the size of the word list, i.e., the output logits; the logits output by the logits function represent the logarithm of the ratio between the probability that an event occurs and the probability that it does not occur. Accordingly, the logit corresponding to each word is the first possibility corresponding to that word.
S204, determining second possibility that each word in the word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the word list corresponding to each field respectively belongs to the word to be predicted.
For example, the weight of each field is determined according to the probability that the word to be predicted belongs to each field, for example, the probability that the word to be predicted belongs to one field is used as the weight of the field, and correspondingly, based on the weight of each field, the weighted sum is performed on the first possibility that each word in the word list respectively corresponding to each field belongs to the word to be predicted, and the weighted sum result is the second possibility that each word in the word list respectively belongs to the word to be predicted.
For example, suppose the fields include two fields, industry and agriculture, and the word list includes three words, namely {"ball", "television", "go out"}. Assume that the probability that the word to be predicted belongs to the industrial field is 0.6, the probability that it belongs to the agricultural field is 0.4, and the magnitude of the first possibility is represented by a numerical value. If the word to be predicted belongs to the industrial field, the first possibilities that the words in the word list respectively belong to the word to be predicted are: {"ball" = 1.2, "television" = 1.5, "go out" = 0.2}; if the word to be predicted belongs to the agricultural field, the first possibilities are: {"ball" = 0.8, "television" = 1.6, "go out" = 0.4}. The probability 0.6 corresponding to the industrial field is multiplied by the first likelihood distribution corresponding to the industrial field to obtain a first result, and the probability 0.4 corresponding to the agricultural field is multiplied by the first likelihood distribution corresponding to the agricultural field to obtain a second result. The first result is then added to the second result, which is expressed in more detail as follows:
0.6 × {"ball" = 1.2, "television" = 1.5, "go out" = 0.2} + 0.4 × {"ball" = 0.8, "television" = 1.6, "go out" = 0.4} = {"ball" = 0.6 × 1.2 + 0.4 × 0.8 = 1.04, "television" = 0.6 × 1.5 + 0.4 × 1.6 = 1.54, "go out" = 0.6 × 0.2 + 0.4 × 0.4 = 0.28}. That is, the second likelihood value that "ball" in the word list belongs to the word to be predicted is 1.04, the second likelihood value for "television" is 1.54, and the second likelihood value for "go out" is 0.28.
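The arithmetic above can be written as a short sketch; the variable names are illustrative and the numbers are those of the example.

```python
# Minimal sketch of the weighted combination described above (variable names are
# illustrative): the domain probabilities act as weights over the per-domain first
# likelihoods, giving the second likelihood of each word in the word list.
domain_probs = {"industry": 0.6, "agriculture": 0.4}
first_likelihoods = {
    "industry":    {"ball": 1.2, "television": 1.5, "go out": 0.2},
    "agriculture": {"ball": 0.8, "television": 1.6, "go out": 0.4},
}

second_likelihoods = {
    word: sum(domain_probs[d] * first_likelihoods[d][word] for d in domain_probs)
    for word in first_likelihoods["industry"]
}
print(second_likelihoods)  # approximately {'ball': 1.04, 'television': 1.54, 'go out': 0.28}
```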
Of course, the above is only one implementation manner of comprehensively determining the second possibility that each word belongs to the word to be predicted by combining the probabilities that the word to be predicted belongs to different fields and the first possibility that each word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to different fields, and in practical application, there may be other possible implementation manners, which is not limited herein.
It can be understood that, where the second possibility is represented by a numerical value, the second likelihood values with which the words in the word list belong to the word to be predicted differ in magnitude, so it is difficult to compare intuitively how likely each word in the word list is to be the word to be predicted. Therefore, optionally, the second possibilities that the words in the word list respectively belong to the word to be predicted may be normalized to obtain a probability distribution in which each word in the word list belongs to the word to be predicted. The probability distribution comprises the probability that each word in the word list belongs to the word to be predicted. After normalization, the sum of the probabilities that all words in the word list belong to the word to be predicted is one.
For example, in one possible mode, the second possibilities that the words in the word list respectively belong to the word to be predicted may be processed by a softmax function, so as to output the probability distribution obtained after normalization. For example, if the word list includes C words, the C words correspond to C second likelihoods, where the second likelihood of the i-th word is denoted v_i. The probability S_i obtained by normalizing the second likelihood v_i of the i-th word with the softmax function is represented as follows:
S_i = exp(v_i) / (exp(v_1) + exp(v_2) + ... + exp(v_C)) (formula one)
of course, this is only an example of normalization, but normalization by other normalization functions is also applicable to this embodiment.
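A plain softmax sketch corresponding to formula one follows; the input values reuse the second likelihoods of the earlier word-list example and are otherwise arbitrary.

```python
# Standard softmax normalization of the second likelihoods (example values only).
import math

def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

second_likelihoods = [1.04, 1.54, 0.28]   # "ball", "television", "go out"
probabilities = softmax(second_likelihoods)
print(probabilities, sum(probabilities))  # the normalized probabilities sum to one
```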
As can be seen from the above, in the embodiment of the present application, after the current word used for prediction is obtained, the probabilities that the word to be predicted after the current word (the word next to the current word) belongs to a plurality of different fields are analyzed according to the current word and the context information of the word sequence before the current word, and the possibility that each word in the word list belongs to the word to be predicted is determined for the case in which the word to be predicted belongs to each field. Because the field of the word to be predicted affects the possibility that each word in the word list belongs to the word to be predicted, the possibility that each word in the word list belongs to the word to be predicted is determined comprehensively by combining the probability that the word to be predicted belongs to each field with the possibility that each word in the word list belongs to the word to be predicted when the word to be predicted belongs to the different fields. This improves the accuracy of predicting that each word in the word list is the next word after the current word, and therefore improves the accuracy of predicting the probability of the sentence to which the current word belongs.
It is to be understood that, in order to improve the prediction accuracy, after the current word and the first context information are obtained, second context information for characterizing the semantic relationship between the current word and the word sequence before the current word may be further determined based on the current word and the first context information. The second context information may reflect each word and semantic association between each word in a sentence composed of the current word and a word sequence before the current word.
Accordingly, in the above step S202, probabilities that the to-be-predicted word respectively belongs to different domains may be determined based on the second context information; in step S203, a first possibility that each word in the vocabulary respectively belongs to the word to be predicted may be determined according to the second context information.
Furthermore, when the next current word is predicted, the second context information and the next current word can be used as input information to be input into the language model, so that the prediction accuracy is improved.
For the convenience of understanding, a language model including a domain distribution model and a plurality of prediction functions corresponding to different domains is taken as an example for explanation, for example, referring to fig. 3, it shows a schematic diagram of another implementation flow of the word prediction method according to the embodiment of the present application, and the flow is applicable to the computer device of the present application. The process may include:
s301, obtaining a word vector w (t) of a current word for prediction and first context information S (t-1) determined by a pre-trained language model at the last time.
A word vector may also be referred to as word embedding, among others.
It will be appreciated that the current word for prediction differs for different times in the language model, and when the language model needs to be based on semantic relationships between adjacent words, the first context information most recently determined by the language model is actually the first context information that the word sequence before the current word has.
Wherein the context information may be represented by a vector. For the sake of distinction, first context information that a word sequence before a current word has is denoted as s (t-1), and second context information that subsequently denotes a semantic relationship between the current word and the word sequence before the current word is denoted as s (t).
In this embodiment, the language model includes a domain distribution model for determining the domain to which the next word after the current word belongs, and prediction functions respectively corresponding to a plurality of different domains. In this case, the language model, and the domain distribution model and prediction functions within it, are obtained by joint training with a plurality of sentence samples.
S302, the current word w(t) and the first context information s(t-1) are converted, through the language model, into second context information s(t) representing the semantic relationship between the current word and the word sequence before the current word.
For example, the current word w (t) and the first context information s (t-1) may be transformed according to a preset functional relationship, and the second context information s (t) may be obtained.
For example, s(t) can be calculated by the following formula two:
s(t) = sigmoid(U·w(t) + W·s(t-1)) (formula two);
where sigmoid is a set activation function, and U and W are preset parameters (weight matrices) that are determined in the process of training the language model.
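A short numerical sketch of formula two follows; the dimension sizes and the random initialization are assumptions for illustration, not values given in the application.

```python
# Illustrative sketch of formula two (dimension sizes and initialization are assumed).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, embed_size = 4, 3
U = np.random.randn(hidden_size, embed_size)   # parameters learned when training the language model
W = np.random.randn(hidden_size, hidden_size)  # parameters learned when training the language model

w_t = np.random.randn(embed_size)     # word vector of the current word, w(t)
s_prev = np.zeros(hidden_size)        # first context information, s(t-1)

s_t = sigmoid(U @ w_t + W @ s_prev)   # second context information, s(t)
print(s_t)
```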
A Recurrent Neural Network Based Language Model (RNNLM) is taken as an example for explanation. Referring to fig. 4, a partial component diagram of the language model RNNLM is shown.
As can be seen from fig. 4, the input of the language model includes, in addition to the word vector w(t) of the current word currently used for prediction, the output vector s(t-1) of the hidden layer preceding the hidden layer corresponding to the current word. Here s(t-1) in fact represents the semantic relationship among the words of the word sequence input into the language model before the current word. Correspondingly, the word vector w(t) and the previous hidden layer output vector s(t-1) are input into the current hidden layer corresponding to the current word to obtain the output vector s(t) of the current hidden layer, and s(t) represents the semantic relationship between the current word represented by w(t) and the words in the word sequence before the current word.
S303, the second context information s(t) is input, through the language model, into the domain distribution model, and the probabilities that the word to be predicted after the current word respectively belongs to different domains are determined by the domain distribution model.
Alternatively, the domain distribution model may be a time-recursive neural network model such as an LSTM. The specific way in which the domain distribution model determines the probabilities that the word to be predicted belongs to different domains can be seen in the related description of the previous embodiments.
S304, the second context information s(t) is input into the prediction function corresponding to each field, and the first likelihood distribution output by each prediction function is obtained.
The first likelihood distribution comprises the first likelihoods that the words in the word list respectively belong to the word to be predicted. For example, the first likelihood distribution may be a vector whose dimension equals the number of words in the word list, and the values of the different dimensions of the vector indicate the likelihood that the corresponding words in the word list belong to the word to be predicted.
It should be noted that the order of step S303 and step S304 is not limited to that shown in fig. 3; in practical applications, these two steps may be executed simultaneously, or step S304 may be executed before step S303.
S305, weighted summation is performed based on the probability corresponding to each field and the first likelihood distribution output by the prediction function corresponding to each field, to obtain a second likelihood distribution.
Wherein the second likelihood distribution includes a second likelihood that each word in the vocabulary respectively belongs to a word to be predicted.
For example, assume there are n fields and, correspondingly, n prediction functions, each prediction function corresponding to one field. Denote the probability that the word to be predicted belongs to the i-th field as p_i, and denote the first likelihood distribution output by the prediction function corresponding to the i-th field as the vector l_i, where i ranges from 1 to n. The second likelihood distribution P_l in which each word in the word list belongs to the word to be predicted can then be obtained by the following formula three:
P_l = p_1·l_1 + p_2·l_2 + ... + p_n·l_n (formula three)
for the sake of understanding, the language model is still taken as the RNNLM model, and is described with reference to fig. 4.
As can be seen from fig. 4, the language model further includes the domain distribution model. Unlike the conventional approach in which only one prediction function is set, n prediction functions are set in the language model, where n is the number of fields and can be set as required, and the fields corresponding to the n prediction functions are different. In fig. 4 the prediction function is the logits function.
As can be seen from fig. 4, after the hidden layer output vector s (t) is output by the current hidden layer corresponding to the current word, s (t) is not only input into the domain distribution model, but also input into a plurality of predictor functions.
And the domain distribution model analyzes the probability that the next word to be predicted after the current word belongs to each domain based on the s (t). In fig. 4, the corresponding probability of each domain is taken as the weight of the subsequent weighting calculation, so the probability that the word to be predicted belongs to the first domain is represented as weight 1, the probability that the word to be predicted belongs to the second domain is represented as weight 2, and so on, the probability that the word to be predicted belongs to the nth domain is represented as weight n.
The logits function corresponding to each field outputs a vector of logits. The logits are in fact an unnormalized probability distribution; each dimension of the logits vector represents one word in the word list, different dimensions representing different words, and the value of each dimension indicates how likely the word it represents is to be the word to be predicted under the condition that the word to be predicted belongs to that field. In fig. 4, the logits output by the prediction function corresponding to the first domain are denoted logits1, the logits output by the prediction function of the second domain are denoted logits2, and so on, with the logits output by the prediction function of the n-th domain denoted logitsn.
Correspondingly, in order to combine the probabilities that the word to be predicted belongs to the respective fields with the first likelihood distributions (logits) output by the prediction functions corresponding to the respective fields, and thus determine the second likelihood distribution in which the words in the word list belong to the word to be predicted, weighted summation is performed based on the weight of each field and the logits output by the prediction function of that field, so as to obtain the weighted logits.
S306, normalizing the plurality of second possibilities included in the second possibility distribution to obtain probability distribution representing that each word in the word list belongs to the word to be predicted respectively.
It is to be understood that the second likelihood distribution is an unnormalized probability distribution. Thus, although a larger second likelihood value for a word in the word list means a higher probability that the word belongs to the word to be predicted, the second likelihood distribution is not the probability distribution finally output by a conventional language model, and it may be difficult to see intuitively from the second likelihood distribution which words are likely to be the word to be predicted.
As explained in connection with fig. 4, the output layer of the language model in fig. 4 has a softmax function in addition to the aforementioned logits functions (prediction functions); the prediction functions corresponding to the plurality of domains and the softmax function together constitute the output layer of the language model RNNLM.
As can be seen from fig. 4, after the weighted sum of the logits output by the prediction functions is obtained, the weighted logits are input into the softmax function, so that the softmax function outputs the probability distribution in which each word in the word list belongs to the word to be predicted.
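For illustration, a compact sketch of the forward pass of fig. 4 follows. It simplifies the domain distribution model to a single softmax layer (rather than an LSTM) and treats each per-domain prediction function as a linear logits layer, so all shapes, layer choices and initializations are assumptions, not details fixed by the application.

```python
# Non-authoritative sketch of the forward pass in fig. 4: the hidden state s(t) feeds
# a (simplified) domain distribution layer and n per-domain logits layers; the logits
# are mixed with the domain weights and normalized with softmax. Shapes are assumed.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

hidden_size, vocab_size, n_domains = 4, 5, 3
rng = np.random.default_rng(0)

W_domain = rng.standard_normal((n_domains, hidden_size))               # domain distribution layer
W_logits = rng.standard_normal((n_domains, vocab_size, hidden_size))   # one logits layer per domain

s_t = rng.standard_normal(hidden_size)                 # hidden state for the current word

domain_weights = softmax(W_domain @ s_t)               # weight 1 ... weight n
per_domain_logits = W_logits @ s_t                     # logits1 ... logitsn, shape (n_domains, vocab_size)
weighted_logits = domain_weights @ per_domain_logits   # weighted sum of the logits
probabilities = softmax(weighted_logits)               # probability of each word in the word list
print(probabilities, probabilities.sum())
```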
It can be understood that, for a sentence to be predicted, the language model takes each word of the sentence in turn as the current word for prediction; after the language model has predicted, for each current word, the probability distribution in which each word in the word list belongs to the word to be predicted after that current word, it may determine the occurrence probability of the sentence to be predicted from these predicted probability distributions. Alternatively, after the language model has predicted, for a sentence to be predicted, the probability distribution in which each word in the word list belongs to the word to be predicted after the current word, the occurrence probability of each candidate sentence composed of the sentence to be predicted and a word of the word list can be determined from the probability distribution. The specific manner of determining the occurrence probability of the sentence to be predicted, or of the candidate sentences composed of the sentence to be predicted and the words of the word list, is not limited here.
It can be understood that, when the domain distribution model and the plurality of prediction functions are provided inside the language model, the language model may be trained with a plurality of sentence samples, and when training of the language model is completed, the domain distribution model and the prediction functions inside it have been trained as well. In this case, the sentence samples used to train the language model, the domain distribution model and the prediction functions are the same, which helps to ensure that the domain information predicted by the domain distribution model matches the domain information represented by the prediction functions in the language model, and thereby helps to improve prediction accuracy. Compared with setting up a domain distribution model independently of the language model and training it separately, placing the domain distribution model inside the language model yields higher accuracy when predicting the possibility that each word in the word list belongs to the word to be predicted.
To facilitate understanding of the process of training the language model, the following description gives one example of a training method. Referring to fig. 5, which shows a schematic flow chart of an implementation of the method of training the language model, the flow may include:
S501, obtaining a plurality of sentence samples for training.
Wherein each sentence sample comprises one or more words. The sequence of each word in each sentence sample is fixed.
S502, aiming at each statement sample, inputting the statement sample into a language model to obtain the occurrence probability of the statement sample predicted by the language model.
S503, judging whether the prediction accuracy of the language model meets the requirement or not according to the occurrence probability of each statement sample predicted by the language model, and if so, finishing training; if not, adjusting the language model, the domain distribution function in the language model and the relevant parameters of each pre-estimated function, and returning to execute the step S502.
It will be appreciated that the sentence samples are sentences that conform to human language. For each sentence sample, the position of each word in the sample is fixed, so the higher the occurrence probability the language model predicts for a sentence sample, the higher the prediction accuracy of the model. Accordingly, the prediction accuracy of the language model can ultimately be analyzed from the occurrence probabilities it predicts for the sentence samples.
It is understood that, in the case that the prediction accuracy of the language model meets the requirement, it indicates that, for each current word in the sentence sample, the accuracy of the probability that the word to be predicted after the current word predicted by the domain distribution model in the language model belongs to each domain also meets the requirement. Correspondingly, the first probability distribution estimated by the estimation function corresponding to each field in the language model also meets the requirement, so that the field distribution model and the estimation function in the language model are trained when the language model is trained.
It should be noted that fig. 5 is only used for facilitating understanding of the process of training the language model, and is simply described as a way of training the language model, but it is to be understood that in practical applications, there are other possibilities for the way of training the language model, and the case of training the language model by other ways is also applicable to the present application.
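For illustration only, one possible concrete form of the loop in fig. 5 is sketched below. The method names sentence_log_probability and update_parameters are hypothetical placeholders for whatever the actual implementation provides; the stopping criterion based on average negative log-likelihood is likewise only one way to decide that "the prediction accuracy meets the requirement".

```python
def train(language_model, sentence_samples, nll_threshold, max_rounds=1000):
    """Sketch of fig. 5: predict occurrence probabilities for the sentence samples,
    check whether prediction accuracy meets the requirement, otherwise adjust the
    model (including its domain distribution model and prediction functions)."""
    for _ in range(max_rounds):
        log_probs = [language_model.sentence_log_probability(s) for s in sentence_samples]
        avg_nll = -sum(log_probs) / len(log_probs)   # lower means higher predicted occurrence probability
        if avg_nll <= nll_threshold:                 # accuracy requirement met
            break
        language_model.update_parameters(sentence_samples)  # e.g. one gradient step
```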
In the foregoing embodiment of the word prediction method, because the number of words in the word list is large, predicting the first likelihood distributions corresponding to the word list for each of the plurality of fields (for convenience of description, the first likelihoods that the words in the word list belong to the word to be predicted are collectively referred to as the first likelihood distribution corresponding to the word list) and performing the weighted summation over the predicted first likelihood distributions inevitably increases memory usage; and because the amount of data to be processed is large, the prediction speed is also affected, so the prediction efficiency is relatively low.
In order to further reduce memory occupation and improve prediction efficiency while preserving prediction accuracy, the inventor of the present application observed that although the number of words in the word list is large, the number of words people commonly use is relatively small: the word list may contain on the order of 200,000 words, while the commonly used words may number only about 10,000 to 20,000. Based on this, only the commonly used words in the word list need to be predicted per field, that is, the first likelihood distributions in which the common words belong to the word to be predicted are predicted and the corresponding weighted summation is performed; for the remaining uncommon words, the possibility that they belong to the word to be predicted can be predicted directly, without per-field prediction. Because the uncommon words are used infrequently, only a small amount of prediction accuracy is lost, while memory occupation can be greatly reduced and prediction efficiency improved.
Based on this finding, the present application may further divide the total word list containing all words into two parts according to the frequency with which users use the words: a high-frequency word list and a low-frequency word list. The total word list is a pre-constructed word list containing a plurality of words, and the total number of words in it is greater than the number of words in the high-frequency word list and, of course, also greater than the number of words in the low-frequency word list. In this case, the total word list corresponds to the word list of the previous embodiment. The high-frequency word list is formed from the words in the total word list with the highest frequency of use; for example, ranking the words from high to low by frequency of use, the words whose rank falls within a specified leading range are taken as the words of the high-frequency word list. Correspondingly, the low-frequency word list is composed of the words in the total word list that do not belong to the high-frequency word list, so the frequency of use of words in the low-frequency word list is lower than that of words in the high-frequency word list. The frequency of use of each word can be obtained by statistically analyzing how often users use it.
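A minimal sketch of this split, assuming usage counts have already been obtained from the data statistics described above; the function and variable names are illustrative, not taken from the patent.

```python
def split_vocabulary(usage_counts, high_freq_size):
    """Split the total word list into a high-frequency list and a low-frequency list.

    usage_counts:   dict mapping each word in the total word list to its usage count.
    high_freq_size: how many of the most frequently used words go into the
                    high-frequency word list.
    """
    ranked = sorted(usage_counts, key=usage_counts.get, reverse=True)
    return ranked[:high_freq_size], ranked[high_freq_size:]
```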
Accordingly, different processing may be performed for words in the high frequency vocabulary and words in the low frequency vocabulary. For example, referring to fig. 6, a flow diagram of a word prediction method of the present application is shown. The method of this embodiment may be applied to the computer device of the present application, and the method may include:
S601, acquiring a current word for prediction and first context information of a word sequence before the current word.
S602, based on the current word and the first context information, determining probabilities that words to be predicted behind the current word respectively belong to a plurality of different fields.
The above steps S601 and S602 can refer to the related description of the previous embodiment, and are not described herein again.
S603, aiming at each field, determining first possibility that each word in the high-frequency word list respectively belongs to the word to be predicted based on the current word and the first context information.
Wherein, for each domain, the first possibility is the possibility that a word in the high-frequency vocabulary belongs to the word to be predicted under the condition that the word to be predicted belongs to the domain.
For the sake of distinction, the words in the high frequency vocabulary may be referred to as high frequency words, and the words in the low frequency vocabulary may be referred to as low frequency words. It is understood that, in the embodiment of the present application, only for the high-frequency vocabulary, in the case that the word to be predicted belongs to each domain, the first possibility that each high-frequency word in the high-frequency vocabulary belongs to the word to be predicted is predicted. Therefore, the number of the high-frequency words needing to be calculated according to the field is relatively small, so that the memory occupation is reduced, the data processing amount is reduced, and the data processing efficiency is improved.
The difference between the high-frequency word list and the word list in the foregoing embodiment is only the number of words, and the process of predicting the first possibility that each word in the high-frequency word list belongs to the word to be predicted is the same as the process of predicting the first possibility that each word in the word list in the foregoing embodiment belongs to the word to be predicted, which may specifically refer to the related description in the foregoing embodiment, and is not described herein again.
Similar to the previous embodiment, optionally, after the step S601, second context information for characterizing a semantic relationship between the current word and a word sequence before the current word may be further determined based on the current word and the first context information. Accordingly, a first possibility that each word in the high-frequency word list respectively belongs to the word to be predicted can be determined based on the second context information.
S604, determining second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the high-frequency word list corresponding to each field respectively belongs to the word to be predicted.
Because the number of words in the high-frequency word list is small, the number of first possibilities to be calculated for the high-frequency words is small, that is, the first likelihood distributions containing the first possibilities of the high-frequency words are of small dimension. Therefore, in the process of determining the second possibility that each word in the high-frequency word list belongs to the word to be predicted, the amount of data to be processed is small, which reduces memory occupation and improves processing efficiency.
S605, determining a third possibility that each word in the low-frequency word list respectively belongs to the word to be predicted based on the current word and the first context information.
For the convenience of distinction, the possibility that a word in the low-frequency vocabulary belongs to the word to be predicted is referred to as a third possibility.
It is understood that although the words in the low frequency vocabulary are used relatively less frequently, the words in the low frequency vocabulary may be words in the sentence to be predicted or words forming a new sentence with the sentence to be predicted.
For example, consider a candidate sentence recognized from speech in which an uncommon word (a rarely used character) immediately follows the word "flying". In order to predict the occurrence probability of this candidate sentence, the probability of each word given the word sequence before it needs to be predicted in turn; in particular, the probability that this uncommon word is the next word after "flying" must be predicted. Since the uncommon word belongs to the low-frequency word list, the possibilities that the words in the low-frequency word list respectively belong to the word to be predicted (that is, the third possibilities) are predicted from the context information corresponding to "flying" and the preceding words, and from this the possibility that the uncommon word is the next word after "flying" is obtained.
That is, the manner of determining the third possibility for each word in the low-frequency word list does not consider the field to which the next word after the current word belongs; the possibility that each word in the low-frequency word list belongs to the word to be predicted is determined directly based on the current word and the first context information.
S606, according to the second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted and the third possibility that each word in the low-frequency word list respectively belongs to the word to be predicted, the possibility that each word in the total word list respectively belongs to the word to be predicted is built.
It can be understood that the dimension of the total word list is the sum of the number of words in the high-frequency word list and the number of words in the low-frequency word list, and the words in the high-frequency word list and the low-frequency word list are not overlapped, so that the second possibility that each word in the high-frequency word list belongs to the word to be predicted is combined with the third possibility that each word in the low-frequency word list belongs to the word to be predicted, the possibility that each word in the high-frequency word list and the low-frequency word list belongs to the word to be predicted can be constructed, that is, the possibility that each word in the total word list respectively belongs to the word to be predicted is obtained.
For example, suppose that the high-frequency word list includes word 1 and word 2, and meanwhile suppose that the second possibility that word 1 belongs to the word to be predicted represents possibility 1, and the second possibility that word 2 belongs to the word to be predicted is possibility 2; and the third possibility that the word 3, the word 4, the word 5, the word 6 and the word 7 belong to the words to be predicted respectively is that: possibility 3, possibility 4, possibility 5, possibility 6 and possibility 7. Combining the two parts to obtain the possibility that each word in the total word list belongs to the word to be predicted may include: word 1: possibility 1; word 2: possibility 2; word 3: possibility 3; word 4: possibility 4; the word 5: possibility 5; word 6: possibility 6; the word 7: possibility 7.
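Assuming the total word list is ordered with the high-frequency words first and the low-frequency words after them (an ordering chosen here for illustration only), the combination above is simply a concatenation of the two score vectors.

```python
import numpy as np

def build_total_possibilities(high_freq_scores, low_freq_scores):
    """Concatenate the second possibilities (high-frequency words) and the third
    possibilities (low-frequency words) into one vector covering the total word list."""
    return np.concatenate([np.asarray(high_freq_scores, dtype=float),
                           np.asarray(low_freq_scores, dtype=float)])
```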
S607, the possibility that each word in the total word list respectively belongs to the word to be predicted is normalized, and the probability distribution that each word in the total word list respectively belongs to the word to be predicted is obtained.
The step S607 is an optional step, and is intended to obtain probability distribution of each word in the total word list belonging to the word to be predicted by normalizing the possibility of each word in the total word list belonging to the word to be predicted, so as to intuitively know the degree of possibility of different words in the word list belonging to the word to be predicted.
Similar to the previous embodiment, in the case where the total word list is divided into the high-frequency word list and the low-frequency word list, the language model may also include a domain distribution model and prediction functions respectively corresponding to the plurality of domains. The difference is that the prediction functions corresponding to the plurality of fields in the language model predict possibilities for the words in the high-frequency word list, and the language model additionally includes a prediction function corresponding to the low-frequency word list, which predicts possibilities for the words in the low-frequency word list. Specifically, referring to fig. 7, which shows another schematic flow chart of a word prediction method according to the present application, the embodiment is applicable to a computer device according to the present application, and the flow may include:
S701, obtaining the word vector w(t) of the current word for prediction and the first context information S(t-1) most recently determined by a pre-trained language model.
S702, the current word w (t) and the first context information S (t-1) are converted into second context information S (t) representing the semantic relation between the current word and the word sequence before the current word through the language model.
And S703, inputting the second context information S (t) into the domain distribution model through the language model, so as to determine the probability that the words to be predicted behind the current word respectively belong to different domains through the domain distribution model.
The above steps S701 to S703 can refer to the related description of the previous embodiment, and are not described herein again.
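One possible concrete form of the conversion in step S702 is a simple-RNN-style hidden-layer update, sketched below. The weight matrices are illustrative assumptions; the patent only requires that the current word vector and the previous context be converted into the new context s(t).

```python
import numpy as np

def hidden_step(w_t, s_prev, W_in, W_rec, b):
    """Turn the current word vector w(t) and previous context s(t-1) into s(t).

    W_in:  input-to-hidden weight matrix, shape (hidden_dim, embed_dim)
    W_rec: hidden-to-hidden weight matrix, shape (hidden_dim, hidden_dim)
    b:     hidden bias, shape (hidden_dim,)
    """
    return np.tanh(W_in @ w_t + W_rec @ s_prev + b)
```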
S704, the second context information S (t) is respectively input into the high-frequency estimation functions corresponding to each field through the language model, and first probability distribution output by each high-frequency estimation function is obtained.
Wherein the first likelihood distribution comprises first likelihoods that words in the high-frequency vocabulary respectively belong to predicted words. For example, the first likelihood distribution may be a vector, the dimension of the vector is the same as the number of words in the high frequency vocabulary, and the value of different dimensions in the vector of the first likelihood distribution indicates the value of the likelihood that different words in the high frequency vocabulary belong to the word to be predicted.
In this method, since the prediction functions corresponding to the respective fields perform possibility prediction only on the words in the high-frequency word list, they are called high-frequency prediction functions, and the prediction function corresponding to the low-frequency word list is called the low-frequency prediction function, so as to distinguish between the two.
S705, weighted summation is carried out on the basis of the probability corresponding to each field and the first probability distribution output by the high-frequency estimation function corresponding to each field, and a second probability distribution is obtained.
Wherein the second likelihood distribution includes a second likelihood that each word in the high frequency vocabulary respectively belongs to a word to be predicted.
For ease of understanding, the language model is again taken to be an RNNLM, and the description refers to fig. 8. As can be seen from fig. 8, the language model includes a domain distribution model and n high-frequency prediction functions corresponding to n different domains, where n is the number of domains. In fig. 8, all prediction functions are assumed to be logits functions, so the high-frequency prediction functions can also be described as high-frequency logits functions.
As can be seen from fig. 8, the process of determining the current hidden layer output vector s (t) is the same as that of fig. 4. After s (t) is obtained, s (t) is input into the domain distribution model and also input into each high frequency estimation function respectively.
Similarly to fig. 4, in fig. 8, the probability corresponding to each domain is recorded as a weight, accordingly, the probability that the word to be predicted belongs to the first domain is represented as a weight 1, the probability that the word to be predicted belongs to the second domain is represented as a weight 2, and so on, the probability that the word to be predicted belongs to the nth domain is represented as a weight n.
Accordingly, the high-frequency logits function corresponding to each domain outputs a vector of logits. The logits output by the high-frequency prediction function corresponding to the first field are denoted logits1, the logits output by the high-frequency prediction function corresponding to the second field are denoted logits2, and so on, with the logits output by the high-frequency prediction function of the nth field denoted logits n.
Further, a weighted summation is performed based on the weight of each field and the logits output by the high-frequency prediction function of each field, yielding the weighted high-frequency logits, which represent the second likelihood distribution in which each word in the high-frequency word list belongs to the word to be predicted.
S706, inputting the second context information S (t) to a low-frequency prediction function corresponding to the low-frequency vocabulary through the language model, and obtaining a third possibility distribution output by the low-frequency prediction function.
And the third possibility distribution comprises the possibility that each word in the low-frequency word list respectively belongs to the word to be predicted.
As shown in fig. 8, in addition to the domain distribution model and the n high-frequency prediction functions corresponding to the n different domains, the language model includes a prediction function for predicting the possibility that each word in the low-frequency word list belongs to the word to be predicted, that is, the low-frequency prediction function.
Correspondingly, the s(t) output by the hidden layer is also input into the low-frequency prediction function, which predicts, based on s(t), the logits representing the possibility that each word in the low-frequency word list belongs to the word to be predicted; for ease of distinction, the logits output by the low-frequency prediction function are denoted the low-frequency logits.
Optionally, in consideration of the large number of words in the low-frequency vocabulary, if the probability that each word in the low-frequency vocabulary belongs to the word to be predicted is calculated by directly using the prediction function, the prediction function needs to output a vector with a large dimension, so that the calculation efficiency of the prediction function is affected. In order to improve the calculation efficiency of the prediction function, the dimension reduction may be performed on the low-frequency vocabulary, so that a plurality of words in the low-frequency vocabulary may be divided into m groups, where m is a natural number greater than or equal to 2, and may be specifically set as needed. Wherein each group comprises a plurality of words, and the sum of the usage frequencies of the words in each group is equal.
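One simple way to form such groups is sketched below, under the assumption that each group should cover roughly the same total usage frequency (a greedy split over the low-frequency words; other partitions are equally possible).

```python
def group_low_frequency_words(words, usage_counts, m):
    """Partition the low-frequency word list into m groups whose usage-frequency
    sums are approximately equal."""
    total = sum(usage_counts[w] for w in words)
    target = total / m
    groups, current, acc = [], [], 0.0
    for w in words:
        current.append(w)
        acc += usage_counts[w]
        if acc >= target and len(groups) < m - 1:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)   # last group takes the remainder
    return groups
```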
Correspondingly, m dimension reduction matrixes (not shown in fig. 8) are further respectively arranged between the hidden layer and the low-frequency predictor function corresponding to the low-frequency vocabulary, and each dimension reduction matrix corresponds to one group.
Taking a group as an example, after the hidden layer outputs the second context information s (t), s (t) passes through the corresponding dimension reduction matrix of the group to reduce the dimension of s (t). Then, the language model inputs the reduced s (t) into a low-frequency prediction function, the low-frequency prediction function predicts the possibility that a plurality of low-frequency words in the group respectively belong to the words to be predicted based on the reduced s (t), and obtains the corresponding possibility distribution of the group, and the vector dimension corresponding to the possibility distribution is the same as the dimension of the reduced s (t).
For example, the likelihood distribution logits_i output for the i-th group of the low-frequency word list can be expressed as:

logits_i = (proj_i · s(t) + biasp_i) × tail_i + bias_i    (Formula 4)

where i is a natural number from 1 to m; proj_i is the dimension reduction matrix corresponding to the i-th group; tail_i is the matrix formed by the vectors of the words in the i-th group; biasp_i is a preset first offset vector; and bias_i is a preset second offset vector.
The above explanation takes one group as an example; each group is processed in the same way. The number of words differs between groups, so the dimensionality of the dimension reduction matrix corresponding to each group also differs, but the vector dimension of the resulting likelihood distribution is the same for every group. For example, suppose s(t) has 1024 dimensions, and suppose group 1 contains 10,000 words while group 2 contains 20,000 words; the dimension reduction matrix corresponding to group 1 may reduce to 512 dimensions, and accordingly the vector dimension of the likelihood distribution corresponding to group 1 is 512 dimensions; the dimension reduction matrix of group 2 may reduce to 216 dimensions, while the vector dimension of the likelihood distribution for group 2 is also 512 dimensions.
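Reading Formula 4 with the shapes below, the per-group computation can be sketched as follows. These shapes are an assumption made for illustration, chosen so that the formula yields one score per word in the group; the patent itself only names the matrices without fixing their dimensions.

```python
import numpy as np

def group_logits(s_t, proj_i, biasp_i, tail_i, bias_i):
    """Formula 4 for one group of the low-frequency word list.

    s_t:     hidden output s(t), shape (hidden_dim,)
    proj_i:  dimension reduction matrix, shape (reduced_dim, hidden_dim)
    biasp_i: first offset vector, shape (reduced_dim,)
    tail_i:  matrix of word vectors for the i-th group, shape (reduced_dim, n_words_in_group)
    bias_i:  second offset vector, shape (n_words_in_group,)
    """
    reduced = proj_i @ s_t + biasp_i   # dimension-reduced s(t)
    return reduced @ tail_i + bias_i   # possibility scores for the group's words
```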
Correspondingly, based on the vectors of the probability distributions corresponding to the groups, the probability distributions for representing that the words in all the groups respectively belong to the words to be predicted can be constructed.
And S707, constructing a total probability distribution representing that each word in the total word list respectively belongs to the word to be predicted based on the second possibility corresponding to the high-frequency word list and the third possibility corresponding to the low-frequency word list.
Wherein, the total probability distribution comprises the probability that each word in the total word list respectively belongs to the word to be predicted.
As can be seen from fig. 8, the dimension of the low-frequency logits differs from that of the high-frequency logits, and each of them can only represent the possibilities for part of the words in the total word list. Therefore, the low-frequency logits and the high-frequency logits are combined into a single vector whose dimension equals the number of words in the total word list; this in effect constructs a logits vector representing the possibility that each word in the total word list belongs to the word to be predicted, and, for distinction, the constructed logits are called the total logits.
S708, the corresponding possibility of the plurality of words in the total possibility distribution is normalized, and the probability distribution representing that each word in the total word list belongs to the word to be predicted is obtained.
As shown in fig. 8, the total logits characterizing the possibility that each word in the total word list respectively belongs to the word to be predicted are input to the softmax function. Correspondingly, the probability distribution output by the softmax function includes the probability that each word in the total word list belongs to the word to be predicted.
In order to facilitate understanding of the scheme of the present application, an application scenario to which the embodiments of the present application are applied is described. Fig. 9 is a schematic diagram illustrating a composition of an application scenario to which the word prediction method of the present application is applied.
Fig. 9 illustrates the application scenario as a speech recognition scenario. As can be seen from fig. 9, the scenario includes a speech recognition system comprising: a computer device 901, a data statistics server 902 and a speech recognition server 903.
The computer device may be a server for analyzing the probability of occurrence of the candidate sentences in the speech recognition system, and the language model mentioned in any of the above embodiments of the present application may be preset in the computer device.
The data statistics server may provide a basis for the computer device to determine the vocabulary.
The speech recognition server and the computer device carrying the language model are illustrated in fig. 9 as two separate devices, but it is understood that, in practical applications, the computer device and the speech recognition server may be the same device.
As can be seen from fig. 9, the user terminal 904 may transmit the voice to be recognized as input by the user to the voice recognition server 903 as shown in step S91.
The speech recognition server 903 may convert the speech to be recognized into a plurality of candidate sentence texts. In order to determine which of the candidate sentence texts conform best to human language, that is, which candidate sentences are the relatively more accurate results of the speech recognition, the speech recognition server sends the candidate sentence texts converted from the speech to be recognized to the computer device 901, as shown in step S92 in fig. 9.
Accordingly, the computer device 901 will, according to the solution introduced in the foregoing embodiment, for each candidate sentence text, sequentially take each word in the candidate sentence text as the current word, and predict the probability distribution of the word to be predicted, in which each word in the word list belongs to the current word. Then, based on the predicted probability distribution and in combination with each word in the candidate sentence text, the occurrence probability of the candidate sentence text can be analyzed, as shown in step S93.
Then, the computer apparatus 901 transmits the predicted occurrence probability of each candidate sentence text to the speech recognition server 903 as shown in step S94.
The speech recognition server 903 ranks the candidate sentence texts in descending order of occurrence probability and returns the ranked candidate sentence texts to the user terminal, so that the user can quickly select the sentence text corresponding to the speech, as shown in step S95.
It is understood that fig. 9 is only an example of an application scenario, but it is understood that there are many possible application scenarios to which the scheme of the embodiment of the present application is applicable, and the present application is not limited to this.
On the other hand, the application also provides a word prediction device. For example, referring to fig. 10, which shows a schematic structural diagram of an embodiment of a word prediction apparatus of the present application, the word prediction apparatus of the present application is suitable for a computer device of the present application, and the apparatus may include:
an input acquisition unit 1001 configured to acquire a current word used for prediction and first context information that a word sequence before the current word has;
a domain prediction unit 1002, configured to determine, based on the current word and the first context information, probabilities that words to be predicted after the current word respectively belong to multiple different domains;
a first prediction unit 1003, configured to, for each of the domains, determine, based on the current word and the first context information, a first possibility that each word in a word list belongs to the word to be predicted, where the first possibility is a possibility that a word in the word list belongs to the word to be predicted when the word to be predicted belongs to the domain; the word list is a set which is constructed in advance and contains a plurality of words;
the second prediction unit 1004 is configured to determine, according to probabilities that the words to be predicted respectively belong to a plurality of different fields and first probabilities that words in the word list corresponding to each field respectively belong to the words to be predicted, second probabilities that the words in the word list respectively belong to the words to be predicted.
Optionally, the word prediction apparatus may further include:
and the normalization unit is used for normalizing the second possibility that each word in the word list respectively belongs to the word to be predicted after the second prediction unit determines the second possibility that each word in the word list respectively belongs to the word to be predicted, so as to obtain the probability distribution that each word in the word list respectively belongs to the word to be predicted.
Optionally, in order to reduce memory usage in prediction and improve prediction efficiency, in the word prediction apparatus of the present application, the word lists in the first prediction unit and the second prediction unit are high-frequency word lists, and the high-frequency word lists are formed by a plurality of words with high frequency of use in the total word list. In this case, reference may be made to fig. 11, which shows a schematic structural diagram of a still further embodiment of the word prediction apparatus of the present application, and the apparatus of the present embodiment is different from the apparatus of the previous embodiment in that the apparatus may further include:
a third prediction unit 1005, configured to determine, based on the current word and the first context information, a third possibility that each word in a low-frequency vocabulary belongs to the word to be predicted, where the low-frequency vocabulary is formed by a plurality of words in a total vocabulary that do not belong to the high-frequency vocabulary, the total vocabulary is a pre-constructed set including the plurality of words, and the total number of words in the total vocabulary is greater than the total number of words in the high-frequency vocabulary;
a prediction combining unit 1006, configured to construct a probability that each word in the total vocabulary respectively belongs to the word to be predicted according to a second probability that each word in the high-frequency vocabulary respectively belongs to the word to be predicted and a third probability that each word in the low-frequency vocabulary respectively belongs to the word to be predicted.
Optionally, in an embodiment of the above apparatus, the apparatus may further include:
a context conversion unit, configured to determine, after the input obtaining unit obtains the current word and the first context information, second context information used for representing a semantic relationship between the current word and a word sequence before the current word based on the current word and the first context information;
the domain prediction unit is specifically configured to determine, based on the second context information, probabilities that words to be predicted after the current word respectively belong to a plurality of different domains;
the first prediction unit is specifically configured to determine, for each of the fields, a first possibility that each word in a word list belongs to the word to be predicted, based on the second context information.
Further, the domain prediction unit includes:
and the domain prediction subunit is used for determining the probability that the words to be predicted behind the current word respectively belong to a plurality of different domains by utilizing a pre-trained domain distribution model, and the domain distribution model is obtained by training based on a plurality of statement samples.
In one implementation, the input obtaining unit may include:
the input acquisition subunit is used for acquiring a word vector of a current word for prediction and first context information determined by a pre-trained language model for the last time, wherein the language model comprises the domain distribution model and pre-estimation functions corresponding to the multiple different domains, and the domain distribution model and the pre-estimation functions in the language model and the language model are obtained by uniformly training multiple sentence samples;
correspondingly, the first prediction unit comprises:
and the first prediction subunit is used for respectively inputting the second context information into the prediction functions corresponding to the fields and obtaining a first possibility distribution output by each prediction function, wherein the first possibility distribution comprises first possibilities that each word in a word list respectively belongs to the prediction words.
Optionally, the second prediction unit is specifically configured to perform weighted summation based on the probabilities corresponding to the respective fields and the first likelihood distribution output by the predictor function corresponding to the respective fields, so as to obtain a second likelihood distribution, where the second likelihood distribution includes second likelihoods that the respective words in the word list respectively belong to the words to be predicted.
For ease of understanding, refer to fig. 12, which shows a schematic structural diagram of a computer device in an embodiment of the present application. In fig. 12, the computer device 1200 may include: a processor 1201, a memory 1202, a communication interface 1203, an input unit 1204, a display 1205 and a communication bus 1206.
The processor 1201, the memory 1202, the communication interface 1203, the input unit 1204, and the display 1205 are all in communication with each other via a communication bus 1206.
In this embodiment, the processor 1201 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or other programmable logic device.
The processor may call the program stored in the memory 1202, and in particular, the processor may perform the operations performed by the computer device side in fig. 1 and fig. 9.
The memory 1202 is used for storing one or more programs, which may include program codes including computer operation instructions, and in this embodiment, the memory stores at least the programs for implementing the following functions:
acquiring a current word for prediction and first context information of a word sequence before the current word;
determining the probability that the words to be predicted behind the current word respectively belong to a plurality of different fields based on the current word and the first context information;
for each field, determining a first possibility that each word in a word list respectively belongs to the word to be predicted based on the current word and the first context information, wherein the first possibility is a possibility that a word in the word list belongs to the word to be predicted under the condition that the word to be predicted belongs to the field; the word list is a set which is constructed in advance and contains a plurality of words;
and determining second possibility that each word in the word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the word list corresponding to each field respectively belongs to the word to be predicted.
In one possible implementation, the memory 1202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created during use of the computer.
Further, the memory 1202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The communication interface 1203 may be an interface of a communication module, such as an interface of a GSM module.
The computer device may further include the display 1205 and the input unit 1204; the display 1205 includes a display panel, such as a touch display panel; the input unit may be a touch sensing unit, a keyboard, or the like.
Of course, the computer device structure shown in fig. 12 does not constitute a limitation to the computer device in the embodiment of the present application, and the computer device may include more or less components than those shown in fig. 12 or some components in combination in practical applications.
In another aspect, the present application further provides a storage medium having stored therein computer-executable instructions, which when loaded and executed by a processor, implement the word prediction method as described in any one of the above embodiments.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The word prediction method, the word prediction device and the computer equipment can be applied to any fields of intelligent home, intelligent wearable equipment, virtual assistants, intelligent sound boxes, intelligent marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, intelligent medical treatment, intelligent customer service and the like.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, a plurality of modifications and embellishments can be made without departing from the principle of the present invention, and these modifications and embellishments should also be regarded as the protection scope of the present invention.

Claims (11)

1. A method of word prediction, comprising:
acquiring a current word for prediction and first context information of a word sequence before the current word;
determining the probability that the words to be predicted behind the current word respectively belong to a plurality of different fields based on the current word and the first context information;
for each field, determining a first possibility that each word in the high-frequency word list respectively belongs to the word to be predicted based on the current word and the first context information, specifically comprising: respectively inputting second context information into pre-trained high-frequency estimation functions corresponding to each field, and obtaining first probability distribution output by each high-frequency estimation function, wherein the first probability distribution comprises first probability that each word in the high-frequency word list respectively belongs to the predicted word; each high-frequency estimation function is obtained by training a plurality of statement samples; the second context information is determined based on the current word and the first context information and is used for representing the semantic relation between the current word and the word sequence before the current word; the first possibility is the possibility that a word in the high-frequency word list belongs to the word to be predicted under the condition that the word to be predicted belongs to the field; the high-frequency word list consists of words which are sequenced at the front designated position by using frequency in the total word list; the total word list is a set which is constructed in advance and contains a plurality of words, and the total number of the words in the total word list is more than the total number of the words in the high-frequency word list;
determining second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the high-frequency word list corresponding to each field respectively belongs to the word to be predicted, specifically comprising: carrying out weighted summation based on the probability corresponding to each field and the first probability distribution output by the high-frequency estimation function corresponding to each field to obtain second probability distribution; the second possibility distribution comprises second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted;
determining a third possibility that each word in the low-frequency word list belongs to the word to be predicted respectively based on the current word and the first context information, specifically comprising: inputting the second context information into a pre-trained low-frequency estimation function, and obtaining a third probability distribution output by the low-frequency estimation function; the third likelihood distribution comprises a third likelihood that each word in the low-frequency word list belongs to a word to be predicted respectively; the low-frequency prediction function is obtained by training a plurality of statement samples; the low-frequency word list is formed by a plurality of words which do not belong to the high-frequency word list in the total word list;
according to the second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted and the third possibility that each word in the low-frequency word list respectively belongs to the word to be predicted, the possibility that each word in the total word list respectively belongs to the word to be predicted is established;
and normalizing the possibility that each word in the total word list respectively belongs to the word to be predicted to obtain the probability distribution that each word in the total word list respectively belongs to the word to be predicted.
2. The word prediction method according to claim 1, wherein obtaining first context information that a word sequence before the current word has comprises:
and acquiring the first context information which is determined by the language model for the last time based on a pre-trained language model, wherein the language model is obtained by training a plurality of statement samples.
3. The word prediction method according to claim 2, wherein the determining, based on the current word and the first context information, probabilities that words to be predicted after the current word respectively belong to a plurality of different domains comprises:
converting the current word and the first context information into second context information representing semantic relation between the current word and a word sequence before the current word through the language model;
and inputting the second context information into a pre-trained domain distribution model so as to determine the probability that the words to be predicted respectively belong to different domains through the domain distribution model, wherein the domain distribution model is obtained based on training of a plurality of statement samples.
4. The word prediction method according to claim 3, wherein said converting, by the language model, the current word and the first context information into second context information representing a semantic relationship between the current word and a word sequence preceding the current word comprises:
acquiring the first context information output by a previous hidden layer of a current hidden layer corresponding to the current word in the language model;
and inputting the first context information and the current word into the current hidden layer to obtain the second context information output by the current hidden layer.
5. The word prediction method according to claim 1, wherein the low-frequency vocabulary includes m groups of sub-low-frequency vocabularies, each group of sub-low-frequency vocabularies is formed by a plurality of words in the low-frequency vocabulary, the sum of the use frequencies of the words in each group of sub-low-frequency vocabularies is equal, and m is a natural number greater than or equal to 2;
the inputting the second context information into a pre-trained low-frequency pre-estimation function and obtaining a third probability distribution output by the low-frequency pre-estimation function includes:
for each group of sub low-frequency word lists, inputting the second context information into a preset dimension reduction matrix corresponding to the group of sub low-frequency word lists to obtain dimension-reduced second context information; wherein, a group of sub-low frequency word lists corresponds to a dimension reduction matrix;
and inputting the second context information after dimension reduction into the low-frequency prediction function to obtain a third possibility that each word in the group of sub low-frequency word lists output by the low-frequency prediction function belongs to the word to be predicted respectively, and obtaining a third possibility distribution corresponding to each group of sub low-frequency word lists so as to obtain a third possibility distribution corresponding to each word in each group of sub low-frequency word lists respectively.
6. The word prediction method of claim 1, wherein obtaining the current word for prediction comprises:
acquiring at least one candidate sentence text corresponding to the voice to be recognized or the text to be translated;
for each candidate sentence text, sequentially taking each word in the candidate sentence text as the current word;
the word prediction method further includes:
and aiming at each candidate sentence text, obtaining the probability that the candidate sentence text is a correct sentence according to the probability distribution corresponding to the next word after each word in the candidate sentence text so as to obtain the probability corresponding to at least one candidate sentence text.
7. The word prediction method according to claim 6, further comprising:
sorting the probabilities corresponding to the at least one candidate sentence text from high to low;
and displaying the at least one ordered candidate sentence text.
8. The word prediction method of claim 1, wherein obtaining the current word for prediction comprises:
taking the last word in the input sentence as the current word;
the word prediction method further includes:
and screening at least one candidate word to be displayed from the total word list according to the probability distribution of the words in the total word list after the words belong to the current word respectively, and determining the display sequence of the at least one candidate word.
9. A speech recognition system, comprising:
the voice recognition server is used for acquiring a plurality of candidate sentence texts corresponding to the voice to be recognized, which is input by the user terminal;
a computer device for receiving a plurality of the candidate sentence texts;
for each candidate sentence text, sequentially taking each word in the candidate sentence text as a current word for prediction;
aiming at each current word, acquiring the current word and first context information of a word sequence before the current word;
based on the current word and the first context information, determining the probability that the words to be predicted behind the current word respectively belong to a plurality of different fields;
for each field, determining a first possibility that each word in the high-frequency word list respectively belongs to the word to be predicted based on the current word and the first context information, specifically comprising: respectively inputting second context information into pre-trained high-frequency estimation functions corresponding to various fields, and obtaining first possibility distribution output by each high-frequency estimation function, wherein the first possibility distribution comprises first possibility that each word in the high-frequency word list respectively belongs to the word to be predicted; each high-frequency estimation function is obtained by training a plurality of statement samples; the second context information is determined based on the current word and the first context information and is used for representing the semantic relation between the current word and the word sequence before the current word; the first possibility is the possibility that a word in the high-frequency word list belongs to the word to be predicted under the condition that the word to be predicted belongs to the field; the high-frequency word list is composed of words which are ordered by using frequency in a front designated position in the total word list; the total word list is a set which is constructed in advance and contains a plurality of words, and the total number of the words in the total word list is more than that of the words in the high-frequency word list;
determining second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted according to the probability that the word to be predicted respectively belongs to a plurality of different fields and the first possibility that each word in the high-frequency word list corresponding to each field respectively belongs to the word to be predicted, specifically comprising: carrying out weighted summation based on the probability corresponding to each field and the first probability distribution output by the high-frequency estimation function corresponding to each field to obtain second probability distribution; the second possibility distribution comprises second possibility that each word in the high-frequency word list respectively belongs to the word to be predicted;
determining a third possibility that each word in the low-frequency word list belongs to the word to be predicted respectively based on the current word and the first context information, specifically comprising: inputting the second context information into a pre-trained low-frequency estimation function, and obtaining a third probability distribution output by the low-frequency estimation function; the third likelihood distribution comprises a third likelihood that each word in the low-frequency word list belongs to a word to be predicted respectively; the low-frequency prediction function is obtained by training a plurality of statement samples; the low-frequency word list is formed by a plurality of words which do not belong to the high-frequency word list in the total word list;
according to the second possibility that each word in the high-frequency word list belongs to the word to be predicted and the third possibility that each word in the low-frequency word list belongs to the word to be predicted, obtaining the possibility that each word in the total word list belongs to the word to be predicted;
normalizing the possibility that each word in the total word list belongs to the word to be predicted to obtain the probability distribution of the word to be predicted over the total word list, so as to obtain the probability distribution corresponding to the next word after each word in the candidate sentence text;
obtaining the probability that the candidate sentence text is a correct sentence according to the probability distributions corresponding to the next word after each word in the candidate sentence text, so as to obtain the probability corresponding to each of the candidate sentence texts;
the voice recognition server is further configured to sort the probabilities corresponding to the candidate sentence texts from high to low, and feed back the sorted candidate sentence texts to the user terminal (illustrative sketches of this scoring and ranking flow are given after claim 11 below).
10. The speech recognition system of claim 9, further comprising:
the data statistics server is used for acquiring a plurality of words used by different users; the computer device is further used for receiving the plurality of words used by different users sent by the data statistics server, and determining the total word list based on the plurality of words used by different users;
or, alternatively,
the data statistics server is used for acquiring a plurality of words used by different users, determining the total word list based on the plurality of words used by different users, and sending the total word list to the computer device.
11. A storage medium having stored thereon computer-executable instructions that, when loaded and executed by a processor, carry out a method of word prediction according to any one of claims 1 to 8.
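
The following is a minimal, illustrative Python sketch (not the patent's implementation) of the next-word scoring flow recited in claim 9: the probabilities of the word to be predicted belonging to each field weight the per-field first possibility distributions over the high-frequency word list, the low-frequency estimation function covers the remaining words of the total word list, and the merged possibilities are normalized into one distribution. All names (next_word_distribution, domain_probs, high_freq_dists, low_freq_dist) are hypothetical, and simple sum-normalization is assumed where the claim only states that the possibilities are normalized.

import numpy as np

def next_word_distribution(domain_probs, high_freq_dists, low_freq_dist):
    # domain_probs    : probabilities that the word to be predicted belongs to
    #                   each field, shape (num_fields,), summing to 1.
    # high_freq_dists : first possibility distributions over the high-frequency
    #                   word list, one row per field, shape (num_fields, num_high_freq).
    # low_freq_dist   : third possibility distribution over the low-frequency
    #                   word list, shape (num_low_freq,).
    domain_probs = np.asarray(domain_probs, dtype=float)
    high_freq_dists = np.asarray(high_freq_dists, dtype=float)
    low_freq_dist = np.asarray(low_freq_dist, dtype=float)

    # Weighted summation over the fields gives the second possibility for
    # every word in the high-frequency word list.
    second_possibility = domain_probs @ high_freq_dists

    # Concatenating the high- and low-frequency parts covers the total word
    # list; normalizing turns the possibilities into a probability distribution.
    merged = np.concatenate([second_possibility, low_freq_dist])
    return merged / merged.sum()

# Toy usage: 2 fields, a 3-word high-frequency list and a 2-word low-frequency list.
probs = next_word_distribution(
    domain_probs=[0.7, 0.3],
    high_freq_dists=[[0.5, 0.3, 0.2],
                     [0.1, 0.6, 0.3]],
    low_freq_dist=[0.05, 0.02],
)
print(probs, probs.sum())  # 5 values over the total word list, summing to 1.0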
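
A second sketch, under the same assumptions, shows how the probability of a candidate sentence text could be accumulated from the per-word next-word distributions and how the candidates could then be sorted from high to low. predict_next_word_distribution is a hypothetical stand-in for the flow sketched above; log-space accumulation and the floor value for unseen words are choices made here for numerical stability, not details stated in the claims.

import math
from typing import Callable, Dict, List, Sequence

def sentence_log_probability(
    words: Sequence[str],
    predict_next_word_distribution: Callable[[Sequence[str]], Dict[str, float]],
) -> float:
    # Take each word in turn as the current word, look up the probability of
    # the actual next word in the predicted distribution, and accumulate the
    # per-word probabilities in log space.
    log_prob = 0.0
    for i in range(len(words) - 1):
        dist = predict_next_word_distribution(words[: i + 1])
        log_prob += math.log(dist.get(words[i + 1], 1e-12))  # floor for unseen words
    return log_prob

def rank_candidates(
    candidates: List[Sequence[str]],
    predict_next_word_distribution: Callable[[Sequence[str]], Dict[str, float]],
) -> List[Sequence[str]]:
    # Sort the candidate sentence texts from most to least probable.
    return sorted(
        candidates,
        key=lambda words: sentence_log_probability(words, predict_next_word_distribution),
        reverse=True,
    )

# Toy usage with a uniform dummy predictor over a tiny total word list.
vocab = ["I", "like", "reading", "books"]
uniform = lambda context: {w: 1.0 / len(vocab) for w in vocab}
print(rank_candidates([["I", "like", "books"], ["I", "like", "reading", "books"]], uniform))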
CN201910740458.3A 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium Active CN110377916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910740458.3A CN110377916B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810942238.4A CN109117480B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium
CN201910740458.3A CN110377916B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810942238.4A Division CN109117480B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110377916A CN110377916A (en) 2019-10-25
CN110377916B true CN110377916B (en) 2022-12-16

Family

ID=64852831

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910740458.3A Active CN110377916B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium
CN201810942238.4A Active CN109117480B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201810942238.4A Active CN109117480B (en) 2018-08-17 2018-08-17 Word prediction method, word prediction device, computer equipment and storage medium

Country Status (1)

Country Link
CN (2) CN110377916B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444710B (en) * 2019-01-15 2023-04-18 阿里巴巴集团控股有限公司 Word segmentation method and word segmentation device
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method
CN110222578B (en) * 2019-05-08 2022-12-27 腾讯科技(深圳)有限公司 Method and apparatus for challenge testing of speak-with-picture system
CN110765239B (en) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
WO2021127987A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Polyphonic character prediction method and disambiguation method, apparatuses, device and computer readable storage medium
CN111680519B (en) * 2020-04-28 2023-04-07 平安科技(深圳)有限公司 Text translation method and device based on dimension reduction barrel model
CN111639160A (en) * 2020-05-29 2020-09-08 达闼机器人有限公司 Domain identification method, interaction method, electronic device and storage medium
CN113051936A (en) * 2021-03-16 2021-06-29 昆明理工大学 Method for enhancing Hanyue neural machine translation based on low-frequency word representation
CN114942986B (en) * 2022-06-21 2024-03-19 平安科技(深圳)有限公司 Text generation method, text generation device, computer equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8136050B2 (en) * 2003-11-21 2012-03-13 Nuance Communications, Inc. Electronic device and user interface and input method therefor
US9785630B2 (en) * 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
KR20190084865A (en) * 2015-11-09 2019-07-17 박태운 Sentence prediction input system
US11061948B2 (en) * 2016-09-22 2021-07-13 Verizon Media Inc. Method and system for next word prediction
CN108334496B (en) * 2018-01-30 2020-06-12 中国科学院自动化研究所 Man-machine conversation understanding method and system for specific field and related equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645066A (en) * 2008-08-05 2010-02-10 北京大学 Method for monitoring novel words on Internet
CN103869999A (en) * 2012-12-11 2014-06-18 百度国际科技(深圳)有限公司 Method and device for sorting candidate items generated by input method
CN103870001A (en) * 2012-12-11 2014-06-18 百度国际科技(深圳)有限公司 Input method candidate item generating method and electronic device
CN103544246A (en) * 2013-10-10 2014-01-29 清华大学 Method and system for constructing multi-emotion dictionary for internet
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN108304424A (en) * 2017-03-30 2018-07-20 腾讯科技(深圳)有限公司 Text key word extracting method and text key word extraction element
CN107424612A (en) * 2017-07-28 2017-12-01 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN107506414A (en) * 2017-08-11 2017-12-22 武汉大学 A kind of code based on shot and long term memory network recommends method
CN107621891A (en) * 2017-09-28 2018-01-23 北京新美互通科技有限公司 A kind of text entry method, device and electronic equipment
CN107908616A (en) * 2017-10-18 2018-04-13 北京京东尚科信息技术有限公司 The method and apparatus of anticipation trend word

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bilingual recursive neural network based data selection for statistical machine translation; Derek F. Wong et al.; Knowledge-Based Systems; 2016-09-15; 15-24 *
Phrase-Level Class based Language Model for Mandarin Smart Speaker Query Recognition; Yiheng Huang et al.; arXiv; 2019-09-02; 1-5 *
Research on Twitter sentiment classification and visualization; Zhu Wenjun; China Master's Theses Full-text Database (Information Science and Technology); 2013-12-15 (No. S2); I138-1482 *

Also Published As

Publication number Publication date
CN109117480B (en) 2022-05-27
CN110377916A (en) 2019-10-25
CN109117480A (en) 2019-01-01

Similar Documents

Publication Publication Date Title
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
Young et al. Recent trends in deep learning based natural language processing
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110444199B (en) Voice keyword recognition method and device, terminal and server
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
Kim et al. Two-stage multi-intent detection for spoken language understanding
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
US10606946B2 (en) Learning word embedding using morphological knowledge
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
US20150095017A1 (en) System and method for learning word embeddings using neural language models
KR20170061016A (en) Device and method of data recognition model construction, and data recognition devicce
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN114676234A (en) Model training method and related equipment
CN113505198B (en) Keyword-driven generation type dialogue reply method and device and electronic equipment
CN112417894A (en) Conversation intention identification method and system based on multi-task learning
CN112131345B (en) Text quality recognition method, device, equipment and storage medium
Chou et al. Exploiting annotators’ typed description of emotion perception to maximize utilization of ratings for speech emotion recognition
CN111694941B (en) Reply information determining method and device, storage medium and electronic equipment
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN113362809B (en) Voice recognition method and device and electronic equipment
CN114610887A (en) Seat illegal speech recognition method and device, electronic equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant