WO2023280265A1 - Word or sentence generation method, model training method and related device - Google Patents


Info

Publication number
WO2023280265A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
probability
character string
words
word
Prior art date
Application number
PCT/CN2022/104334
Other languages
French (fr)
Chinese (zh)
Inventor
肖镜辉
刘群
吴海腾
谢武锋
熊元峰
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023280265A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods
    • G06F 3/0237 Character input methods using prediction or retrieval techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods

Definitions

  • the present application relates to the technical field of input methods, in particular to a method for generating words and sentences, a method for training models and related equipment.
  • the input method editor is an essential application on client devices and is widely used on desktop computers, notebooks, mobile phones, tablets, smart TVs, in-vehicle computers and other devices; the user's daily activities, such as searching for places, finding restaurants, chatting and making friends, and travel planning, are largely carried out through input behaviors, so the data of the input method editor can be used to accurately characterize users. Input method editors therefore have great strategic significance in the Internet field.
  • during input, the input method editor generates words and sentences (a word or a sentence) and prompts them for the user to choose.
  • the accuracy of the generated words and sentences directly affects the accuracy and user experience of the input method editor; therefore, a method that can accurately generate words and sentences is needed.
  • Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment.
  • the method can improve the accuracy of generated words and sentences.
  • the first aspect of the embodiment of the present application provides a method for generating words and sentences, which can be applied to terminal devices or cloud servers, and specifically includes: obtaining a character string sequence, where the character string sequence includes M character strings and each character string indicates one or more candidate words; a character string can be understood as a combination of characters, which is a carrier of language information, carries pronunciation information, and is used to generate words or sentences; the form of a character string differs for different languages; taking Chinese as an example, a character string can include one pinyin or multiple pinyins, and M is a positive integer; and obtaining, through an encoder and according to the character string sequence, M first character string vectors, where each first character string vector corresponds to one of the M character strings; the encoder can be understood as a deep learning network model, and the network structure of the encoder is not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder may adopt the network structure of the encoder part of the Transformer network, or the network structure of a series of other networks derived from the encoder part of the Transformer network.
  • the character string sequence is encoded by the encoder to obtain the first character string vectors; each first character string vector is a representation of a character string that fuses the information of the entire character string sequence, not just the character string itself, that is, the first character string vector contains more information; therefore, calculating the first probability of candidate words based on the first character string vectors and generating the target words and sentences based on the first probability can improve the accuracy of the generated target words and sentences, thereby improving the accuracy of the input method.
  • obtaining M first character string vectors through the encoder according to the character string sequence includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and obtaining, through the encoder and according to the M first position vectors and the M second character string vectors, the M first character string vectors.
  • the Bert model needs to encode a word based on the position vector of the word, the vector of the word itself, a vector used to distinguish whether the word is in the first sentence or the second sentence, and vectors related to the separator "SEP" and the tag "CLS".
  • in contrast, in the embodiment of the present application, the first character string vector can be obtained by the encoder according to only two vectors, namely the first position vector and the second character string vector of the character string; therefore, the encoder in the embodiment of the present application needs to process fewer vectors and has higher encoding efficiency, thereby improving the response speed of the input method.
  • the encoder is trained based on the conversion task, where the conversion task is the task of converting a sequence of sample strings into sample words and sentences.
  • the encoder is used to convert the string into the first string vector, and then the first string vector is used to obtain the target words and sentences.
  • in the application phase, the function of the encoder is similar to its function during training based on the conversion task; therefore, using the encoder trained based on the conversion task to encode the character string sequence can improve the encoding accuracy of the encoder, thereby improving the accuracy of the input method.
  • obtaining the first probability of each candidate word indicated by the M character strings includes: obtaining, through a probability model and based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings.
  • the probability model is trained based on the conversion task.
  • the probability model and the encoder can be regarded as a whole, that is, a deep learning model: the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half; among them, the conversion task is the task of converting the sample character string sequence into sample words and sentences.
  • Obtaining the first probability of candidate words through the probability model can improve the accuracy of the first probability; and, similar to the encoder, the function of the probability model in the application phase is similar to its function during training based on the conversion task, so using the probability model trained based on the conversion task to calculate the first probability can improve the accuracy of the first probability, thereby improving the accuracy of the input method.
  • generating the target words and sentences includes: obtaining, through the Ngram model and according to the character string sequence, the third probability of each candidate word indicated by the M character strings, wherein, for any candidate word, the third probability of the candidate word represents the conditional probability of the candidate word occurring given that one or more preceding candidate words have occurred; and generating the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • the Viterbi algorithm is a dynamic programming algorithm, which is used to find the Viterbi path that is most likely to produce a sequence of observed events.
  • the Viterbi path can also be called the optimal path, and the Viterbi algorithm can also be called a finite state transducer (Finite State Transducers, FST) algorithm.
  • the first probability of a candidate word can be understood as the conditional probability of the candidate word given the character string sequence, and the third probability of the candidate word can be understood as the conditional probability of the current candidate word given other candidate words; so, in the process of generating the target words and sentences, both the first probability of candidate words and the third probability of candidate words calculated by the Ngram model are considered, which is conducive to generating target words and sentences with higher accuracy.
  • generating the target words and sentences includes: obtaining reference words from a reference dictionary, where the reference dictionary may include at least one of the following types of thesaurus: a basic thesaurus, a phrase thesaurus, a user's personal thesaurus, a hotspot thesaurus, and various domain thesauruses; the number of reference words may be one or more, and a reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are included in the character string sequence and their positions in the character string sequence are consecutive, where P is an integer greater than 1; calculating, based on the first probabilities of the P candidate words, the fourth probability of the reference word, where the fourth probability indicates the possibility of the user selecting the reference word when inputting the P reference character strings; there are many ways to calculate the fourth probability of a reference word, for example, the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word; and generating the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • the reference dictionary can provide words from various scenarios, new words or hot words as reference words to assist in the generation of the target words and sentences, which can make up for the shortcomings of the encoder and the probability model and improve the accuracy of the target words and sentences.
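  • a minimal sketch of the geometric-mean option mentioned above for the fourth probability of a reference word; the function name and example probability values are purely illustrative, not taken from the embodiment:

```python
import math

def fourth_probability(first_probs):
    """Fourth probability of a reference word: geometric mean of the first
    probabilities of the P candidate words that make up the reference word."""
    assert len(first_probs) > 1          # P is an integer greater than 1
    log_sum = sum(math.log(p) for p in first_probs)
    return math.exp(log_sum / len(first_probs))

# e.g. a reference word made of two candidate words with first probabilities 0.20 and 0.45
print(fourth_probability([0.20, 0.45]))  # ≈ 0.3
```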
  • generating the target words and sentences includes: obtaining, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and generating the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
  • the embodiment of the present application regards all the candidate words in a reference word as a whole, so that it is not necessary to calculate the conditional probabilities between the candidate words inside the reference word through the Ngram model, and only the fifth probability of the reference word needs to be calculated through the Ngram model; in the process of calculating the fifth probability of the reference word, the fifth probability of the first candidate word in the reference word can be calculated and used as the fifth probability of the reference word.
  • the target character string is a character string after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability of the candidate word indicated by the target character string occurring given that Q candidate words have occurred, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word, within the reference word, indicated by that reference character string.
  • the method further includes: prompting the target words and sentences as the preferred words and sentences, where the preferred words and sentences are the first words and sentences among the multiple words and sentences prompted by the input method.
  • generally, the terminal device will prompt multiple words and sentences; prompting the target words and sentences as the preferred words and sentences allows the target words and sentences, which the user is most likely to choose, to be preferentially prompted to the user, so as to improve the user's input efficiency.
  • the character string includes one pinyin or multiple pinyins.
  • this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
  • the second aspect of the embodiment of the present application provides a model training method, including: obtaining a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; obtaining, through an encoder and according to the sample character string sequence, K first sample character string vectors, where each first sample character string vector corresponds to one sample character string; obtaining, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and adjusting the encoder based on the second probability.
  • the character string, the encoder, the character string sequence and the second probability in the second aspect can be understood with reference to the descriptions of the character string, the encoder, the character string sequence and the first probability in the first aspect.
  • the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors; each first sample character string vector is a representation of a sample character string that fuses the information of the entire sample character string sequence, rather than representing only the sample character string itself, that is, the first sample character string vector contains more information; so, calculating the second probability of the target sample word based on the first sample character string vectors and adjusting the encoder based on the second probability can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
  • obtaining K first sample character string vectors through the encoder according to the sample character string sequence includes: obtaining K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and obtaining, through the encoder and according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors.
  • in the embodiment of the present application, the first sample character string vector can be obtained through the encoder according to only the second position vector and the second sample character string vector; the Bert model, in contrast, needs not only the position vector of the word and the vector of the word itself, but also the vector used to distinguish whether the word is in the first sentence or the second sentence and the vectors related to the separator "SEP" and the tag "CLS"; therefore, the encoder in the embodiment of the present application needs to process fewer vectors and has higher encoding efficiency, which can improve the training efficiency.
  • the sample candidate words indicated by each sample character string contain a target sample word, where the target sample word is equivalent to a sample label; correspondingly, adjusting the encoder based on the second probability includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • for example, if the sample character string sequence is "nuoyafangzhouhenbang", then for the sample character string "nuo", the corresponding sample candidate words include "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; if "诺" is the target sample word, then by adjusting the parameters of the encoder, the second probability of "诺" can be increased while the second probabilities of "糯" and "懦" are decreased.
  • the target sample words are preset, and during the training process, by adjusting the parameters of the encoder, the second probability of the target sample word is increased and/or the second probabilities of the sample candidate words other than the target sample word are decreased, so that the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words, thereby realizing the training of the encoder.
  • obtaining the second probability of each sample candidate word indicated by the K sample character strings includes: obtaining, through a probability model and based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; correspondingly, after the second probability of each sample candidate word indicated by the K sample character strings is obtained based on the K first sample character string vectors, the method further includes: adjusting the probability model based on the second probability.
  • Obtaining the second probability of the sample candidate words through the probability model can improve the accuracy of the second probability; and adjusting the probability model based on the second probability can improve the accuracy of the second probability output by the probability model.
  • the sample candidate words indicated by each sample character string contain a target sample word; adjusting the probability model based on the second probability includes: adjusting the parameters of the probability model so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the target sample words are preset, and during the training process, by adjusting the parameters of the probability model, the second probability of the target sample word is increased and/or the second probabilities of the sample candidate words other than the target sample word are decreased, so that the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words, thereby realizing the training of the probability model.
  • obtaining the sample character string sequence includes: obtaining K sample character strings in the sample character string sequence based on K target sample words.
  • Obtaining sample character strings based on target sample words can improve the efficiency of obtaining sample character strings.
  • the sample character string includes one pinyin or multiple pinyins.
  • this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
  • the third aspect of the embodiment of the present application provides a word and sentence generation device, including: a first acquisition unit, configured to acquire a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer; a first encoding unit, configured to obtain M first character string vectors through an encoder according to the character string sequence, where each first character string vector corresponds to one of the M character strings; a second acquisition unit, configured to obtain, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit, configured to generate target words and sentences based on the first probability, where the target words and sentences include M target words, and each target word is one of the one or more candidate words indicated by a corresponding character string.
  • the first encoding unit is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain, through the encoder and according to the M first position vectors and the M second character string vectors, the M first character string vectors.
  • the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
  • the second acquisition unit is configured to obtain, through a probability model and based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings, and the probability model is trained based on the conversion task; the conversion task is the task of converting the sample character string sequence into sample words and sentences.
  • the generation unit is configured to obtain, through the Ngram model and according to the character string sequence, the third probability of each candidate word indicated by the M character strings; and to generate the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • the generation unit is configured to obtain reference words from the reference dictionary, where a reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are included in the character string sequence and their positions in the character string sequence are consecutive, and P is an integer greater than 1; to calculate the fourth probability of the reference word based on the first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • the generation unit is configured to obtain, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings and the fifth probability of the reference word; and to generate the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
  • the target character string is a character string after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability of the candidate word indicated by the target character string occurring given that Q candidate words have occurred, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word, within the reference word, indicated by that reference character string.
  • the device further includes a prompting unit, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  • the character string includes one pinyin or multiple pinyins.
  • the fourth aspect of the embodiment of the present application provides a model training device, including: a third acquisition unit, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; a second encoding unit, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, where each first sample character string vector corresponds to a sample character string; a fourth acquisition unit, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit, configured to adjust the encoder based on the second probability.
  • the second encoding unit is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain, through the encoder and according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors.
  • the sample candidate words indicated by each sample character string contain a target sample word; the adjustment unit is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the fourth acquisition unit is configured to obtain, through a probability model and based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; the adjustment unit is also configured to adjust the probability model based on the second probability.
  • the sample candidate words indicated by each sample character string contain a target sample word; the adjustment unit is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the third acquiring unit is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
  • the sample character string includes one pinyin or multiple pinyins.
  • the fifth aspect of the embodiment of the present application provides a computer device, including: one or more processors and a memory, where computer-readable instructions are stored in the memory; the one or more processors read the computer-readable instructions to cause the computer device to implement the method in any implementation manner of the first aspect.
  • the sixth aspect of the embodiment of the present application provides a training device, including: one or more processors and a memory, where computer-readable instructions are stored in the memory; the one or more processors read the computer-readable instructions to cause the training device to implement the method in any implementation manner of the second aspect.
  • the seventh aspect of the embodiment of the present application provides a computer-readable storage medium including computer-readable instructions; when the computer-readable instructions are run on a computer, the computer is caused to execute the method in any implementation manner of the first aspect or the second aspect.
  • the eighth aspect of the embodiment of the present application provides a chip including one or more processors; some or all of the processors are configured to read and execute the computer program stored in the memory, so as to execute the method in any possible implementation manner of the first aspect or the second aspect above.
  • optionally, the chip includes a memory, and the processor is connected to the memory through a circuit or wires; further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors can also implement some steps in the above method through dedicated hardware.
  • for example, the processing related to the neural network model can be performed by a dedicated neural network processor or a graphics processor.
  • the method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
  • the ninth aspect of the embodiments of the present application provides a computer program product, where the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the method in any one of the above first aspect or second aspect.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application
  • Fig. 2 is the schematic diagram of word sequence in the embodiment of the present application.
  • Fig. 3 is the schematic diagram of pre-training language model
  • FIG. 4 is a schematic diagram of the system architecture of the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of the model training method provided by the embodiment of the present application.
  • Fig. 6 is a comparative schematic diagram of the original input of the encoder and the Bert model in the embodiment of the present application;
  • Fig. 7 is a comparative schematic diagram of the direct input of the encoder and the Bert model in the embodiment of the present application;
  • FIG. 8 is a schematic diagram of an embodiment of a method for generating words and sentences provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an embodiment of candidate words in the embodiment of the present application.
  • Fig. 10 is a schematic diagram of the combination of the first probability and the third probability in the embodiment of the present application.
  • FIG. 11 is a schematic diagram of an embodiment of generating target words and sentences in the embodiment of the present application.
  • FIG. 12 is a schematic diagram of the use of the reference dictionary in the embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a device for generating words and sentences provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a model training device provided in an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment.
  • the method can improve the accuracy of generated words and sentences, thereby improving the accuracy of input methods and user experience.
  • the embodiment of the present application can be applied to the input scenario shown in FIG. 1 .
  • the user can input a character string on the terminal device.
  • the input method editor (Input Method Editor, IME) deployed inside the terminal device receives the character string entered by the user, generates corresponding words and sentences based on the character string, and then prompts the words and sentences to the user.
  • a character string can be understood as a combination of characters, which is a carrier of language information and is used to generate words and sentences; the words and sentences may be one word or multiple words, and a single word may also constitute the words and sentences.
  • the above-mentioned input scenario can be an input scenario of multiple languages such as Chinese, Japanese and Korean; the form of the character string differs for different languages; taking Chinese as an example, the character string can include one pinyin or multiple pinyins; specifically, as shown in Figure 1, when the character string nuoyafangzhou is input, the words and sentences suggested by the input method editor are "Noah's Ark is great", "Noah's Ark" and "Noah".
  • the terminal device may be a desktop computer, a notebook computer, a tablet computer, a smart phone, or a smart TV.
  • the terminal device may also be any other device that can deploy an input method editor such as a vehicle computer.
  • the suggested words and sentences include "Noah's Ark is great"; it can be seen that the suggested words and sentences are relatively accurate, which can obviously improve the user's input efficiency and user experience.
  • an embodiment of the present application provides a method for generating words and sentences, which uses an encoder to encode a character string (such as pinyin) input by the user into a character string vector, and then generates a target based on the character string vector phrases to improve the accuracy of the generated phrases.
  • Input method preferred word: when the user enters a character string, the input method editor provides the user with a candidate list, which is used to prompt words and sentences for the user; the first entry in the candidate list is called the input method preferred word.
  • Transformer network structure: a deep neural network structure including an input layer, a self-attention layer, a feed-forward layer, a normalization layer and other substructures.
  • Bert model: a model with a Transformer network structure; on the basis of the Transformer network structure, it proposes a "pre-training + fine-tuning" learning paradigm and designs two pre-training tasks, Masked Language Model and Next Sentence Prediction.
  • Ngram model: a model widely used in Chinese input method tasks.
  • Zero probability problem: in the process of using the Ngram model, in some cases the calculated probability value is zero, and zero-valued probabilities cause many problems in engineering implementation; for example, because of zero probabilities, it is impossible to compare probabilities by size, and only random results can be returned.
  • Smoothing algorithm: an algorithm designed to solve the zero probability problem of the Ngram model; when it judges that there is a zero probability risk, the smoothing algorithm usually uses a stable but inaccurate low-order Ngram model probability to fit the unstable but accurate high-order Ngram model probability.
  • Viterbi algorithm: a dynamic programming algorithm for finding the Viterbi path, that is, the sequence of hidden states that is most likely to produce the sequence of observed events, especially in the context of Markov information sources and hidden Markov models; it is often used in speech recognition, keyword recognition, computational linguistics and bioinformatics; the Viterbi algorithm can also be called the Finite State Transducers (FST) algorithm.
  • the Ngram model is introduced in detail below.
  • the Ngram model makes the Markov assumption that the probability of the current word is only related to a limited number of N words.
  • when N takes different values, a series of specific Ngram models are obtained.
  • the smoothing algorithm can be simply understood as follows: when the probability given by the Ngram model is 0, the product of a certain weight and the probability given by the (N-1)-gram model is used as the probability of the N-gram model.
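  • a minimal sketch of this kind of back-off smoothing for a bigram model; the function name, count structures and the back-off weight value are illustrative assumptions, not details specified by the embodiment:

```python
def backoff_bigram_prob(w_prev, w, bigram_counts, unigram_counts, total_words, alpha=0.4):
    """P(w | w_prev) with simple back-off: if the bigram was never seen,
    fall back to alpha * unigram probability instead of returning 0."""
    if bigram_counts.get((w_prev, w), 0) > 0:
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
    # zero-probability risk: back off to the lower-order (unigram) model
    return alpha * unigram_counts.get(w, 0) / total_words
```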
  • the Ngram model is described below with a specific example.
  • the bottom line represents pinyin nodes
  • the upper four lines of nodes are Chinese characters corresponding to pinyin nodes. These Chinese characters constitute various possibilities for user input.
  • the probability of each Chinese character node can be calculated by using the Ngram model; since the probability of a Chinese character node is actually the conditional probability of that node occurring given the previous N Chinese character nodes, this probability can also be regarded as the path transition probability between Chinese character nodes.
  • the Ngram model can be used to calculate the probabilities P (Ya
  • The following describes the pretrained language model (PLM) and the Bert model.
  • the pre-trained language model is an important general model that has emerged in recent years in the field of natural language processing (NLP) and is an important technical means of artificial intelligence (AI).
  • the pre-trained language model mainly includes three aspects: network structure, learning paradigm and (pre-)training tasks.
  • the network structure of the pre-trained language model adopts the network structure of the encoder part of the Transformer network.
  • the encoder part includes an input layer, a self-attention layer, a feed-forward layer, and a normalization layer.
  • the Bert model is based on the encoder part of the Transformer network and uses the "pre-training + fine-tuning" learning paradigm, that is, it first learns a basic model using pre-training tasks on a large amount of unlabeled corpus, and then fine-tunes the basic model on the target task.
  • the pre-training tasks mainly refer to the Masked Language Model task and the Next Sentence Prediction task.
  • the system architecture of the embodiment of the present application includes a training phase and an application phase, which will be described below using Chinese as an example.
  • the Chinese character corpus is passed through the word segmentation device to obtain the word segmentation data.
  • the word segmentation data is used to train the Ngram model; the corpus is converted from Chinese characters to pinyin through a phonetic converter to obtain the pinyin corpus.
  • the encoder is trained to encode pinyin into vectors; since the encoder also uses the encoder part of the Transformer network, it is similar to the existing Bert model but is used for encoding pinyin, so the encoder can also be called a Pinyin Bert model.
  • the Pinyin Bert model is combined with the Ngram model, and then combined with various external resource thesauruses, such as a basic thesaurus, a phrase thesaurus, a user thesaurus and various domain thesauruses (Figure 4 shows domain thesaurus 1, domain thesaurus 2 and domain thesaurus 3), to obtain an input engine, which is used to prompt corresponding words and sentences in response to the pinyin input by the user.
  • the model training method provided by the embodiment of the present application will be introduced from the training stage first.
  • the embodiment of the present application provides an embodiment of a model training method, which can be applied to multiple languages such as Chinese, Japanese, and Korean. Since the process of model training requires a large amount of computation, this Embodiments are typically performed by a server.
  • this embodiment includes:
  • Step 101 acquire a sample character string sequence.
  • the sample character string sequence includes K sample character strings, where K is a positive integer.
  • a character string can be understood as a combination of characters, which is a carrier of language information and is used to generate words and sentences; the words and sentences may be one word or multiple words, and a single word may also constitute the words and sentences.
  • the above-mentioned input scenario can be an input scenario of multiple languages such as Chinese, Japanese and Korean; the form of the character string differs for different languages; taking Chinese as an example, the character string can include one pinyin or multiple pinyins, and in this case the character string can also be called a pinyin string, for example, the character string can be "nuoyafangzhou".
  • a sample character string refers to a character string used as a sample and used for training.
  • Each sample character string indicates one or more sample candidate words, and the sample candidate words may be one character or multiple characters.
  • for example, when the sample character string is "nuo", the corresponding sample candidate words can be "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; when the sample character string is "ya", the corresponding sample candidate words can be "亚" (Asia), "压" (pressure), "呀" (ah) and so on.
  • step 101 includes: acquiring K sample character strings in the sample character string sequence based on the K target sample words.
  • the target sample word can be converted from Chinese characters to pinyin by a phonetic converter to obtain the sample character string.
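  • as a sketch of this Chinese-character-to-pinyin step, one publicly available phonetic converter that could play this role is the third-party pypinyin library; using it here, and the example words, are illustrative assumptions, not details specified by the embodiment:

```python
from pypinyin import lazy_pinyin   # third-party pinyin converter, used only as an example

target_sample_words = "诺亚方舟很棒"          # target sample words (Chinese characters)
sample_strings = lazy_pinyin(target_sample_words)
print(sample_strings)                        # ['nuo', 'ya', 'fang', 'zhou', 'hen', 'bang']
```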
  • Step 102 Obtain K first sample character string vectors through an encoder according to the sample character string sequence, and each first sample character string vector corresponds to a sample character string.
  • the encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can adopt the network structure of the encoder part of the Transformer network , or adopt the network structure of a series of other networks obtained from the encoder part of the Transformer network.
  • although the network structure of the encoder in the embodiment of this application is similar to that of the Bert model, in that both use the network structure of the encoder part of the Transformer network, the two actually differ considerably; the following illustrates this through multiple comparisons.
  • as shown in Figure 6, the original input of the encoder in the embodiment of the present application is different from that of the Bert model.
  • the model on the left represents the Bert model: its original input is two Chinese sentences, "Noah's Ark" and "Great", separated by the separator "SEP", and the original input also includes the label "CLS" used for text classification; the model on the right represents the encoder in the embodiment of the present application: its original input is no longer two Chinese sentences but the sample character string sequence "nuo ya fang zhou hen bang", which does not require the separator "SEP", and, since the encoder does not need to classify text, the original input of the encoder does not need the token "CLS" either.
  • specifically, step 102 includes: obtaining K second position vectors and K second sample character string vectors according to the sample character string sequence, and obtaining the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
  • each second position vector represents the position of a sample character string in the sample character string sequence; taking the sample character string sequence "nuo ya fang zhou hen bang" as an example, the second position vector corresponding to the sample character string "fang" indicates the position of "fang" in the sample character string sequence "nuo ya fang zhou hen bang".
  • Each second sample character string vector represents a sample character string, wherein the second sample character string vector can be obtained through random initialization, or can be obtained through pre-training using an algorithm such as Word2Vector.
  • the second sample character string vector is different from the first sample character string vector: the second sample character string vector is generated based only on one sample character string, so it contains only the information of that sample character string itself, whereas the first sample character string vector is generated by the encoder, which fuses the information of multiple sample character strings in the process of generating it; therefore, the first sample character string vector contains not only the information of the sample character string itself but also the information of other sample character strings.
  • as shown in Figure 7, the left side of Figure 7 represents the direct input of the Bert model (that is, what the original input is converted into), which specifically includes three embedding layers; corresponding to the original input shown in Figure 6, these three embedding layers, from bottom to top, are the position embedding layer, the segment embedding layer and the token embedding layer, where the position embedding is used to distinguish the different positions of a token in the sequence; the segment embedding is used to distinguish whether the token belongs to the first Chinese sentence ("Noah's Ark") or the second Chinese sentence ("Great"), in preparation for the Next Sentence Prediction task; and the token embedding represents the semantics of the token.
  • the token is a Chinese character in a Chinese sentence.
  • for example, the token can be the Chinese character "诺" (nuo); the token can also be "SEP" or "CLS".
  • the right side of Figure 7 shows the direct input of the encoder in the embodiment of the present application, which specifically includes the position embedding layer and the token embedding layer but does not include a segment embedding layer, where the position embedding is used to distinguish the different positions of a token in the sequence, and the token embedding represents the semantics of the token.
  • the token is a pinyin or multiple pinyins, for example, the token can be "nuo" or "ya”.
  • E0 in the position embedding layer represents the position vector of "nuo"
  • Enuo in the token embedding layer represents the character vector of "nuo”.
  • the length of each direct input of the encoder in the embodiment of the present application is smaller than the length of each direct input of the Bert model.
  • specifically, the length of the original input of the Bert model should cover most documents or sentences and is usually set to 512 tokens; correspondingly, the length of the direct input of the Bert model is also 512 tokens (Figure 7 shows only 9 tokens); the encoder in the embodiment of the present application, however, is ultimately intended for the input method, that is, for receiving the user's input on the terminal device, and generally speaking the user's input is relatively short.
  • the length of the original input of the encoder in the embodiment of the present application does not need to be too long, and is usually set to 16 or 32 tokens (only 6 tokens are shown in FIG. 7 ), correspondingly, the length of the direct input of the encoder in the embodiment of the present application is also 16 or 32 tokens.
  • the length of the direct input of the encoder is small, so the number of parameters input to the encoder is small; and, taking the character string as pinyin as an example, the total number of pinyin is much smaller than the total number of Chinese characters, so the number of tokens that the encoder needs to process The total number is small; this can reduce the workload in the training process and improve the training efficiency.
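  • a minimal sketch of an encoder of this kind, built only from a position embedding, a pinyin token embedding and the encoder part of a Transformer network, with a linear layer standing in for the probability model; the use of PyTorch, and all layer sizes and names, are assumptions of this illustration, not details specified by the embodiment:

```python
import torch
import torch.nn as nn

class PinyinEncoder(nn.Module):
    """Encodes a pinyin (character string) sequence into first character string vectors
    and maps them to per-position probabilities over candidate Chinese characters."""
    def __init__(self, num_pinyin, num_chars, d_model=128, max_len=32, n_layers=4, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(num_pinyin, d_model)    # second character string vector
        self.pos_emb = nn.Embedding(max_len, d_model)         # position vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.prob_model = nn.Linear(d_model, num_chars)        # "probability model" head

    def forward(self, pinyin_ids):                             # (batch, seq_len) pinyin indices
        positions = torch.arange(pinyin_ids.size(1), device=pinyin_ids.device)
        x = self.token_emb(pinyin_ids) + self.pos_emb(positions)   # no segment / "CLS" / "SEP" input
        string_vectors = self.encoder(x)                       # first character string vectors
        return torch.log_softmax(self.prob_model(string_vectors), dim=-1)
```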
  • Step 103 based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings.
  • the second probability of the sample candidate word represents the probability of obtaining the sample candidate word according to the first sample character string vector.
  • step 103 may also include:
  • the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model.
  • the K first sample character string vectors may be input into the probability model, and the probability model will output the second probability.
  • the probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half of the deep learning model.
  • Step 104 adjust the encoder based on the second probability.
  • step 104 includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • for example, if the sample character string sequence is "nuoyafangzhouhenbang", then for the sample character string "nuo", the corresponding sample candidate words include "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; if "诺" is the target sample word, then the parameters of the encoder can be adjusted so that the second probability of "诺" increases while the second probabilities of "糯" and "懦" decrease.
  • the target sample word is equivalent to the sample label.
  • during training, the second probability of the target sample word is increased as much as possible, while the second probabilities of the sample candidate words other than the target sample word are reduced as much as possible; ideally, after adjusting the parameters of the encoder, the second probability of the target sample word is greater than the second probabilities of the other sample candidate words.
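  • one common way to realize this kind of adjustment is a cross-entropy loss over the candidate words at each position, which raises the probability of the target sample word and lowers the others; the sketch below reuses the PinyinEncoder sketched earlier and assumes an Adam optimizer and illustrative vocabulary sizes, none of which are specified by the embodiment:

```python
import torch
import torch.nn.functional as F

model = PinyinEncoder(num_pinyin=500, num_chars=8000)   # class from the earlier sketch; sizes illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(pinyin_ids, target_char_ids):
    """pinyin_ids: (batch, K) sample string indices; target_char_ids: (batch, K) target sample words."""
    log_probs = model(pinyin_ids)                              # (batch, K, num_chars) second probabilities
    loss = F.nll_loss(log_probs.view(-1, log_probs.size(-1)),  # cross-entropy: pushes up the target
                      target_char_ids.view(-1))                # word's probability, pushes down the rest
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # adjusts encoder and probability-model parameters
    return loss.item()
```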
  • Step 105 adjust the probability model based on the second probability.
  • step 105 includes: adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
  • the process of adjusting the parameters of the probability model is similar to the process of adjusting the parameters of the encoder. For details, refer to the related description of step 104 for understanding.
  • step 105 is optional, specifically, step 105 is performed when step 103 is realized by a probability model.
  • step 102 to step 105 are repeatedly executed until a condition is met, at which point training stops; the embodiment of the present application does not specifically limit the content of the condition; for example, the condition may be that the value of the loss function is less than a threshold, where the value of the loss function may be calculated according to the second probability, or the condition may be that the number of repeated executions reaches a preset number of times.
  • in the embodiment of the present application, the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors, each of which is a representation of a sample character string that fuses the information of the entire sample character string sequence rather than representing only the sample character string itself, that is, the first sample character string vector contains more information; so the second probability of the target sample word is calculated based on the first sample character string vectors, and adjusting the encoder and the probability model based on the second probability can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
  • the Ngram model may also be used in the process of generating words and sentences using the method for generating words and sentences provided by the embodiment of the present application; therefore, the training process of the Ngram model is described below.
  • the training process of the Ngram model can be understood as the process of calculating the conditional probability between words.
  • the Chinese corpus is first converted into a sequence of Chinese words through a tokenizer, and then the conditional probabilities between words are counted using statistical methods; for example, after tokenization, the Chinese word sequence "Huawei/company/recently/released/latest/flagship phone" is obtained.
  • in the formula, P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}), where C(w_{n-1}) is the total number of occurrences of the word w_{n-1} in the whole corpus and C(w_{n-1}, w_n) is the number of times the two words w_{n-1} and w_n occur together in the whole corpus; correspondingly, the probabilities of higher-order Ngram models can be calculated in the same way from the corresponding counts.
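  • a minimal sketch of this counting step on a tokenized corpus; the toy corpus and function names are purely illustrative, not data from the embodiment:

```python
from collections import Counter

tokenized_corpus = [
    ["华为", "公司", "近日", "发布", "最新", "旗舰手机"],   # toy word sequences after the tokenizer
    ["华为", "公司", "发布", "旗舰手机"],
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in tokenized_corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(w_prev, w):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev), the 2-gram conditional probability."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("公司", "发布"))   # 0.5: "发布" follows "公司" in one of the two sentences
```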
  • the embodiment of the present application provides an embodiment of a method for generating words and sentences, which can be applied to input method systems in multiple languages such as Chinese, Japanese, and Korean; the input method system can be deployed in terminal devices , can also be deployed in the cloud server; when the input method system is deployed in the cloud server, this embodiment is executed by the cloud server, and the cloud server sends the generated target words to the terminal device for display on the terminal device.
  • this embodiment includes:
  • Step 201 obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, wherein, M is a positive integer.
  • step 201 may include: obtaining a character string sequence according to user input.
  • step 201 can be understood by referring to the relevant description of step 101 for details.
  • generally, a character string indicates multiple candidate words; in the special case where only one candidate word corresponds to the character string, the character string indicates one candidate word.
  • Step 202 Obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings.
  • the encoder is trained based on a conversion task, wherein the conversion task is a task of converting sample character string sequences into sample words and sentences.
  • the training process based on the conversion task can be understood as the training process of the encoder in the training phase.
  • for details, please refer to the relevant description of the training phase above for understanding.
  • specifically, step 202 includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and obtaining the M first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
  • Step 202 is similar to step 102 and can be understood with reference to the relevant description of step 102, except that the number M of first character string vectors in step 202 may be different from the number K of first sample character string vectors.
  • Step 203 based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained.
  • step 203 includes:
  • the first probability of each candidate word indicated by the M character strings is obtained through a probability model, and the probability model is trained based on the conversion task.
  • the conversion task is the task of converting sample character string sequences into sample words and sentences.
  • the training process based on the conversion task can be understood as the training process of the probability model in the training phase.
  • for details, please refer to the relevant description of the training phase above for understanding.
  • Step 203 is similar to step 103 and can be understood with reference to the relevant description of step 103, except that the number M of first character string vectors in step 203 may be different from the number K of first sample character string vectors.
  • Step 204 based on the first probability, generate target words and sentences, the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
  • For each character string, a candidate word can be selected, based on the first probability, from all candidate words corresponding to that character string; thus, for M character strings, M candidate words can be selected, and these M candidate words form the target words and sentences.
  • For example, for each character string, the candidate word with the highest first probability is selected from all the candidate words corresponding to that character string to generate the target words and sentences.
  • For example, each of the strings "nuo", "ya", "fang", "zhou", "hen" and "bang" indicates three candidate words; for the string "nuo", the candidate word "诺" with the highest first probability is chosen.
  • Similarly, the candidate words with the highest first probability indicated by "ya", "fang", "zhou", "hen" and "bang" are chosen, namely "亚", "方", "舟", "很" and "棒"; based on this, the target sentence "诺亚方舟很棒" ("Noah's Ark is great") can be generated.
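A tiny sketch of this greedy selection, using made-up first probabilities for the example above:

```python
# First probabilities per string (hypothetical values), as produced by step 203.
first_probs = {
    "nuo":  {"诺": 0.7, "糯": 0.2, "懦": 0.1},
    "ya":   {"亚": 0.6, "压": 0.3, "呀": 0.1},
    "fang": {"方": 0.8, "房": 0.1, "放": 0.1},
    "zhou": {"舟": 0.5, "州": 0.3, "周": 0.2},
    "hen":  {"很": 0.9, "恨": 0.05, "狠": 0.05},
    "bang": {"棒": 0.6, "帮": 0.3, "绑": 0.1},
}

# Greedy generation: pick the highest-first-probability candidate for every string.
target = "".join(max(cands, key=cands.get) for cands in first_probs.values())
print(target)  # 诺亚方舟很棒
```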
  • Step 205 prompting the target word and sentence as the preferred word and sentence, which is the first word and sentence among the multiple words and sentences prompted by the input method.
  • In the input scenario, the terminal device will prompt multiple words and sentences, and the embodiment of the present application uses the target words and sentences as the preferred words and sentences for prompting; taking Figure 1 as an example, the terminal device prompts three words and sentences, among which the preferred one is "Noah's Ark is great".
  • the encoder and the Ngram model can be combined to generate target words and sentences based on the first probability output by the encoder and using the Ngram model, so as to improve the accuracy of the generated target words and sentences.
  • The embodiment of the present application can be regarded as converting the pinyin sequence y_1, y_2, ..., y_n into the corresponding word sequence w_1, w_2, ..., w_n (which can also be understood as words and sentences), that is, selecting from all word sequences the word sequence with the largest conditional probability P(w_1, w_2, ..., w_n | y_1, y_2, ..., y_n).
  • The above conditional probability P(w_1, w_2, ..., w_n | y_1, y_2, ..., y_n) can be converted into the product of the per-word conditional probabilities, and each per-word term P(w_i | y_1, y_2, ..., y_n, w_1, w_2, ..., w_{i-1}) can be further decomposed into two factors, as follows:
  • P(w_i | y_1, y_2, ..., y_n) is the first probability calculated above;
  • P(w_i | w_1, w_2, ..., w_{i-1}) is the probability calculated by the Ngram model.
  • Here the Markov assumption of the Ngram model is adopted, so the probability P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-N+1}, ..., w_{i-1}).
  • the first probability calculated above can be combined with the conditional probability calculated by the Ngram model to obtain a more accurate probability of words, thereby prompting more accurate target words and sentences.
  • step 204 includes: according to the character string sequence, obtaining, through the Ngram model, the third probability of each candidate word indicated by the M character strings; and generating the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • The third probability of a candidate word is the conditional probability that the candidate word occurs given the previous N candidate words, where the value of N can be set according to actual needs; for example, N can be 1 or 2.
  • Specifically, the first probability and the third probability corresponding to a candidate word can be multiplied to obtain a combined probability (which is also a conditional probability), and the target words and sentences are generated based on the combined probabilities and the Viterbi algorithm.
  • For example, the combined probability of the Chinese character "方" indicated by the string "fang" can be obtained by multiplying its first probability by its third probability obtained from the Ngram model.
  • the combination probability of all Chinese characters can be obtained, and then the Viterbi algorithm can be used to obtain a path with the highest probability, that is, the target sentence.
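The following sketch shows one way the combined probabilities and the Viterbi algorithm could be used together, assuming a bigram Ngram model (the third probability conditions on one previous word) and made-up probability tables; in practice these values would come from the encoder/probability model and the Ngram model.

```python
def viterbi(strings, first_probs, bigram_probs, start="<s>"):
    """Find the candidate sequence maximizing the product of combined probabilities.

    first_probs[s][w]       : first probability of candidate w for string s (from the encoder).
    bigram_probs[(prev, w)] : third probability P(w | prev) from the Ngram model.
    """
    # best[w] = (score of the best path ending in w, that path)
    best = {start: (1.0, [])}
    for s in strings:
        new_best = {}
        for w, p1 in first_probs[s].items():
            for prev, (score, path) in best.items():
                p3 = bigram_probs.get((prev, w), 1e-6)   # small floor for unseen bigrams
                cand_score = score * p1 * p3             # combined probability
                if w not in new_best or cand_score > new_best[w][0]:
                    new_best[w] = (cand_score, path + [w])
        best = new_best
    return max(best.values())[1]

# Hypothetical tables for the partial example "hen bang"
first_probs = {"hen": {"很": 0.6, "恨": 0.4}, "bang": {"棒": 0.5, "帮": 0.5}}
bigram_probs = {("<s>", "很"): 0.3, ("<s>", "恨"): 0.3, ("很", "棒"): 0.4, ("恨", "帮"): 0.1}
print(viterbi(["hen", "bang"], first_probs, bigram_probs))  # ['很', '棒']
```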
  • the dictionary can also be called a thesaurus
  • The thesaurus can include at least one of the following types: a basic thesaurus, a phrase thesaurus, a user personal thesaurus, a hotspot thesaurus, and various domain thesauruses; a domain thesaurus may be, for example, a thesaurus in the field of operating systems, a thesaurus in the field of artificial intelligence technology, and the like.
  • step 204 includes:
  • Step 301 obtain reference words from a reference dictionary.
  • the reference words include P candidate words indicated by P reference character strings, each reference character string indicates a candidate word, the P reference character strings are included in the character string sequence, and the positions in the character string sequence are continuous, wherein, P is an integer greater than 1.
  • the embodiment of the present application does not specifically limit the number of reference words, and the number of reference words may be one or multiple.
  • For example, the character string sequence is "nuoyafangzhouhenbang"; as shown in FIG. 12, the reference word obtained from the reference dictionary may be "Noah's Ark" indicated by the reference character string "nuoyafangzhou".
  • Step 302 Calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words.
  • the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word.
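For instance, the geometric mean can be computed as below (a sketch with made-up first probabilities for the four candidate words of "Noah's Ark"):

```python
import math

def fourth_probability(first_probs_of_reference):
    """Geometric mean of the first probabilities of the P candidate words in the reference word."""
    p = len(first_probs_of_reference)
    return math.prod(first_probs_of_reference) ** (1.0 / p)

# Hypothetical first probabilities of 诺, 亚, 方, 舟 for the reference word "Noah's Ark"
print(fourth_probability([0.7, 0.6, 0.8, 0.5]))  # ≈ 0.64, larger than the product 0.168
```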
  • Step 303 Generate target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings.
  • Specifically, based on the fourth probability and the first probabilities, the probabilities of all first word combinations formed by the reference word and the candidate words indicated by the other character strings can be calculated;
  • based on the first probability of each candidate word, the probabilities of all second word combinations formed by the candidate words indicated by each character string can be obtained; finally, the word combination with the highest probability is selected from all first word combinations and all second word combinations as the target words and sentences.
  • the reference word "Noah's Ark" and the three candidate words indicated by the character string "hen” and the three candidate words indicated by the character string “bang” form nine first word combinations, based on the fourth probability,
  • the first probabilities of the three candidate words indicated by the character string "hen” and the first probabilities of the three candidate words indicated by the character string "bang” can calculate the probabilities of the nine first word combinations.
  • In addition, each of the six character strings corresponds to three candidate words, forming a total of 3*3*3*3*3*3 = 729 second word combinations; the probability of each second word combination can be calculated according to the first probabilities of the candidate words.
  • Finally, the word combination with the highest probability is selected from the 9 first word combinations and the 729 second word combinations as the target words and sentences.
  • It can be seen that the first word combinations are included in the second word combinations; since the first word combinations contain the reference word, and the reference word comes from the reference dictionary, word combinations containing the reference word can be preferentially selected as the target words and sentences.
  • For this purpose, the calculation method of the fourth probability can be set in step 302 so that the fourth probability of the obtained reference word is greater than the product of the first probabilities of the candidate words in the reference word; in this way, the probability of a word combination containing the reference word becomes larger, and it can therefore be preferentially selected.
  • the fourth probability of the reference word is greater than the product of the first probabilities of the P candidate words in the reference word.
  • In addition, since the first word combinations are included in the second word combinations, when using the first probabilities to calculate the probabilities of the second word combinations, the probabilities of the first word combinations do not need to be recalculated; the first probabilities are only used to calculate the probabilities of the second word combinations other than the first word combinations.
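The following sketch enumerates the second word combinations and then scores the first word combinations with the fourth probability, overwriting the corresponding entries so they are not computed twice; all probability values are made up, and a real implementation would use the Viterbi algorithm rather than full enumeration.

```python
import math
from itertools import product as cartesian

# Hypothetical first probabilities for all six strings of "nuoyafangzhouhenbang"
first_probs = {
    "nuo": {"诺": 0.7, "糯": 0.3}, "ya": {"亚": 0.6, "压": 0.4},
    "fang": {"方": 0.8, "房": 0.2}, "zhou": {"舟": 0.5, "周": 0.5},
    "hen": {"很": 0.6, "恨": 0.4}, "bang": {"棒": 0.5, "帮": 0.5},
}
ref_word = ("诺", "亚", "方", "舟")                       # reference word from the reference dictionary
fourth_prob = math.prod([0.7, 0.6, 0.8, 0.5]) ** 0.25    # geometric mean ≈ 0.64

combos = {}
# Second word combinations: one candidate per string, probability = product of first probabilities.
for words in cartesian(*(d.keys() for d in first_probs.values())):
    combos[words] = math.prod(d[w] for d, w in zip(first_probs.values(), words))
# First word combinations: the reference word as a whole, scored with the fourth probability;
# these overwrite the corresponding second combinations, which need not be kept.
for tail in cartesian(first_probs["hen"], first_probs["bang"]):
    words = ref_word + tail
    combos[words] = fourth_prob * first_probs["hen"][tail[0]] * first_probs["bang"][tail[1]]

best = max(combos, key=combos.get)
print(best)  # ('诺', '亚', '方', '舟', '很', '棒') – the combination containing the reference word wins
```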
  • the insufficiency of the encoder and the probability model is made up for by adding a reference lexicon, so that the accuracy of the target words and sentences can be improved.
  • the encoder, the reference lexicon and the Ngram model can be combined to generate the target words and sentences.
  • step 303 includes: obtaining, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and generating the target words and sentences based on the first probability of each candidate word indicated by those other character strings, the fourth probability, the fifth probabilities and the Viterbi algorithm.
  • The embodiment of the present application regards all candidate words in the reference word as a whole, so that the conditional probabilities between the candidate words inside the reference word do not need to be calculated through the Ngram model, and only the fifth probability of the reference word needs to be calculated through the Ngram model;
  • in the process of calculating the fifth probability of the reference word, the fifth probability of the candidate word ranked first in the reference word can be calculated and used as the fifth probability of the reference word.
  • the reference word is "Noah's Ark"
  • the fourth probability of "Noah's Ark” can be calculated by step 302
  • the three candidate words indicated by the character string "hen” can be calculated by step 203
  • the first probability the first probability of the three candidate words indicated by the string "bang”
  • the fifth probability of " is used as the fifth probability of the reference word "Noah's Ark”
  • the fifth probability of the three candidate words indicated by the string "hen” and the fifth probability of the three candidate words indicated by the string "bang” are calculated through the Ngram model.
  • the fifth probability finally, based on the first probability of each candidate word indicated by other strings in the string sequence except P reference strings, the fourth probability, the fifth probability and the Viterbi algorithm can obtain the most probable word combination , and the word combination with the highest probability is used as the target word.
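A compact sketch of this combination: the lattice below has a single edge for the reference word "诺亚方舟" spanning its four strings (scored with the fourth probability) and ordinary per-string edges for "hen" and "bang" (scored with their first probabilities), while the fifth probabilities come from a bigram Ngram table; all numbers are made up for illustration.

```python
# Lattice edges: (start position, end position, word, emission probability).
edges = [
    (0, 4, "诺亚方舟", 0.64),                      # reference word, scored with its fourth probability
    (4, 5, "很", 0.6), (4, 5, "恨", 0.4),          # first probabilities of the candidates of "hen"
    (5, 6, "棒", 0.5), (5, 6, "帮", 0.5),          # first probabilities of the candidates of "bang"
]
# Fifth probabilities from a hypothetical bigram Ngram model; the entry for "诺亚方舟"
# stands for the fifth probability of its first candidate word "诺".
ngram = {("<s>", "诺亚方舟"): 0.3, ("诺亚方舟", "很"): 0.5, ("诺亚方舟", "恨"): 0.1,
         ("很", "棒"): 0.4, ("恨", "帮"): 0.1}

n = 6
# Viterbi over the lattice: state = (position reached, last word chosen).
states = {(0, "<s>"): (1.0, [])}
for pos in range(n):
    for start, end, word, emit in edges:
        if start != pos:
            continue
        for (p, prev), (score, path) in list(states.items()):
            if p != pos:
                continue
            s = score * emit * ngram.get((prev, word), 1e-6)
            if (end, word) not in states or s > states[(end, word)][0]:
                states[(end, word)] = (s, path + [word])

best_score, best_path = max((v for (p, _), v in states.items() if p == n), key=lambda v: v[0])
print(best_path)  # ['诺亚方舟', '很', '棒']
```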
  • Since the reference dictionary provides the reference word, when the Ngram model is used to calculate the probability of a candidate word that follows the reference word, if candidate words indicated by the reference character strings are needed as context, only the candidate words contained in the reference word need to be considered.
  • the target character string is a character string that is ranked after the P reference character strings in the character string sequence.
  • The fifth probability of each candidate word indicated by the target string is the conditional probability that the candidate word indicated by the target string occurs given the occurrence of Q candidate words, where Q is a positive integer whose specific value depends on the Ngram model used.
  • The Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings contain a reference character string, the Q candidate words contain the candidate word in the reference word indicated by that reference character string.
  • the fifth probability of "hen” represents the conditional probability under the occurrence of the candidate word " ⁇ ";
  • the fifth probability of "mark” indicates that a candidate word (such as hate) indicated by the candidate word "zhou” and the character string "hen” appears The conditional probability for the case of .
  • The embodiment of the present application also provides a device for generating words and sentences, including: a first acquisition unit 401, configured to acquire a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer; a first encoding unit 402, configured to obtain, according to the character string sequence and through an encoder, M first character string vectors, each of which corresponds to one of the M character strings; a second acquisition unit 403, configured to obtain, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit 404, configured to generate, based on the first probability, target words and sentences, where the target words and sentences include M target words and each target word is one of the one or more candidate words indicated by each character string.
  • The first encoding unit 402 is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain, according to the M first position vectors and the M second character string vectors, the M first character string vectors through the encoder.
  • the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
  • The second acquisition unit 403 is configured to obtain, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, where the probability model is trained based on the conversion task;
  • the conversion task is the task of converting sample character string sequences into sample words and sentences.
  • The generation unit 404 is configured to obtain, according to the character string sequence and through the Ngram model, the third probability of each candidate word indicated by the M character strings, and to generate the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • The generation unit 404 is configured to acquire a reference word from the reference dictionary, where the reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are contained in the character string sequence, and their positions in the character string sequence are continuous, P being an integer greater than 1; to calculate, based on the respective first probabilities of the P candidate words, the fourth probability of the reference word; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • The generation unit 404 is configured to obtain, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and to generate the target words and sentences based on the first probability of each candidate word indicated by those other character strings, the fourth probability, the fifth probabilities and the Viterbi algorithm.
  • The target character string is a character string that is ranked after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given the occurrence of Q candidate words, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word in the reference word indicated by that reference character string.
  • the device further includes a prompting unit 405, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  • the character string includes one pinyin or multiple pinyins.
  • The embodiment of the present application also provides a model training device, including: a third acquisition unit 501, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; a second encoding unit 502, configured to obtain, according to the sample character string sequence and through an encoder, K first sample character string vectors, each of which corresponds to one sample character string; a fourth acquisition unit 503, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit 504, configured to adjust the encoder based on the second probability.
  • The second encoding unit 502 is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain, according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors through the encoder.
  • The sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • the fourth obtaining unit 503 is configured to obtain the second probability of each sample candidate word indicated by the K sample strings through a probability model based on the K first sample string vectors; adjust Unit 504 is further configured to adjust the probability model based on the second probability.
  • The sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
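As a sketch of how such an adjustment of both the encoder and the probability model could be carried out, the snippet below uses a cross-entropy loss, which raises the second probability of each target sample word and lowers those of the other sample candidate words; PyTorch, the vocabulary sizes and the toy ids are assumptions rather than the patent's concrete training setup.

```python
import torch
import torch.nn as nn

d_model, vocab_words, vocab_strings = 128, 5000, 1000    # assumed sizes
string_emb = nn.Embedding(vocab_strings, d_model)        # second sample string vectors
pos_emb = nn.Embedding(32, d_model)                      # second position vectors
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
prob_head = nn.Linear(d_model, vocab_words)              # probability model (softmax over words)
params = [*string_emb.parameters(), *pos_emb.parameters(),
          *encoder.parameters(), *prob_head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

sample_string_ids = torch.tensor([[0, 1, 2, 3, 4, 5]])       # K = 6 sample strings (toy ids)
target_word_ids = torch.tensor([[10, 11, 12, 13, 14, 15]])   # one target sample word per string

x = string_emb(sample_string_ids) + pos_emb(torch.arange(6))
logits = prob_head(encoder(x))                               # (1, 6, vocab_words) second-probability logits
# Cross-entropy increases the second probability of each target sample word
# and decreases the second probabilities of the other sample candidate words.
loss = nn.functional.cross_entropy(logits.view(-1, vocab_words), target_word_ids.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```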
  • the third acquiring unit 501 is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
  • the sample character string includes one pinyin or multiple pinyins.
  • Figure 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • The computer device can be a terminal device or a server, and is specifically used to implement the function of the word and sentence generation device in the embodiment corresponding to Figure 13, or the function of the model training device in the embodiment corresponding to Figure 14.
  • the computer equipment 1800 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 1822 (for example, one or more than one processor) and memory 1832, and one or more storage media 1830 (such as one or more mass storage devices) for storing application programs 1842 or data 1844.
  • the memory 1832 and the storage medium 1830 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device. Furthermore, the central processing unit 1822 may be configured to communicate with the storage medium 1830 , and execute a series of instruction operations in the storage medium 1830 on the computer device 1800 .
  • Computer device 1800 can also include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input and output interfaces 1858, and/or, one or more operating systems 1841, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • The central processing unit 1822 may be used to execute the word and sentence generation method performed by the word and sentence generation device in the embodiment corresponding to FIG. 13.
  • the central processing unit 1822 can be used for:
  • obtain a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer;
  • according to the character string sequence, obtain M first character string vectors through an encoder, where each first character string vector corresponds to one of the M character strings;
  • based on the M first character string vectors, obtain the first probability of each candidate word indicated by the M character strings;
  • based on the first probability, generate target words and sentences, where the target words and sentences include M target words, and each target word is one of the one or more candidate words indicated by each character string.
  • the central processing unit 1822 may be used to execute the model training method performed by the model training device in the embodiment corresponding to FIG. 14 .
  • the central processing unit 1822 can be used for:
  • obtain a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer;
  • according to the sample character string sequence, obtain K first sample character string vectors through an encoder, where each first sample character string vector corresponds to one sample character string;
  • based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings;
  • based on the second probability, adjust the encoder.
  • the embodiment of the present application also provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the methods of the foregoing embodiments.
  • Optionally, the chip includes a memory, and the processor is connected to the memory through a circuit or wires. Further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors may implement some of the steps in the above method through dedicated hardware, for example, the processing related to the neural network model may be performed by a dedicated neural network processor or graphics processor to achieve.
  • the method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
  • The embodiment of the present application also provides a computer storage medium, which is used for storing the computer software instructions used by the above-mentioned computer device, including a program designed to be executed by the computer device.
  • the computer device may be the word-sentence generating device in the embodiment corresponding to FIG. 13 or the model training device in the embodiment corresponding to FIG. 14 .
  • the embodiment of the present application also provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the procedures in the methods shown in the foregoing embodiments.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the embodiments of the present application are a word or sentence generation method, a model training method and a related device in the field of artificial intelligence, which can be used for word or sentence recommendation in an input method. The method comprises: acquiring a character string sequence, wherein the character string sequence comprises M character strings, and each character string indicates one or more candidate words; encoding each character string into a character string vector by means of an encoder, and then, on the basis of the character string vector, acquiring a first probability of each candidate word indicated by the character string; and finally, on the basis of the first probability, generating a target word or sentence, wherein the target word or sentence comprises M target words, and each target word is one of one or more candidate words indicated by each character string. By means of the embodiments of the present application, the accuracy of a generated target word or sentence can be improved, thereby improving the recommendation accuracy of an input method.

Description

一种词句生成方法、模型训练方法及相关设备A word and sentence generation method, model training method and related equipment
本申请要求于2021年07月08日提交中国专利局、申请号为202110775982.1、发明名称为“一种词句生成方法、模型训练方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110775982.1 and the title of the invention "a method for generating words and sentences, a method for training models and related equipment" submitted to the China Patent Office on July 08, 2021, the entire contents of which are incorporated by reference incorporated in this application.
技术领域technical field
本申请涉及输入法技术领域,尤其涉及一种词句生成方法、模型训练方法及相关设备。The present application relates to the technical field of input methods, in particular to a method for generating words and sentences, a method for training models and related equipment.
背景技术Background technique
输入法编辑器是客户端必备的应用程序,广泛的应用于台式机、笔记本、手机、平板、智能电视、车载电脑等设备中;并且,用户的日常活动,如:搜索地点、查找餐馆、聊天交友、出行规划等,很大程度上会转化为用户的输入行为,所以利用输入法编辑器的数据能够对用户进行精准的刻画。因此,输入法编辑器在互联网领域,具有重大的战略意义。The input method editor is a necessary application program for the client, and is widely used in desktop computers, notebooks, mobile phones, tablets, smart TVs, car computers and other devices; and the user's daily activities, such as: searching for places, finding restaurants, Chatting and making friends, travel planning, etc., will largely be transformed into user input behaviors, so the data of the input method editor can be used to accurately describe users. Therefore, input method editors have great strategic significance in the Internet field.
在输入场景下,用户在设备上输入字符(例如拼音)后,输入法编辑器会生成词句(词语或句子)并提示该词句以供用户选择,生成的词句的准确率直接影响输入法编辑器的准确率以及用户的体验;为此,需要一种能够准确生成词句的方法。In the input scenario, after the user enters characters (such as pinyin) on the device, the input method editor will generate words (words or sentences) and prompt the words and sentences for the user to choose. The accuracy of the generated words and sentences directly affects the input method editor. The accuracy rate and user experience; for this, a method that can accurately generate words and sentences is needed.
发明内容Contents of the invention
本申请实施例提供了一种词句生成方法、模型训练方法及相关设备,该方法能够提高生成的词句的准确率。Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment. The method can improve the accuracy of generated words and sentences.
本申请实施例第一方面提供了一种词句生成方法,该方法可以应用于终端设备,也可以应用于云端服务器,具体包括:获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语;其中,字符串可以理解为字符的组合,是一种语言信息的载体,承载发音信息,用于生成词语或句子;对应不同种类的语言,字符串的形式不同,以中文为例,字符串可以包括一个拼音或多个拼音,M为正整数;根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;编码器可以理解为一个深度学习网络模型,编码器的网络结构有多种,本申请实施例对此不做具体限定;具体地,编码器的网络结构可以采用Transformer网络的编码器部分的网络结构,或采用由Transformer网络的编码器部分得到的一系列其他网络的网络结构;基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率,候选词语的第一概率可以理解为,在用户输入字符串的情况下,用户从该字符串指示的所有候选词语中选择当前候选词语的概率;基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个,具体地,目标词句可以是一个词语,也可以是一个句子。The first aspect of the embodiment of the present application provides a method for generating words and sentences, which can be applied to terminal devices or cloud servers, and specifically includes: obtaining a character string sequence, the character string sequence includes M character strings, each character A string indicates one or more candidate words; among them, a string can be understood as a combination of characters, which is a carrier of language information, carries pronunciation information, and is used to generate words or sentences; corresponding to different types of languages, the form of a string is different , taking Chinese as an example, the string can include one pinyin or multiple pinyin, and M is a positive integer; according to the string sequence, through the encoder, M first string vectors are obtained, and each first string vector corresponds to M A character string in the character string; the encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can be Transformer The network structure of the encoder part of the network, or the network structure of a series of other networks obtained by the encoder part of the Transformer network; based on the M first character string vectors, obtain the first word of each candidate word indicated by the M character strings One probability, the first probability of a candidate word can be understood as, in the case of a user inputting a character string, the user selects the probability of the current candidate word from all the candidate words indicated by the character string; based on the first probability, a target word and sentence is generated, and the target The words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string. Specifically, the target words and sentences may be a word or a sentence.
通过编码器对字符串序列进行编码,以得到第一字符串向量,该第一字符串向量是融合了整个字符串序列的信息后对字符串的表示,而不仅仅表示字符串本身,即第一字符串向量包含了较多的信息;所以基于第一字符串向量计算目标词语的第一概率,并基于第一概率生成目标词句,能够提高生成的目标词句的准确率,从而提高输入法的准确度。The string sequence is encoded by the encoder to obtain the first string vector, which is the representation of the string after fusing the information of the entire string sequence, not just the string itself, that is, the first A character string vector contains more information; so calculating the first probability of the target word based on the first character string vector and generating the target word and sentence based on the first probability can improve the accuracy of the generated target word and sentence, thereby improving the input method. Accuracy.
作为一种可实现的方式,根据字符串序列,通过编码器,得到M个第一字符串向量包 括:根据字符串序列获取M个第一位置向量和M个第二字符串向量,每个第一位置向量表示一个字符串在字符串序列中的位置,每个第二字符串向量表示一个字符串;根据M个第一位置向量和M个第二字符串向量,通过编码器,得到多个第一字符串向量。As an achievable way, obtaining M first character string vectors through the encoder according to the character string sequence includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, each A position vector represents the position of a character string in the character string sequence, and each second character string vector represents a character string; according to M first position vectors and M second character string vectors, through an encoder, multiple The first string vector.
Bert模型需要根据词语的位置向量、词语的向量、用于区分词语位于第一个句子还是第二个句子的向量,以及与分割符“SEP”和标记“CLS”相关的向量,才能编码得到词语的向量,而在本申请实施例中,仅根据字符串的第一位置向量和第二字符串向量这两种向量,即可通过编码器得到第一字符串向量;因此,本申请实施例中的编码器需要处理的向量更少,编码效率较高,从而提高输入法的反应速度。The Bert model needs to encode the words based on the position vector of the word, the vector of the word, the vector used to distinguish whether the word is in the first sentence or the second sentence, and the vector related to the separator "SEP" and the tag "CLS". , and in the embodiment of the present application, the first character string vector can be obtained by the encoder only according to the two vectors of the first position vector and the second character string vector of the character string; therefore, in the embodiment of the present application The encoder needs to process fewer vectors, and the encoding efficiency is higher, thereby improving the response speed of the input method.
作为一种可实现的方式,编码器是基于转换任务训练得到的,其中,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, where the conversion task is the task of converting a sequence of sample strings into sample words and sentences.
在应用阶段,利用编码器将字符串转换成第一字符串向量,再利用第一字符串向量得到目标词句,由此可见,在应用阶段,编码器的功能与基于转换任务训练的过程中编码器的功能类似;因此,将基于转换任务训练得到的编码器,用于编码字符串序列,能够提高编码器的编码准确率,从而提高输入法的准确度。In the application phase, the encoder is used to convert the string into the first string vector, and then the first string vector is used to obtain the target words and sentences. It can be seen that in the application phase, the function of the encoder is the same as that of encoding in the process of training based on conversion tasks. The function of the encoder is similar; therefore, the encoder trained based on the conversion task is used to encode the string sequence, which can improve the encoding accuracy of the encoder, thereby improving the accuracy of the input method.
作为一种可实现的方式,基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率包括:基于M个第一字符串向量,通过概率模型,获取M个字符串指示的每个候选词语的第一概率,概率模型是基于转换任务训练得到的,其中,概率模型和编码器可以看成一个整体,即一个深度学习模型,而编码器可以看成是这个深度学习模型的前半部分,概率模型可以看成是这个深度学习模型的后半部分;其中,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, based on the M first character string vectors, obtaining the first probability of each candidate word indicated by the M character strings includes: based on the M first character string vectors, through a probability model, obtaining M The first probability of each candidate word indicated by the string. The probability model is trained based on the conversion task. The probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as this The first half of the deep learning model, the probability model can be regarded as the second half of the deep learning model; among them, the conversion task is the task of converting the sequence of sample strings into sample words and sentences.
通过概率模型获取候选词语的第一概率,能够提高第一概率的准确性;并且,与编码器类似,在应用阶段,概率模型的功能与基于转换任务训练的过程中概率模型的功能类似,因此,将基于转换任务训练得到的概率模型,用于计算第一概率,可以提高第一概率的准确性,从而提高输入法的准确度。Obtaining the first probability of candidate words through the probability model can improve the accuracy of the first probability; and, similar to the encoder, in the application phase, the function of the probability model is similar to that of the probability model in the process of training based on the conversion task, so , using the probability model trained based on the conversion task to calculate the first probability, which can improve the accuracy of the first probability, thereby improving the accuracy of the input method.
作为一种可实现的方式,基于第一概率,生成目标词句包括:根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率,其中,对于任意一个候选词语来说,该候选词语的第三概率表示在前面一个或多个候选词语出现的情况下,该候选词语出现的条件概率;基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句,维特比Viterbi算法是一种动态规划算法,用于寻找最有可能产生观测事件序列的维特比路径,该维特比路径也可以称为最优路径,其中,Viterbi算法也可以称为有穷状态转换器(Finite State Transducers,FST)算法。As an achievable way, based on the first probability, generating the target word and sentence includes: according to the character string sequence, through the Ngram model, obtaining the third probability of each candidate word indicated by M character strings, wherein, for any candidate word For example, the third probability of the candidate word represents the conditional probability of the occurrence of the candidate word when one or more candidate words appear in front; based on the first probability, the third probability and the Viterbi algorithm, the target word and sentence is generated, The Viterbi algorithm is a dynamic programming algorithm, which is used to find the Viterbi path that is most likely to produce a sequence of observed events. The Viterbi path can also be called the optimal path, and the Viterbi algorithm can also be called a finite state transition. Transducer (Finite State Transducers, FST) algorithm.
候选词语的第一概率可以理解为候选词语在字符串序列出现的情况下的条件概率,而候选词语的第三概率可以理解为当前候选词语在其他候选词语出现的情况下的条件概率,所以在生成目标词句的过程中,既考虑候选词语的第一概率,又考虑通过Ngram模型计算得到的候选词语的第三概率,有利于生成准确率较高的目标词句。The first probability of a candidate word can be understood as the conditional probability of the candidate word in the presence of a string sequence, and the third probability of the candidate word can be understood as the conditional probability of the current candidate word in the presence of other candidate words, so in In the process of generating target words and sentences, both the first probability of candidate words and the third probability of candidate words calculated by the Ngram model are considered, which is conducive to generating target words and sentences with higher accuracy.
作为一种可实现的方式,基于第一概率,生成目标词句包括:从参考词典中获取参考词语,参考词典可以包括以下至少一种类型的词库:基础词库、短语词库、用户个人词库、 热点词库、各种领域词库,参考词语的数量可以为一个,也可以为多个,参考词语包括P个参考字符串指示的P个候选词语,每个参考字符串指示一个候选词语,P个参考字符串包含于字符串序列中,且在字符串序列中的位置连续,其中,P为大于1的整数;基于P个候选词语各自的第一概率,计算参考词语的第四概率,第四概率表示用户在输入P个参考字符串的情况下,选择参考词语的可能性;计算参考词语的第四概率方法有多种,例如,可以将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率;基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。As an achievable manner, based on the first probability, generating target words and sentences includes: obtaining reference words from a reference dictionary, and the reference dictionary may include at least one of the following types of thesaurus: basic thesaurus, phrase thesaurus, user personal words Library, hotspot thesaurus, various field thesaurus, the quantity of reference word can be one, also can be multiple, and reference word includes P candidate words indicated by P reference character strings, and each reference character string indicates a candidate word , P reference strings are included in the string sequence, and the positions in the string sequence are continuous, where P is an integer greater than 1; based on the first probabilities of the P candidate words, calculate the fourth probability of the reference word , the fourth probability indicates the possibility of the user selecting a reference word when inputting P reference character strings; there are many ways to calculate the fourth probability of a reference word, for example, the geometry of the first probability of P candidate words The average value is used as the fourth probability of the reference word; based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, a target word and sentence is generated.
由于编码器和概率模型的训练和下发往往周期比较长,不能及时反映用户输入趋势的变化、用户输入场景的变化,且难以应对网络出现的新词和热词,而参考词典可以提供多种场景下的词语、新出现的词语或热词等作为参考词语,以协助生成目标词句,从而可以弥补编码器和概率模型的不足,提高目标词句的准确率。Since the training and distribution of encoders and probability models often takes a long period, it cannot reflect changes in user input trends and user input scenarios in a timely manner, and it is difficult to cope with new words and hot words that appear on the Internet. The reference dictionary can provide a variety of Words in the scene, new words or hot words are used as reference words to assist in the generation of target words and sentences, which can make up for the shortcomings of encoders and probability models and improve the accuracy of target words and sentences.
作为一种可实现的方式,基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句包括:通过Ngram模型,获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及参考词语的第五概率;基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。As an achievable way, based on the fourth probability and the first probability of each candidate word indicated by other strings in the string sequence except the P reference strings, generating the target word and sentence includes: through the Ngram model, obtaining the character The fifth probability of each candidate word indicated by other strings except P reference strings in the string sequence, and the fifth probability of the reference word; based on each The first probability, the fourth probability, the fifth probability and the Viterbi algorithm of a candidate word to generate the target word and sentence.
其中,本申请实施例将参考词语中的所有候选词语看成一个整体,这样,就不需要通过Ngram模型计算参考词语内部的候选词语之间的条件概率,仅需通过Ngram模型计算参考词语的第五概率即可;在计算参考词语的第五概率的过程中,可以计算参考词语中排在第一位的候选词语的第五概率,并将排在第一位的候选词语的第五概率作为参考词语的第五概率。Wherein, the embodiment of the present application regards all candidate words in the reference words as a whole, so that it is not necessary to calculate the conditional probability between the candidate words inside the reference words through the Ngram model, and only need to calculate the first position of the reference words through the Ngram model. Five probabilities are enough; in the process of calculating the fifth probability of the reference word, the fifth probability of the first candidate word in the reference word can be calculated, and the fifth probability of the first candidate word can be used as The fifth probability of the reference word.
在该实现方式中,不仅利用了Ngram模型,还利用了参考词典,基于前文对Ngram模型和参考词典的相关说明可知,该实现方式能够融合参考词典和Ngram模型的优点,从而进一步提升目标词句的准确率。In this implementation, not only the Ngram model is used, but also the reference dictionary is used. Based on the previous descriptions of the Ngram model and the reference dictionary, it can be known that this implementation can combine the advantages of the reference dictionary and the Ngram model, thereby further improving the accuracy of the target words and sentences. Accuracy.
作为一种可实现的方式,目标字符串为字符串序列中排在P个参考字符串之后的字符串;目标字符串指示的每个候选词语的第五概率是,在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率,Q为正整数;Q个候选词语包括字符串序列中,排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语,且当Q个字符串包含参考字符串时,Q个候选词语包含参考字符串指示的参考词语中的候选词语。As an achievable way, the target character string is the character string after the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is, among the Q candidate words that appear The conditional probability of the occurrence of the candidate word indicated by the target character string in the case, Q is a positive integer; the Q candidate words include one of each character string indicated by each character string in the Q consecutive character strings before the target character string in the character string sequence candidate words, and when the Q character strings include the reference character string, the Q candidate words include candidate words in the reference words indicated by the reference character string.
作为一种可实现的方式,在基于第一概率,生成目标词句之后,方法还包括:As an achievable manner, after generating the target word and sentence based on the first probability, the method further includes:
将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。Prompting the target word and sentence as the preferred word and sentence, the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
在输入场景中,终端设备会提示多个词句,本申请实施例将目标词句作为首选词句进行提示,从而可以将用户选择的可能性最大的目标词句优先提示给用户,以提高用户的输入效率。In the input scene, the terminal device will prompt multiple words and sentences. In the embodiment of the present application, the target words and sentences are prompted as the preferred words and sentences, so that the target words and sentences with the highest possibility of the user's choice can be preferentially prompted to the user, so as to improve the user's input efficiency.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
基于字符串包括一个或多个拼音,该实现方式为本申请实施例的方法提供了具体的中文应用场景。Based on the fact that the character string includes one or more Pinyin, this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
本申请实施例第二方面提供了一种模型训练方法,包括:获取样本字符串序列,样本字符串序列包括K个样本字符串,每个样本字符串指示一个或多个样本候选词语,其中,K为正整数;根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个样本字符串向量对应一个样本字符串;基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率;基于第二概率,对编码器进行调整。The second aspect of the embodiment of the present application provides a model training method, including: obtaining a sample string sequence, the sample string sequence includes K sample strings, and each sample string indicates one or more sample candidate words, wherein, K is a positive integer; according to the sequence of sample strings, through the encoder, K first sample string vectors are obtained, and each sample string vector corresponds to a sample string; based on the K first sample string vectors, obtain The second probability of each sample candidate word indicated by the K sample character strings; based on the second probability, the encoder is adjusted.
由于第一方面对字符串、编码器、字符串序列以及第一概率等进行了说明,所以可参照第一方面的相关说明,对第二方面中的字符串、编码器、字符串序列以及第二概率进行理解。Since the first aspect describes the character string, the encoder, the character string sequence, and the first probability, etc., the character string, the encoder, the character string sequence, and the second Two probability to understand.
通过编码器对样本字符串序列进行编码,以得到第一样本字符串向量,该第一样本字符串向量是融合了整个样本字符串序列的信息后对样本字符串的表示,而不仅仅表示样本字符串本身,即第一样本字符串向量包含了较多的信息;所以基于第一样本字符串向量计算目标样本词语的第二概率,并基于第二概率对编码器进行调整,能够提高训练出的编码器和概率模型的准确度,从而提高输入法的准确度。The sample string sequence is encoded by the encoder to obtain the first sample string vector, which is a representation of the sample string after fusing the information of the entire sample string sequence, not just Indicates the sample string itself, that is, the first sample string vector contains more information; so the second probability of the target sample word is calculated based on the first sample string vector, and the encoder is adjusted based on the second probability, It can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
作为一种可实现的方式,根据样本字符串序列,通过编码器,得到K个第一样本字符串向量包括:根据样本字符串序列获取K个第二位置向量和K个第二样本字符串向量,每个第二位置向量表示一个样本字符串在样本字符串序列中的位置,每个第二样本字符串向量表示一个样本字符串;根据K个第二位置向量和K个第二样本字符串向量,通过编码器,得到K个第一样本字符串向量。As an achievable way, obtaining K first sample character string vectors through an encoder according to the sample character string sequence includes: obtaining K second position vectors and K second sample character strings according to the sample character string sequence Vector, each second position vector represents the position of a sample character string in the sample character string sequence, and each second sample character string vector represents a sample character string; according to K second position vectors and K second sample characters The string vectors are passed through the encoder to obtain K first sample string vectors.
在本申请实施例中,根据样本字符串的第二位置向量和第二样本字符串向量,即可通过编码器得到第一样本字符串向量;而Bert模型除了需要词语的位置向量、词语的向量外,还需要用于区分词语位于第一个句子还是第二个句子的向量,以及与分割符“SEP”和标记“CLS”相关的向量;因此,本申请实施例中的编码器需要处理的向量更少,编码效率更高,从而可以提高训练效率。In the embodiment of the present application, according to the second position vector of the sample character string and the second sample character string vector, the first sample character string vector can be obtained through the encoder; and the Bert model needs the position vector of the word, the word In addition to the vector, the vector used to distinguish whether the word is in the first sentence or the second sentence, and the vector related to the separator "SEP" and the tag "CLS" are also needed; therefore, the encoder in the embodiment of the application needs to process The number of vectors is less and the encoding efficiency is higher, which can improve the training efficiency.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语,其中,目标样本词语相当于样本标签;相应地,基于第二概率,对编码器进行调整包括:调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable way, the sample candidate words indicated by each sample string contain a target sample word, where the target sample word is equivalent to the sample label; correspondingly, based on the second probability, adjusting the encoder includes: The parameters of the encoder are adjusted so that the second probability of the target sample word increases, and/or so that the second probabilities of other sample candidate words except the target sample word decrease.
例如,样本字符串序列为“nuoyafangzhouhenbang”,对于其中的样本字符串“nuo”来说,对应的样本候选词语包括“诺”、“糯”、“懦”等,令“诺”为目标样本词语,则可以通过调整编码器的参数,使得“诺”的第二概率增加,“糯”和“懦”第二概率降低。For example, the sample string sequence is "nuoyafangzhouhenbang", for the sample string "nuo", the corresponding sample candidate words include "nuo", "waxy", "cowardly", etc., let "nuo" be the target sample word , then by adjusting the parameters of the encoder, the second probability of "nuo" can be increased, and the second probability of "waxy" and "cowardly" can be reduced.
在该实现方式中,预先设定目标样本词语,在训练过程中,通过调整编码器的参数,使得目标样本词语的第二概率增加和/或除目标样本词语外的其他样本候选词语的第二概率降低,进而使得目标样本词语的第二概率大于其他样本候选词语的第二概率,从而实现对编码器的训练。In this implementation, the target sample words are preset, and during the training process, by adjusting the parameters of the encoder, the second probability of the target sample words increases and/or the second probability of other sample candidate words except the target sample words increases. The probability is reduced, so that the second probability of the target sample word is greater than the second probability of other sample candidate words, thereby realizing the training of the encoder.
作为一种可实现的方式,基于K个第一样本字符串向量,获取K个样本字符串指示的 每个样本候选词语的第二概率包括:基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率;相应地,在基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率之后,方法还包括:基于第二概率,对概率模型进行调整。As an achievable manner, based on the K first sample character string vectors, obtaining the second probability of each sample candidate word indicated by the K sample character strings includes: based on the K first sample character string vectors, by The probability model obtains the second probability of each sample candidate word indicated by K sample strings; correspondingly, based on K first sample string vectors, obtains the second probability of each sample candidate word indicated by K sample strings After the second probability, the method further includes: adjusting the probability model based on the second probability.
通过概率模型获取样本候选词语的第二概率,能够提高第二概率的准确性;而基于第二概率对概率模型进行调整,可以提高概率模型输出的第二概率的准确性。Obtaining the second probability of the sample candidate words through the probability model can improve the accuracy of the second probability; and adjusting the probability model based on the second probability can improve the accuracy of the second probability output by the probability model.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语;基于第二概率,对概率模型进行调整包括:调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable way, the sample candidate words indicated by each sample character string contain a target sample word; based on the second probability, adjusting the probability model includes: adjusting the parameters of the probability model so that the second probability of the target sample word The second probability is increased, and/or the second probability of other sample candidate words except the target sample word is decreased.
在该实现方式中,预先设定目标样本词语,在训练过程中,通过调整概率模型的参数,使得目标样本词语的第二概率增加和/或除目标样本词语外的其他样本候选词语的第二概率降低,进而使得目标样本词语的第二概率大于其他样本候选词语的第二概率,从而实现对概率模型的训练。In this implementation, the target sample words are preset, and during the training process, by adjusting the parameters of the probability model, the second probability of the target sample words increases and/or the second probability of other sample candidate words except the target sample words increases. The probability is reduced, so that the second probability of the target sample word is greater than the second probability of other sample candidate words, thereby realizing the training of the probability model.
作为一种可实现的方式,获取样本字符串序列包括:基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an achievable manner, obtaining the sample character string sequence includes: obtaining K sample character strings in the sample character string sequence based on K target sample words.
基于目标样本词语获取样本字符串,可以提高样本字符串的获取效率。Obtaining sample character strings based on target sample words can improve the efficiency of obtaining sample character strings.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
基于字符串包括一个或多个拼音,该实现方式为本申请实施例的方法提供了具体的中文应用场景。Based on the fact that the character string includes one or more Pinyin, this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
本申请实施例第三方面提供了一种词句生成装置,包括:第一获取单元,用于获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语,其中,M为正整数;第一编码单元,用于根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;第二获取单元,用于基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率;生成单元,用于基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。The third aspect of the embodiment of the present application provides a word and sentence generation device, including: a first acquisition unit, configured to acquire a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, Wherein, M is a positive integer; the first encoding unit is used to obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings; The second acquisition unit is used to obtain the first probability of each candidate word indicated by M character strings based on M first character string vectors; the generation unit is used to generate target words and sentences based on the first probability, and the target words and sentences include M target words, each target word is one of one or more candidate words indicated by each character string.
As an implementable manner, the first encoding unit is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain a plurality of first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
作为一种可实现的方式,编码器是基于转换任务训练得到的,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
As an implementable manner, the second acquisition unit is configured to obtain, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, the probability model being trained on a conversion task, i.e. the task of converting a sample character string sequence into sample words and sentences.
作为一种可实现的方式,生成单元,用于根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;基于第一概率,第三概率以及维特比Viterbi 算法,生成目标词句。As an achievable way, the generation unit is used to obtain the third probability of each candidate word indicated by the M strings through the Ngram model according to the string sequence; based on the first probability, the third probability and Viterbi Algorithm to generate target words and sentences.
As an implementable manner, the generation unit is configured to obtain a reference word from a reference dictionary, the reference word including P candidate words indicated by P reference character strings, each reference character string indicating one candidate word, the P reference character strings being contained in the character string sequence at consecutive positions, where P is an integer greater than 1; to calculate a fourth probability of the reference word based on the respective first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
作为一种可实现的方式,生成单元,用于通过Ngram模型,获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及参考词语的第五概率;基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。As an achievable way, the generation unit is used to obtain the fifth probability of each candidate word indicated by other strings in the string sequence except P reference strings, and the fifth probability of the reference word through the Ngram model ; Based on the first probability, the fourth probability, the fifth probability and the Viterbi algorithm of each candidate word indicated by other character strings except P reference character strings in the character string sequence, generate target words and sentences.
As an implementable manner, the target character string is the character string that follows the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string appears given that Q candidate words appear, where Q is a positive integer; the Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings contain a reference character string, the Q candidate words include the candidate words of the reference word indicated by that reference character string.
作为一种可实现的方式,该装置还包括提示单元,用于将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。As an achievable manner, the device further includes a prompting unit, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例第一方面的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the first aspect of the embodiment of the present application.
The fourth aspect of the embodiments of the present application provides a model training apparatus, including: a third acquisition unit, configured to acquire a sample character string sequence, the sample character string sequence including K sample character strings, each sample character string indicating one or more sample candidate words, where K is a positive integer; a second encoding unit, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, each first sample character string vector corresponding to one sample character string; a fourth acquisition unit, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit, configured to adjust the encoder based on the second probability.
As an implementable manner, the second encoding unit is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语;调整单元,用于调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word; the adjustment unit is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or In order to reduce the second probability of other sample candidate words except the target sample word.
作为一种可实现的方式,第四获取单元,用于基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率;调整单元,还用于基于第二概率,对概率模型进行调整。As an achievable way, the fourth acquisition unit is used to obtain the second probability of each sample candidate word indicated by the K sample strings through a probability model based on the K first sample string vectors; the adjustment unit , is also used to adjust the probability model based on the second probability.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词 语;调整单元,用于调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word; the adjustment unit is configured to adjust the parameters of the probability model, so that the second probability of the target sample word increases, and/or In order to reduce the second probability of other sample candidate words except the target sample word.
作为一种可实现的方式,第三获取单元,用于基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an implementable manner, the third acquiring unit is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例第二方面的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the second aspect of the embodiment of the present application.
The fifth aspect of the embodiments of the present application provides a computer device, including one or more processors and a memory, where the memory stores computer-readable instructions; the one or more processors read the computer-readable instructions so that the computer device implements the method of any implementation manner of the first aspect.
The sixth aspect of the embodiments of the present application provides a training device, including one or more processors and a memory, where the memory stores computer-readable instructions; the one or more processors read the computer-readable instructions so that the training device implements the method of any implementation manner of the second aspect.
The seventh aspect of the embodiments of the present application provides a computer-readable storage medium, including computer-readable instructions which, when run on a computer, cause the computer to execute the method of any implementation manner of the first aspect or the second aspect.
本申请实施例第八方面提供了一种芯片,包括一个或多个处理器。处理器中的部分或全部用于读取并执行存储器中存储的计算机程序,以执行上述第一方面或第二方面任意可能的实现方式中的方法。The eighth aspect of the embodiment of the present application provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the method in any possible implementation manner of the first aspect or the second aspect above.
Optionally, the chip includes a memory, and the memory is connected to the processor through a circuit or wires. Further optionally, the chip also includes a communication interface to which the processor is connected. The communication interface is used to receive data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes them, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
在一些实现方式中,一个或多个处理器中还可以有部分处理器是通过专用硬件的方式来实现以上方法中的部分步骤,例如涉及神经网络模型的处理可以由专用神经网络处理器或图形处理器来实现。In some implementations, some of the one or more processors can also implement some steps in the above method through dedicated hardware. For example, the processing related to the neural network model can be performed by a dedicated neural network processor or graphics processor to achieve.
本申请实施例提供的方法可以由一个芯片实现,也可以由多个芯片协同实现。The method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
本申请实施例第九方面提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现上述第一方面或第二方面中任意一种实现方式的方法。The ninth aspect of the embodiments of the present application provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement any one of the above first or second aspects. Methods.
附图说明Description of drawings
图1为本申请实施例的应用场景示意图;FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application;
图2为本申请实施例中词序列的示意图;Fig. 2 is the schematic diagram of word sequence in the embodiment of the present application;
图3为预训练语言模型的示意图;Fig. 3 is the schematic diagram of pre-training language model;
图4为本申请实施例的系统架构示意图;FIG. 4 is a schematic diagram of the system architecture of the embodiment of the present application;
图5为本申请实施例提供的模型训练方法的一个实施例的示意图;FIG. 5 is a schematic diagram of an embodiment of the model training method provided by the embodiment of the present application;
图6为本申请实施例中的编码器和Bert模型的原始输入的对比示意图;Fig. 6 is a comparative schematic diagram of the original input of the encoder and the Bert model in the embodiment of the present application;
图7为本申请实施例中的编码器和Bert模型的直接输入的对比示意图;Fig. 7 is a comparative schematic diagram of the direct input of the encoder and the Bert model in the embodiment of the present application;
图8为本申请实施例提供的词句生成方法的一个实施例的示意图;FIG. 8 is a schematic diagram of an embodiment of a method for generating words and sentences provided by an embodiment of the present application;
图9为本申请实施例中候选词语的实施例示意图;FIG. 9 is a schematic diagram of an embodiment of candidate words in the embodiment of the present application;
图10为本申请实施例中第一概率和第三概率结合的示意图;Fig. 10 is a schematic diagram of the combination of the first probability and the third probability in the embodiment of the present application;
图11为本申请实施例中生成目标词句的实施例示意图;FIG. 11 is a schematic diagram of an embodiment of generating target words and sentences in the embodiment of the present application;
图12为本申请实施例参考词典的使用示意图;FIG. 12 is a schematic diagram of the use of the reference dictionary in the embodiment of the present application;
图13为本申请实施例提供的一种词句生成装置的结构示意图;FIG. 13 is a schematic structural diagram of a device for generating words and sentences provided by an embodiment of the present application;
图14为本申请实施例提供的一种模型训练装置的结构示意图;FIG. 14 is a schematic structural diagram of a model training device provided in an embodiment of the present application;
图15为本申请实施例提供的计算机设备的一种结构示意图。FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
具体实施方式detailed description
本申请实施例提供了一种词句生成方法、模型训练方法及相关设备,该方法能够提升生成的词句的准确率,进而提升输入法的准确率以及用户的体验。Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment. The method can improve the accuracy of generated words and sentences, thereby improving the accuracy of input methods and user experience.
本申请实施例可以应用于图1所示的输入场景中。The embodiment of the present application can be applied to the input scenario shown in FIG. 1 .
In this input scenario, the user can input a character string on a terminal device; correspondingly, the input method editor (IME) deployed inside the terminal device receives the character string input by the user, generates corresponding words and sentences according to the character string, and then prompts the words and sentences to the user.
A character string can be understood as a combination of characters and is a carrier of language information used to generate words and sentences; the words and sentences may be one word or multiple words, and a single character can also be a word.
The above input scenario may be an input scenario in any of multiple languages such as Chinese and Japanese; for different languages, the character string takes different forms. Taking Chinese as an example, a character string may include one pinyin or multiple pinyins. Specifically, as shown in FIG. 1, when the character string "nuoyafangzhou" is input, the words and sentences prompted by the input method editor are 诺亚方舟很棒 ("Noah's Ark is great"), 诺亚方舟 and 诺亚.
在本申请实施例中,终端设备可以为台式电脑、笔记本电脑、平板电脑、智能手机、智能电视,除此之外,终端设备还可以为车载电脑等其他任意可以部署输入法编辑器的设备。In this embodiment of the application, the terminal device may be a desktop computer, a notebook computer, a tablet computer, a smart phone, or a smart TV. In addition, the terminal device may also be any other device that can deploy an input method editor such as a vehicle computer.
It can be understood that, in the example shown in FIG. 1, the prompted words and sentences include 诺亚方舟很棒 ("Noah's Ark is great"). It can be seen that the prompted words and sentences are fairly accurate, which clearly improves the user's input efficiency and user experience.
然而,随着移动互联网的发展,一方面,用户所采用的语言越来越丰富,网络新词层出不穷;另一方面,输入法应用的场景也越来越广泛、越来越多样化。因此,输入法编辑器提示词句的难度大大增加。However, with the development of the mobile Internet, on the one hand, the language used by users is becoming more and more abundant, and new words on the Internet emerge in an endless stream; on the other hand, the application scenarios of input methods are becoming more and more extensive and diverse. Therefore, the difficulty of prompting words and sentences of the input method editor is greatly increased.
为了能够准确地向用户提示词句,本申请实施例提供了一种词句生成方法,该方法是利用编码器将用户输入的字符串(例如拼音)编码为字符串向量,然后基于字符串向量生成目标词句,以提高生成的词句的准确性。In order to be able to accurately prompt words and sentences to the user, an embodiment of the present application provides a method for generating words and sentences, which uses an encoder to encode a character string (such as pinyin) input by the user into a character string vector, and then generates a target based on the character string vector phrases to improve the accuracy of the generated phrases.
为了便于理解,下面先对本申请实施例提及的专业术语进行解释。For ease of understanding, the technical terms mentioned in the embodiments of the present application are firstly explained below.
输入法首选词:当用户输入字符串的时候,输入法编辑器会提供给用户一个候选列表, 该候选列表用于向用户提示词句,排在候选列表第一位的被称为输入法的首选词。Input method preferred word: When the user enters a character string, the input method editor will provide the user with a candidate list, which is used to prompt the user for words and sentences, and the first in the candidate list is called the preferred input method word.
Transformer网络结构:一种深度神经网络结构,包含输入层、self-attention层、Feed-forward层、归一化层等子结构。Transformer network structure: a deep neural network structure, including input layer, self-attention layer, feed-forward layer, normalization layer and other substructures.
Bert模型:具有Transformer网络结构的一种模型,并且,在Transformer网络结构的基础上提出了“预训练+微调”的学习范式,设计了Masked Language Model和Next Sentence Prediction两个预训练任务。Bert model: A model with a Transformer network structure, and on the basis of the Transformer network structure, a "pre-training + fine-tuning" learning paradigm is proposed, and two pre-training tasks, Masked Language Model and Next Sentence Prediction, are designed.
Ngram模型:一种被广泛应用在汉语输入法任务中的模型。Ngram model: A model widely used in Chinese input method tasks.
Zero probability problem: during the use of the Ngram model, in some cases the probability value is computed as zero, and zero-valued probabilities cause many problems in engineering implementation; for example, because of zero probabilities, the probabilities cannot be compared in magnitude and a result can only be returned at random.
Smoothing algorithm: an algorithm designed to solve the zero probability problem of the Ngram model. When a zero-probability risk is detected, the smoothing algorithm usually uses the stable but less accurate low-order Ngram probability to approximate, in some way, the unstable but accurate high-order Ngram probability.
Viterbi algorithm: a dynamic programming algorithm for finding the Viterbi path, i.e. the hidden state sequence, that is most likely to produce the observed event sequence, especially in the context of Markov information sources and hidden Markov models; it is now commonly used in speech recognition, keyword recognition, computational linguistics and bioinformatics. The Viterbi algorithm may also be called the finite state transducer (Finite State Transducers, FST) algorithm.
下面分别对Ngram模型进行具体介绍。The Ngram model is introduced in detail below.
For a language sequence (for example, a sentence is a word sequence), the probability of the sequence P(w_1, w_2, ..., w_n) can be decomposed into a product of conditional probabilities:
P(w_1, w_2, ..., w_n) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × ... × P(w_n | w_1, ..., w_{n-1}),
where w_1, w_2, ..., w_n are the words in the sequence and P denotes probability.
However, it is difficult to obtain the value of the probability P(w_n | w_1, ..., w_{n-1}) accurately by statistical methods. Therefore, the Ngram model makes the Markov assumption that the probability of the current word is related only to a limited number of N words. When N takes different values, a series of specific Ngram models are obtained. For example, when N = 2, the probability of the current word is related only to the previous word, and the value of P(w_n | w_1, ..., w_{n-1}) degenerates to the value of P(w_n | w_{n-1}), that is,
P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1}), so that P(w_1, w_2, ..., w_n) ≈ P(w_1) × P(w_2 | w_1) × ... × P(w_n | w_{n-1}).
The Ngram model in this case is called a Bigram model; similarly, when N = 3 the Ngram model is called a Trigram model, and when N = 4 it is called a Fourgram model.
在使用过程中,Ngram模型存在一个问题。在应用场景中,某些词语的组合并没有在训练集合中出现,此时Ngram对这些词语组合估计出的概率值为0,在工程上会引发一系列问题。为了避免这种0概率的情况出现,产生了各种平滑算法。There is a problem with the Ngram model during use. In the application scenario, some word combinations do not appear in the training set. At this time, the probability value estimated by Ngram for these word combinations is 0, which will cause a series of problems in engineering. In order to avoid this zero probability situation, various smoothing algorithms have been developed.
平滑算法可以简单理解为,当Ngram模型的概率是0的时候,将一定的权重与(N-1)gram模型的概率的乘积作为(N)gram模型的概率。The smoothing algorithm can be simply understood as, when the probability of the Ngram model is 0, the product of a certain weight and the probability of the (N-1)gram model is used as the probability of the (N)gram model.
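The backoff idea described above can be pictured with the short Python sketch below; the toy corpus and the backoff weight alpha are purely illustrative assumptions, not part of the embodiments.

```python
from collections import Counter

# Toy segmented corpus (an illustrative assumption).
corpus = [["华为", "发布", "手机"], ["华为", "发布", "平板"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total = sum(unigram.values())

def p_unigram(w):
    # Low-order (unigram) probability: stable but less accurate.
    return unigram[w] / total if total else 0.0

def p_bigram_smoothed(prev, w, alpha=0.4):
    # High-order (bigram) probability with a simple backoff: when the bigram
    # count is zero, fall back to alpha * P(w) so the estimate is not zero.
    if bigram[(prev, w)] > 0 and unigram[prev] > 0:
        return bigram[(prev, w)] / unigram[prev]
    return alpha * p_unigram(w)

print(p_bigram_smoothed("华为", "发布"))  # seen bigram   -> 1.0
print(p_bigram_smoothed("手机", "华为"))  # unseen bigram -> 0.4 * 2/6 ≈ 0.133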
下面以具体的示例对Ngram模型进行说明。The Ngram model is described below with a specific example.
Specifically, suppose the word sequence is 诺亚的技术强 ("Noah's technology is strong"). The probability of the word sequence can be decomposed into a product of conditional probabilities, i.e. P(诺,亚,的,技,术,强) = P(诺) × P(亚|诺) × P(的|诺,亚) × P(技|诺,亚,的) × P(术|诺,亚,的,技) × P(强|诺,亚,的,技,术). With the N = 2 gram model, P(诺,亚,的,技,术,强) = P(诺|B) × P(亚|诺) × P(的|亚) × P(技|的) × P(术|技) × P(强|术); with the N = 3 gram model, P(诺,亚,的,技,术,强) = P(诺|A,B) × P(亚|B,诺) × P(的|诺,亚) × P(技|亚,的) × P(术|的,技) × P(强|技,术).
It should be noted that, because there is no other word before 诺, when N = 2 one placeholder word (denoted B in the example above) is automatically added as context during the Ngram computation; similarly, when N = 3, two placeholder words (denoted A and B above) are automatically added as context.
下面对Viterbi算法进行说明。The Viterbi algorithm is described below.
以拼音输入法为例,如图2所示,最下面一行表示拼音节点,上面四行的节点是与拼音节点对应的汉字,这些汉字组成了用户输入的各种可能性。利用Ngram模型可以计算各个汉字节点的概率,由于汉字节点的概率实际是在前面N个汉字节点出现的情况下的条件概率,因此该概率也可以看成是汉字节点之间的路径转移概率。Taking the pinyin input method as an example, as shown in Figure 2, the bottom line represents pinyin nodes, and the upper four lines of nodes are Chinese characters corresponding to pinyin nodes. These Chinese characters constitute various possibilities for user input. The probability of each Chinese character node can be calculated by using the Ngram model. Since the probability of the Chinese character node is actually the conditional probability of the occurrence of the previous N Chinese character nodes, this probability can also be regarded as the path transition probability between Chinese character nodes.
For example, when N = 2, the Ngram model can be used to compute the probabilities P(亚|诺), P(亚|懦), P(亚|糯) and P(亚|挪); these probabilities can also be called the path transition probability from 诺 to 亚, from 懦 to 亚, from 糯 to 亚, and from 挪 to 亚, respectively.
For each of the six pinyins "nuo", "ya", "de", "ji", "shu" and "qiang", there are four choices of Chinese characters, so the number of character combinations is 4×4×4×4×4×4. Using the Viterbi algorithm and the path transition probabilities between characters, the node path with the largest probability can be found; this node path can also be called the optimal path, which in FIG. 2 is 诺亚的技术强.
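For illustration, the following is a minimal sketch of a Viterbi search over such a pinyin lattice. The candidate lists, the hand-made transition table and the start marker "<s>" are assumptions standing in for the Ngram path transition probabilities, not data from the embodiments.

```python
import math

def viterbi(candidates_per_pinyin, transition_prob):
    """candidates_per_pinyin: one list of candidate characters per pinyin.
    transition_prob(prev, cur): path transition probability P(cur | prev),
    e.g. a (smoothed) bigram probability; "<s>" marks the sentence start."""
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {"<s>": (0.0, [])}
    for candidates in candidates_per_pinyin:
        new_best = {}
        for cur in candidates:
            score, path = max(
                ((prev_score + math.log(transition_prob(prev, cur) + 1e-12),
                  prev_path + [cur])
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy example: two pinyins with two candidates each and a hand-made bigram table.
table = {("<s>", "诺"): 0.6, ("<s>", "挪"): 0.4, ("诺", "亚"): 0.7, ("挪", "亚"): 0.2}
print(viterbi([["诺", "挪"], ["亚", "压"]],
              lambda p, c: table.get((p, c), 0.01)))  # -> ['诺', '亚']
```

The search keeps, for every candidate character of the current pinyin, only the best-scoring path ending in that character, which is what makes the lattice search tractable compared with enumerating all 4×4×...×4 combinations.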
下面对预训练语言模型(pertrained language model,PLM)和Bert模型进行说明。The following describes the pretrained language model (pertrained language model, PLM) and Bert model.
The pre-trained language model (pretrained language model, PLM) is an important general-purpose model that has emerged in recent years in the field of natural language processing (NLP), where NLP is the technology that enables computers to understand and process human natural language and is an important technical means for realizing artificial intelligence (AI).
如图3所示,预训练语言模型主要包含三个方面:网络结构,学习范式和(预)训练任务。As shown in Figure 3, the pre-trained language model mainly includes three aspects: network structure, learning paradigm and (pre-)training tasks.
预训练语言模型的网络结构采用了Transformer网络的编码器encoder部分的网络结构,编码器encoder部分包含输入层、self-attention层、Feed-forward层、归一化层。The network structure of the pre-trained language model adopts the network structure of the encoder part of the Transformer network. The encoder part includes an input layer, a self-attention layer, a feed-forward layer, and a normalization layer.
预训练语言模型的种类有很多,其中具有代表性的属于Bert模型。There are many types of pre-trained language models, among which the representative one belongs to the Bert model.
The Bert model builds on the encoder part and adopts the "pre-training + fine-tuning" learning paradigm, i.e. a basic model is first learned with pre-training tasks on a large amount of unlabeled corpus and is then fine-tuned on the target task to obtain the Bert model, where the pre-training tasks mainly refer to the Masked Language Model task and the Next Sentence Prediction task.
下面对本申请实施例的系统架构进行介绍。The system architecture of the embodiment of the present application is introduced below.
如图4所示,本申请实施例的系统架构包括训练阶段和应用阶段,下面以汉语为例对此进行说明。As shown in FIG. 4 , the system architecture of the embodiment of the present application includes a training phase and an application phase, which will be described below using Chinese as an example.
In the training phase, the Chinese character corpus is passed through a tokenizer to obtain a segmented corpus. Next, the Ngram model is trained on the segmented corpus. At the same time, the segmented corpus is converted from Chinese characters into pinyin by a character-to-pinyin converter to obtain a pinyin corpus. Then, the encoder, which is used to encode pinyin into vectors, is trained on the pinyin corpus; since the encoder also uses the encoder part of the Transformer network, which makes it similar to the existing Bert model, and it is used to encode pinyin, it can also be called a pinyin Bert model.
In the application phase, the pinyin Bert model is combined with the Ngram model and with various external resource libraries, such as a basic lexicon, a phrase lexicon, a user lexicon and various domain lexicons (FIG. 4 shows domain words 1, domain words 2 and domain words 3), to obtain an input engine, which is used to prompt corresponding words and sentences in response to the pinyin input by the user.
下面结合图5,先从训练阶段对本申请实施例提供的模型训练方法进行介绍。Referring to FIG. 5 , the model training method provided by the embodiment of the present application will be introduced from the training stage first.
具体地,本申请实施例提供了一种模型训练方法的一个实施例,该实施例可以应用于中文、日文、韩文等多种语言,由于模型训练的过程需要较大的运算量,因此,该实施例通常由服务器执行。Specifically, the embodiment of the present application provides an embodiment of a model training method, which can be applied to multiple languages such as Chinese, Japanese, and Korean. Since the process of model training requires a large amount of computation, this Embodiments are typically performed by a server.
如图5所示,该实施例包括:As shown in Figure 5, this embodiment includes:
步骤101,获取样本字符串序列。 Step 101, acquire a sample character string sequence.
样本字符串序列包括K个样本字符串,其中,K为正整数。The sample character string sequence includes K sample character strings, where K is a positive integer.
In the embodiments of the present application, a character string can be understood as a combination of characters and is a carrier of language information used to generate words and sentences; the words and sentences may be one word or multiple words, and a single character can also be a word.
The above input scenario may be an input scenario in any of multiple languages such as Chinese and Japanese; for different languages, the character string takes different forms. Taking Chinese as an example, a character string may include one pinyin or multiple pinyins, in which case the character string may also be called a pinyin string; for example, the character string may be "nuoyafangzhou".
样本字符串是指作为样本且用于训练的字符串。A sample character string refers to a character string used as a sample and used for training.
每个样本字符串指示一个或多个样本候选词语,该样本候选词语可以是一个字,也可以是多个字。Each sample character string indicates one or more sample candidate words, and the sample candidate words may be one character or multiple characters.
For example, when the sample character string is "nuo", the corresponding sample candidate words may be 诺, 糯, 懦, etc.; when the sample character string is "ya", the corresponding sample candidate words may be 亚, 压, 呀, etc.
获取样本字符串序列的方法有多种,本申请实施例对此不做具体限定。There are many methods for obtaining the sample character string sequence, which are not specifically limited in this embodiment of the present application.
示例性地,步骤101包括:基于K个目标样本词语获取样本字符串序列中的K个样本字符串。Exemplarily, step 101 includes: acquiring K sample character strings in the sample character string sequence based on the K target sample words.
例如,如图4所示,当样本字符串为拼音时,则可以通过字音转换器将目标样本词语从汉字转成拼音,以得到样本字符串。For example, as shown in FIG. 4 , when the sample character string is pinyin, the target sample word can be converted from Chinese characters to pinyin by a phonetic converter to obtain the sample character string.
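As a rough illustration of this step, the sketch below derives sample character strings from target sample words with a toy character-to-pinyin table; in practice a complete converter (or a library such as pypinyin) would be used, and the table here is purely an assumption.

```python
# Toy character-to-pinyin table; a real system would use a complete converter.
HANZI_TO_PINYIN = {"诺": "nuo", "亚": "ya", "方": "fang", "舟": "zhou"}

def words_to_sample_strings(target_sample_words):
    """Turn K target sample words into K sample character strings (pinyin)."""
    return ["".join(HANZI_TO_PINYIN[ch] for ch in word) for word in target_sample_words]

# Each target sample word yields one sample character string of one or more pinyins.
print(words_to_sample_strings(["诺亚", "方舟"]))  # -> ['nuoya', 'fangzhou']
```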
步骤102,根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个第一样本字符串向量对应一个样本字符串。Step 102: Obtain K first sample character string vectors through an encoder according to the sample character string sequence, and each first sample character string vector corresponds to a sample character string.
编码器可以理解为一个深度学习网络模型,编码器的网络结构有多种,本申请实施例对此不做具体限定;具体地,编码器的网络结构可以采用Transformer网络的编码器部分的网络结构,或采用由Transformer网络的编码器部分得到的一系列其他网络的网络结构。The encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can adopt the network structure of the encoder part of the Transformer network , or adopt the network structure of a series of other networks obtained from the encoder part of the Transformer network.
Although the network structure of the encoder in the embodiments of the present application is similar to that of the Bert model and also adopts the network structure of the encoder part of the Transformer network, the two are in fact quite different; several comparisons are made below to explain the differences between the encoder in the embodiments of the present application and the Bert model.
For example, take the case where the sample character string is a pinyin string. As shown in FIG. 6, the model on the left is the Bert model, whose original input is two Chinese sentences, 诺亚方舟 ("Noah's Ark") and 很棒 ("great"), separated by the separator "SEP"; in addition, the original input also includes the token "CLS" used for text classification. The model on the right is the encoder in the embodiments of the present application, whose original input is no longer two Chinese sentences but the sample character string sequence "nuo ya fang zhou hen bang"; the separator "SEP" is not needed, and since the encoder does not need to classify text, the original input of the encoder does not need the token "CLS" either.
作为一种实现方式,步骤102包括:As an implementation, step 102 includes:
Obtain K second position vectors and K second sample character string vectors according to the sample character string sequence; obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
Each second position vector represents the position of a sample character string in the sample character string sequence. Taking the sample character string sequence "nuo ya fang zhou hen bang" as an example, the second position vector corresponding to the sample character string "fang" represents the position of "fang" in the sample character string sequence "nuo ya fang zhou hen bang".
每个第二样本字符串向量表示一个样本字符串,其中,第二样本字符串向量既可以通过随机初始化得到,也可以利用Word2Vector等算法进行预训练得到。Each second sample character string vector represents a sample character string, wherein the second sample character string vector can be obtained through random initialization, or can be obtained through pre-training using an algorithm such as Word2Vector.
It should be noted that the second sample character string vector is different from the first sample character string vector. The second sample character string vector is generated from a single sample character string only, so it contains only the information of that sample character string itself; the first sample character string vector is generated by the encoder, which combines the information of multiple sample character strings in the process, so the first sample character string vector contains not only the information of one sample character string itself but also the information of the other sample character strings.
下面以图6所示的样本字符串为拼音串为例,并结合图7,说明本申请实施例中的编码器与Bert模型的不同。Taking the sample character string shown in FIG. 6 as a pinyin string as an example, and referring to FIG. 7 , the difference between the encoder in the embodiment of the present application and the Bert model will be described below.
Specifically, as shown in FIG. 7, the left side of FIG. 7 shows the direct input of the Bert model (i.e. the input obtained by converting the original input), which contains three embedding layers; corresponding to the original input shown in FIG. 6, these three embedding layers are, from bottom to top, the position embedding layer, the segment embedding layer and the token embedding layer. The position embedding is used to distinguish the different positions of a token in the sequence; the segment embedding is used to distinguish whether a token belongs to the first input Chinese sentence (诺亚方舟) or to the second Chinese sentence (很棒), in preparation for the Next Sentence Prediction task; the token embedding represents the semantics of the token.
In the Bert model, a token is a Chinese character of the Chinese sentences; for example, a token can be the character 诺; a token can also be "SEP" or "CLS".
The right side of FIG. 7 shows the direct input of the encoder in the embodiments of the present application, which includes the position embedding layer and the token embedding layer but no segment embedding layer; the position embedding is used to distinguish the different positions of a token in the sequence, and the token embedding represents the semantics of the token.
在本申请实施例中的编码器中,token是一个拼音或多个拼音,例如,token可以是“nuo”,也可以是“ya”。In the encoder in the embodiment of the present application, the token is a pinyin or multiple pinyins, for example, the token can be "nuo" or "ya".
当token是“nuo”时,位置嵌入position embedding层中的E0则表示“nuo”的位 置向量,标记嵌入token embedding层中的Enuo则表示“nuo”的字符向量。When the token is "nuo", E0 in the position embedding layer represents the position vector of "nuo", and Enuo in the token embedding layer represents the character vector of "nuo".
除此之外,从图7中可以看出,本申请实施例中的编码器各直接输入的长度小于Bert模型各直接输入的长度。In addition, it can be seen from FIG. 7 that the length of each direct input of the encoder in the embodiment of the present application is smaller than the length of each direct input of the Bert model.
It should be noted that the ultimate goal of the Bert model is to perform various document- or sentence-level tasks, such as text classification, reading comprehension and question answering, so the length of the original input of the Bert model has to cover most documents or sentences and is usually set to 512 tokens; correspondingly, the length of the direct input of the Bert model is also 512 tokens (FIG. 7 shows only 9 tokens). The ultimate goal of the encoder in the embodiments of the present application, in contrast, is to serve an input method, i.e. to receive the user's input on a terminal device; in general the user's input is relatively short, so the length of the original input of the encoder does not need to be very long and is usually set to 16 or 32 tokens (FIG. 7 shows only 6 tokens); correspondingly, the length of the direct input of the encoder is also 16 or 32 tokens.
Because the direct input of the encoder is short, fewer parameters are fed into the encoder; moreover, taking pinyin character strings as an example, the total number of pinyins is much smaller than the total number of Chinese characters, so the total number of tokens the encoder needs to handle is small. This reduces the workload of the training process and improves training efficiency.
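The direct input described above can be pictured with the following PyTorch sketch, where the pinyin vocabulary, the maximum length of 32 tokens and the embedding dimension are assumptions used only to make the shapes concrete.

```python
import torch
import torch.nn as nn

PINYIN_VOCAB = ["[PAD]", "nuo", "ya", "fang", "zhou", "hen", "bang"]  # assumed vocabulary
MAX_LEN, D_MODEL = 32, 128                                            # assumed sizes

token_embedding = nn.Embedding(len(PINYIN_VOCAB), D_MODEL)   # one vector per pinyin token
position_embedding = nn.Embedding(MAX_LEN, D_MODEL)          # one vector per position

def direct_input(pinyin_tokens):
    # Direct input = token embedding + position embedding (no segment embedding).
    ids = torch.tensor([[PINYIN_VOCAB.index(t) for t in pinyin_tokens]])
    positions = torch.arange(ids.size(1)).unsqueeze(0)
    return token_embedding(ids) + position_embedding(positions)

x = direct_input(["nuo", "ya", "fang", "zhou", "hen", "bang"])
print(x.shape)  # torch.Size([1, 6, 128])
```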
步骤103,基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率。 Step 103, based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings.
其中,样本候选词语的第二概率表示,根据第一样本字符串向量得到样本候选词语的概率。Wherein, the second probability of the sample candidate word represents the probability of obtaining the sample candidate word according to the first sample character string vector.
计算第二概率的方法有多种,本申请实施例对此不做具体限定。There are multiple methods for calculating the second probability, which are not specifically limited in this embodiment of the present application.
作为一种可实现的方式,步骤103还可以包括:As an implementable manner, step 103 may also include:
基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率。Based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model.
具体地,可以将K个第一样本字符串向量输入概率模型,概率模型则会输出该第二概率。Specifically, the K first sample character string vectors may be input into the probability model, and the probability model will output the second probability.
此时,概率模型和编码器可以看成一个整体,即一个深度学习模型,而编码器可以看成是这个深度学习模型的前半部分,概率模型可以看成是这个深度学习模型的后半部分。At this time, the probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half of the deep learning model.
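Viewed this way, and building on the input sketch above, a minimal PyTorch sketch of the combined model could look as follows; the Transformer hyper-parameters and the candidate-word vocabulary size are assumptions, and the linear-plus-softmax head plays the role of the probability model.

```python
import torch
import torch.nn as nn

class PinyinToWordModel(nn.Module):
    def __init__(self, pinyin_vocab=500, word_vocab=10000, d_model=128, n_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(pinyin_vocab, d_model)
        self.position_embedding = nn.Embedding(64, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # "first half"
        self.prob_head = nn.Linear(d_model, word_vocab)                    # "second half"

    def forward(self, pinyin_ids):                        # (batch, seq_len) of pinyin ids
        pos = torch.arange(pinyin_ids.size(1), device=pinyin_ids.device).unsqueeze(0)
        x = self.token_embedding(pinyin_ids) + self.position_embedding(pos)
        h = self.encoder(x)                               # first (sample) string vectors
        return self.prob_head(h)                          # logits; softmax gives the probabilities

model = PinyinToWordModel()
logits = model(torch.randint(0, 500, (1, 6)))
print(logits.softmax(dim=-1).shape)  # per-position candidate probabilities: (1, 6, 10000)
```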
步骤104,基于第二概率,对编码器进行调整。 Step 104, adjust the encoder based on the second probability.
需要说明的是,基于第二概率对编码器进行调整的方法有多种,本申请实施例对此不做具体限定。It should be noted that there are many methods for adjusting the encoder based on the second probability, which are not specifically limited in this embodiment of the present application.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语,相应地,步骤104包括:调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word, and accordingly, step 104 includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and /or to reduce the second probabilities of other sample candidate words except the target sample word.
For example, the sample character string sequence is "nuoyafangzhouhenbang". For the sample character string "nuo" in it, the corresponding sample candidate words include 诺, 糯, 懦, etc.; letting 诺 be the target sample word, the parameters of the encoder can be adjusted so that the second probability of 诺 increases while the second probabilities of 糯 and 懦 decrease.
In this embodiment, the target sample word acts as the sample label. By adjusting the parameters of the encoder, the second probability of the target sample word is increased as much as possible while the second probabilities of the other sample candidate words are decreased as much as possible; ideally, after adjusting the parameters of the encoder, the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words.
步骤105,基于第二概率,对概率模型进行调整。 Step 105, adjust the probability model based on the second probability.
示例性地,步骤105包括:调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。Exemplarily, step 105 includes: adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
概率模型的参数的调整过程与编码器的参数的调整过程类似,具体参照步骤104的相关说明进行理解。The process of adjusting the parameters of the probability model is similar to the process of adjusting the parameters of the encoder. For details, refer to the related description of step 104 for understanding.
需要说明的是,步骤105是可选的,具体地,在通过概率模型实现步骤103的情况下执行步骤105。It should be noted that step 105 is optional, specifically, step 105 is performed when step 103 is realized by a probability model.
In addition, in the training phase, step 102 to step 105 are executed repeatedly until a stopping condition is met; the embodiments of the present application do not limit the specific condition. For example, the condition may be that the value of the loss function, which can be computed from the second probability, is smaller than a threshold, or that the number of repetitions reaches a preset number.
In the embodiments of the present application, the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors. A first sample character string vector is a representation of a sample character string that fuses the information of the entire sample character string sequence rather than representing the sample character string alone, i.e. it contains more information; therefore, computing the second probability of the target sample word based on the first sample character string vectors and adjusting the encoder and the probability model based on the second probability can improve the accuracy of the trained encoder and probability model, and thus the accuracy of the input method.
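Steps 101 to 105 can be pictured with the training-loop sketch below (using a model such as the one sketched earlier); the loss choice (cross-entropy on the target sample words), the optimizer and the fixed number of passes are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=3, lr=1e-4):
    """model(pinyin_ids) is assumed to return per-position logits over candidate words
    (encoder followed by the probability head); batches yields (pinyin_ids, target_ids)
    tensors of shape (batch, K), where target_ids are the preset target sample words."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Cross-entropy raises the second probability of the target sample word and,
    # through normalisation, lowers that of the other sample candidate words.
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # stop condition: fixed number of passes
        for pinyin_ids, target_ids in batches:    # steps 102-103: forward pass
            logits = model(pinyin_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()                       # steps 104-105: adjust encoder and head
            optimizer.step()
    return model
```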
上面对编码器和概率模型的训练过程进行了说明,除此之外,在采用本申请实施例提供的词句生成方法生成词句的过程中,还可能会用到Ngram模型;所以,下面对Ngram模型的训练过程进行说明。The above describes the training process of the encoder and the probability model. In addition, the Ngram model may also be used in the process of generating words and sentences using the method of generating words and sentences provided by the embodiment of the present application; therefore, the following The training process of the Ngram model will be described.
Ngram模型的训练过程可以理解为,计算词语间的条件概率的过程。The training process of the Ngram model can be understood as the process of calculating the conditional probability between words.
Specifically, taking the pinyin input method as an example, the Chinese corpus is first converted into a Chinese word sequence by a tokenizer, and then the conditional probabilities between words are estimated by counting. For example, for the Chinese corpus 华为公司近期发布最新旗舰手机, the tokenizer produces the Chinese word sequence 华为/公司/近期/发布/最新/旗舰手机.
If N = 2, the conditional probability between words is computed as
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}),
where C(w_{n-1}) is the total number of occurrences of the word w_{n-1} in the whole corpus and C(w_{n-1}, w_n) is the number of times the two words w_{n-1} and w_n occur together in the whole corpus; correspondingly, higher-order conditional probabilities are estimated in the same way from the corresponding counts, i.e.
P(w_n | w_{n-N+1}, ..., w_{n-1}) = C(w_{n-N+1}, ..., w_n) / C(w_{n-N+1}, ..., w_{n-1}).
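A compact sketch of this counting step is shown below; the segmented corpus consists only of the single example sequence from the description and is used purely for illustration.

```python
from collections import Counter

# Segmented corpus; here only the single example sequence from the description.
corpus = [["华为", "公司", "近期", "发布", "最新", "旗舰手机"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_cond(prev, w):
    # P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

print(p_cond("华为", "公司"))  # 1.0 in this one-sentence corpus
```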
下面从应用阶段对本申请实施例提供的词句生成方法进行介绍。The method for generating words and sentences provided by the embodiments of the present application will be introduced below from the application stage.
具体地,本申请实施例提供了一种词句生成方法的一个实施例,该实施例可以应用于中文、日文、韩文等多种语言的输入法系统中;该输入法系统可以部署在终端设备中,也可以部署在云服务器中;当输入法系统部署在云服务器中时,该实施例由云服务器执行,并由云服务器将生成目标词句发送至终端设备,以在终端设备上显示。Specifically, the embodiment of the present application provides an embodiment of a method for generating words and sentences, which can be applied to input method systems in multiple languages such as Chinese, Japanese, and Korean; the input method system can be deployed in terminal devices , can also be deployed in the cloud server; when the input method system is deployed in the cloud server, this embodiment is executed by the cloud server, and the cloud server sends the generated target words to the terminal device for display on the terminal device.
如图8所示,该实施例包括:As shown in Figure 8, this embodiment includes:
步骤201,获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多 个候选词语,其中,M为正整数。 Step 201, obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, wherein, M is a positive integer.
具体地,步骤201可以包括:根据用户的输入得到字符串序列。Specifically, step 201 may include: obtaining a character string sequence according to user input.
由于前文对字符串进行说明,故在此不做详述,具体可参照步骤101的相关说明对步骤201进行理解。Since the character string is described above, it is not described in detail here, and step 201 can be understood by referring to the relevant description of step 101 for details.
In order to prompt the user with more kinds of target words and sentences, in most cases a character string indicates multiple candidate words; in a few cases a character string indicates only one candidate word, for example when the character string is rare and only one word corresponds to it, in which case the character string indicates that single candidate word.
步骤202,根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串。Step 202: Obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings.
示例性地,编码器是基于转换任务训练得到的,其中,转换任务是将样本字符串序列转换成样本词句的任务。Exemplarily, the encoder is trained based on a conversion task, wherein the conversion task is a task of converting sample character string sequences into sample words and sentences.
需要说明的是,基于转换任务训练的过程可以理解为编码器在训练阶段的训练过程,具体可参阅前文训练阶段的相关说明进行理解。It should be noted that the training process based on the conversion task can be understood as the training process of the encoder in the training phase. For details, please refer to the relevant description of the training phase above for understanding.
作为一种可实现的方式,步骤202包括:As an implementable manner, step 202 includes:
根据字符串序列获取M个第一位置向量和M个第二字符串向量,每个第一位置向量表示一个字符串在字符串序列中的位置,每个第二字符串向量表示一个字符串;Acquiring M first position vectors and M second character string vectors according to the string sequence, each first position vector represents the position of a character string in the character string sequence, and each second character string vector represents a character string;
根据M个第一位置向量和M个第二字符串向量,通过编码器,得到多个第一字符串向量。According to the M first position vectors and the M second character string vectors, multiple first character string vectors are obtained through an encoder.
Step 202 is similar to step 102 and can be understood with reference to the description of step 102; the difference is that the number M of first character string vectors in step 202 may differ from the number K of first sample character string vectors.
步骤203,基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率。 Step 203, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained.
作为一种可实现的方式,步骤203包括:As an implementable manner, step 203 includes:
基于M个第一字符串向量,通过概率模型,获取M个字符串指示的每个候选词语的第一概率,概率模型是基于转换任务训练得到的。Based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained through a probability model, and the probability model is obtained based on conversion task training.
其中,转换任务是将样本字符串序列转换成样本词句的任务。Among them, the conversion task is the task of converting sample character string sequences into sample words and sentences.
需要说明的是,基于转换任务训练的过程可以理解为概率模型在训练阶段的训练过程,具体可参阅前文训练阶段的相关说明进行理解。It should be noted that the training process based on the conversion task can be understood as the training process of the probability model in the training phase. For details, please refer to the relevant description of the training phase above for understanding.
Step 203 is similar to step 103 and can be understood with reference to the description of step 103; the difference is that the number M of first character string vectors in step 203 may differ from the number K of first sample character string vectors.
步骤204,基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。 Step 204, based on the first probability, generate target words and sentences, the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
具体地,对于每个字符串,可以基于第一概率从该字符串对应的所有候选词语中选择出一个候选词语;这样,对于M个字符串,则可以选择出M个候选词语,这M个候选词语则可以组成目标词句。Specifically, for each character string, a candidate word can be selected from all candidate words corresponding to the character string based on the first probability; thus, for M character strings, M candidate words can be selected, and these M Candidate words can form target words and sentences.
通常情况下,会从该字符串对应的所有候选词语中选择第一概率最大的候选词语,以生成目标词句。Usually, the candidate word with the highest probability is selected from all the candidate words corresponding to the character string to generate the target word and sentence.
For example, as shown in FIG. 9, each of the character strings "nuo", "ya", "fang", "zhou", "hen" and "bang" indicates three candidate words. For the character string "nuo", the candidate 诺 with the largest first probability is selected; similarly, for the other character strings, the candidates with the largest first probability are 亚, 方, 舟, 很 and 棒, respectively. On this basis, the target sentence 诺亚方舟很棒 ("Noah's Ark is great") can be generated.
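The selection just described (taking, for each character string, the candidate with the largest first probability) can be sketched as follows; the candidate lists and probability values are illustrative assumptions standing in for the probability-model output of FIG. 9.

```python
# For each character string: its candidate words with their first probabilities
# (toy values standing in for the probability-model output).
first_prob = [
    {"诺": 0.7, "挪": 0.2, "糯": 0.1},    # "nuo"
    {"亚": 0.6, "压": 0.3, "呀": 0.1},    # "ya"
    {"方": 0.8, "房": 0.1, "防": 0.1},    # "fang"
    {"舟": 0.7, "周": 0.2, "州": 0.1},    # "zhou"
    {"很": 0.9, "狠": 0.05, "痕": 0.05},  # "hen"
    {"棒": 0.8, "帮": 0.1, "邦": 0.1},    # "bang"
]

# Pick the candidate with the largest first probability for every character string.
target_sentence = "".join(max(cands, key=cands.get) for cands in first_prob)
print(target_sentence)  # 诺亚方舟很棒
```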
步骤205,将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。 Step 205, prompting the target word and sentence as the preferred word and sentence, which is the first word and sentence among the multiple words and sentences prompted by the input method.
输入场景中，终端设备会提示多个词句，本申请实施例将目标词句作为首选词句进行提示；以图1为例，终端设备提示了三个词句，其中，首选词句为：诺亚方舟很棒。In an input scenario, the terminal device prompts multiple words and sentences, and the embodiment of the present application prompts the target words and sentences as the preferred words and sentences; taking Figure 1 as an example, the terminal device prompts three words and sentences, among which the preferred one is "诺亚方舟很棒".
需要说明的是,生成目标词句的方法有多种,除了前文中提及的方法,还存在其他多种方法,下面对此进行介绍。It should be noted that there are many methods for generating target words and sentences. In addition to the methods mentioned above, there are many other methods, which will be introduced below.
作为一种可实现的方式,可以将编码器和Ngram模型结合,以基于编码器输出的第一概率并利用Ngram模型生成目标词句,以提高生成的目标词句的准确性。As an achievable way, the encoder and the Ngram model can be combined to generate target words and sentences based on the first probability output by the encoder and using the Ngram model, so as to improve the accuracy of the generated target words and sentences.
首先,以字符串为拼音为例,对编码器和Ngram模型的结合进行理论分析。First, taking the character string as pinyin as an example, a theoretical analysis is made on the combination of the encoder and the Ngram model.
本申请实施例可以看成是将拼音序列y_1, y_2, …, y_n转成对应的词语序列w_1, w_2, …, w_n（也可以理解为词句），实际是从所有词语序列中选择条件概率P(w_1, w_2, …, w_n | y_1, y_2, …, y_n)最大的词语序列作为目标词句。The embodiment of the present application can be regarded as converting the pinyin sequence y_1, y_2, …, y_n into the corresponding word sequence w_1, w_2, …, w_n (which can also be understood as a sentence); in effect, the word sequence with the largest conditional probability P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) is selected from all word sequences as the target words and sentences.
根据贝叶斯原理,这个条件概率可以做如下的分解和转化:According to Bayesian principle, this conditional probability can be decomposed and transformed as follows:
P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) = P(w_1 | y_1, y_2, …, y_n) × P(w_2 | y_1, y_2, …, y_n, w_1) × P(w_3 | y_1, y_2, …, y_n, w_1, w_2) × … × P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) × … × P(w_n | y_1, y_2, …, y_n, w_1, w_2, …, w_{n-1});
上述公式是将条件概率P(w_1, w_2, …, w_n | y_1, y_2, …, y_n)转化成词语概率P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1})的连乘积的形式。其中，代表词语的条件概率P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1})，可以做进一步分解，如下所示：The above formula converts the conditional probability P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) into the form of a continued product of word probabilities P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}). The conditional probability P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) representing a word can be further decomposed as follows:
P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) = P(w_i | y_1, y_2, …, y_n) × P(w_i | w_1, w_2, …, w_{i-1}) = P(w_i | y_1, y_2, …, y_n) × P(w_i | w_{i-N}, …, w_{i-1});
其中，P(w_i | y_1, y_2, …, y_n)是前文中计算出来的第一概率，P(w_i | w_{i-N}, …, w_{i-1})是Ngram模型计算出来的概率。在上述公式的最后一步推导中，采用了Ngram模型的马尔科夫假设，将概率P(w_i | w_1, w_2, …, w_{i-1})简化为只和w_i的前N个词相关，即将概率P(w_i | w_1, w_2, …, w_{i-1})退化成P(w_i | w_{i-N}, …, w_{i-1})，具体可以表示为：Among them, P(w_i | y_1, y_2, …, y_n) is the first probability calculated above, and P(w_i | w_{i-N}, …, w_{i-1}) is the probability calculated by the Ngram model. In the last step of the above derivation, the Markov assumption of the Ngram model is adopted, and the probability P(w_i | w_1, w_2, …, w_{i-1}) is simplified to depend only on the N words preceding w_i, that is, the probability P(w_i | w_1, w_2, …, w_{i-1}) degenerates into P(w_i | w_{i-N}, …, w_{i-1}), which can be specifically expressed as:
[Formula PCTCN2022104334-appb-000005: the degenerated probability P(w_i | w_{i-N}, …, w_{i-1})]
基于上述分析可知,可以将前文中计算出来的第一概率与Ngram模型计算出来的条件概率结合,以得到更准确的词语的概率,从而可以提示更准确的目标词句。Based on the above analysis, it can be seen that the first probability calculated above can be combined with the conditional probability calculated by the Ngram model to obtain a more accurate probability of words, thereby prompting more accurate target words and sentences.
具体地,步骤204包括:Specifically, step 204 includes:
根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;According to the string sequence, through the Ngram model, the third probability of each candidate word indicated by the M strings is obtained;
基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句。Based on the first probability, the third probability and the Viterbi algorithm, the target words and sentences are generated.
基于前文对Ngram模型的相关说明可知，候选词语的第三概率实际上是在前N个候选词语出现的情况下的条件概率，其中，N的取值可以根据实际需要进行设定，例如，N可以取1，也可以取2。Based on the previous description of the Ngram model, the third probability of a candidate word is actually the conditional probability of that word given the preceding N candidate words, where the value of N can be set according to actual needs; for example, N can be 1 or 2.
基于前文的理论分析可知，对于每个候选词语，可以将该候选词语对应的第一概率和第三概率相乘，以得到组合概率（实际也为条件概率），并利用组合概率和维特比Viterbi算法，生成目标词句。Based on the foregoing theoretical analysis, for each candidate word, the first probability and the third probability corresponding to the candidate word can be multiplied to obtain a combined probability (which is actually also a conditional probability), and the combined probability and the Viterbi algorithm are then used to generate the target words and sentences.
下面结合图10对上述过程进行具体说明。The above process will be specifically described below in conjunction with FIG. 10 .
如图10所示，基于编码模型的输出可以计算得到第一概率，以汉字“方”为例，汉字“方”的第一概率=P(方|nuo,ya,fang,zhou,hen,bang)；基于Ngram模型可以得到第三概率，以汉字“方”为例，假设N=2，汉字“方”的第三概率=P(方|亚)。As shown in Figure 10, the first probability can be calculated based on the output of the encoding model. Taking the Chinese character "方" as an example, the first probability of "方" = P(方|nuo, ya, fang, zhou, hen, bang); the third probability can be obtained based on the Ngram model. Taking "方" as an example and assuming N=2, the third probability of "方" = P(方|亚).
基于此,将第一概率P(方|nuo,ya,fang,zhou,hen,bang)与第三概率P(方|亚)相乘,即可得到汉字“方”的组合概率。Based on this, the combination probability of the Chinese character "fang" can be obtained by multiplying the first probability P(方|nuo, ya, fang, zhou, hen, bang) and the third probability P(方|亚).
采用上述方法可以得到所有汉字的组合概率,再利用Viterbi算法,便可以得到一条概率最大路径,即目标词句。Using the above method, the combination probability of all Chinese characters can be obtained, and then the Viterbi algorithm can be used to obtain a path with the highest probability, that is, the target sentence.
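A compact sketch of this combination is given below. The function names, the bigram (N=2) Ngram interface and any probability values are assumptions for illustration only; log-probabilities are used for numerical stability, which selects exactly the same path as multiplying the combined probabilities directly.

```python
import math

def viterbi_decode(pinyins, first_prob, bigram_prob):
    # pinyins:     input strings, e.g. ["nuo", "ya", "fang", "zhou", "hen", "bang"]
    # first_prob:  first_prob[i][w] = P(w | y_1..y_n), the first probability at position i
    # bigram_prob: bigram_prob(prev, w) = P(w | prev), the Ngram third probability (N=2)
    # Each lattice edge is scored with the combined probability P1(w) * P3(w | prev);
    # the returned sentence is the maximum-probability path.
    # (all probabilities are assumed to be non-zero)
    best = {w: (math.log(p), [w]) for w, p in first_prob[0].items()}
    for i in range(1, len(pinyins)):
        new_best = {}
        for w, p in first_prob[i].items():
            score, path = max(
                (prev_score + math.log(p) + math.log(bigram_prob(prev, w)), prev_path + [w])
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score, path)
        best = new_best
    return "".join(max(best.values())[1])
```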
可以理解的是，编码器和概率模型的训练和下发往往周期比较长，不能及时反映用户输入趋势的变化、用户输入场景的变化，且难以应对网络出现的新词和热词。为此，在应用阶段，可以加入多种类型的词典以弥补编码器和概率模型的不足。It is understandable that the training and delivery of the encoder and the probability model often have a relatively long cycle, so they cannot reflect changes in user input trends and user input scenarios in a timely manner, and it is difficult for them to cope with new words and hot words appearing on the network. For this reason, in the application phase, various types of dictionaries can be added to make up for the shortcomings of the encoder and the probability model.
其中，该词典也可以称为词库，词库可以包括以下至少一种类型的词库：基础词库、短语词库、用户个人词库、热点词库、各种领域词库，领域词库可以为操作系统领域的词库、人工智能技术领域的词库等。The dictionary can also be called a thesaurus, and the thesaurus can include at least one of the following types: a basic thesaurus, a phrase thesaurus, a user's personal thesaurus, a hotspot thesaurus, and various domain thesauruses; a domain thesaurus may be a thesaurus in the field of operating systems, a thesaurus in the field of artificial intelligence technology, and the like.
相应地,作为一种可实现的方式,如图11所示,步骤204包括:Correspondingly, as an achievable manner, as shown in FIG. 11, step 204 includes:
步骤301,从参考词典中获取参考词语。 Step 301, obtain reference words from a reference dictionary.
参考词语包括P个参考字符串指示的P个候选词语,每个参考字符串指示一个候选词语,P个参考字符串包含于字符串序列中,且在字符串序列中的位置连续,其中,P为大于1的整数。The reference words include P candidate words indicated by P reference character strings, each reference character string indicates a candidate word, the P reference character strings are included in the character string sequence, and the positions in the character string sequence are continuous, wherein, P is an integer greater than 1.
本申请实施例对参考词语的数量不做具体限定,参考词语的数量可以为一个,也可以为多个。The embodiment of the present application does not specifically limit the number of reference words, and the number of reference words may be one or multiple.
下面通过具体的示例对参考词语进行说明。The reference words are described below through specific examples.
具体地,参考字符串为“nuoyafangzhouhenbang”;如图12所示,从参考词典中获取的参考词语可以为,参考字符串“nuoyafangzhou”指示的“诺亚方舟”。Specifically, the reference character string is "nuoyafangzhouhenbang"; as shown in FIG. 12 , the reference word obtained from the reference dictionary may be "Noah's Ark" indicated by the reference character string "nuoyafangzhou".
步骤302,基于P个候选词语各自的第一概率,计算参考词语的第四概率。Step 302: Calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words.
需要说明的是,计算第四概率的方法有多种,本申请实施例对此不做具体限定。It should be noted that there are multiple methods for calculating the fourth probability, which are not specifically limited in this embodiment of the present application.
示例性地,可以将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率。Exemplarily, the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word.
例如，仍以图12为例，参考词语“诺亚方舟”的第四概率为：P(诺亚方舟) = (P(诺) × P(亚) × P(方) × P(舟))^(1/4)（公式PCTCN2022104334-appb-000006、PCTCN2022104334-appb-000007）；其中，P(诺)、P(亚)、P(方)和P(舟)分别表示候选词语“诺”、“亚”、“方”和“舟”的第一概率。For example, still taking Figure 12 as an example, the fourth probability of the reference word "诺亚方舟" is P(诺亚方舟) = (P(诺) × P(亚) × P(方) × P(舟))^(1/4) (formulas PCTCN2022104334-appb-000006 and PCTCN2022104334-appb-000007), where P(诺), P(亚), P(方) and P(舟) represent the first probabilities of the candidate words "诺", "亚", "方" and "舟" respectively.
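A small sketch of this geometric-mean computation follows; the function name and the example probabilities are illustrative assumptions.

```python
def fourth_probability(first_probs):
    # Geometric mean of the first probabilities of the P candidate words that
    # make up the reference word (here P = len(first_probs)).
    product = 1.0
    for p in first_probs:
        product *= p
    return product ** (1.0 / len(first_probs))

# e.g. first probabilities of "诺", "亚", "方", "舟" (made-up numbers):
print(fourth_probability([0.7, 0.6, 0.5, 0.8]))   # ≈ 0.64, larger than the product 0.168
```

Because every first probability is below 1, the geometric mean is always at least as large as the plain product, which is exactly the property used below to let the reference word be preferentially selected.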
步骤303,基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。Step 303: Generate target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings.
具体地，可以基于第四概率以及其他字符串指示的每个候选词语的第一概率，计算参考词语与其他字符串指示的候选词语构成的所有第一词语组合的概率；基于各个字符串指示的每个候选词语的第一概率可以得到各个字符串指示的候选词语构成的所有第二词语组合的概率；最后，从所有第一词语组合和所有第二词语组合中选择概率最大的词语组合作为目标词句。Specifically, based on the fourth probability and the first probability of each candidate word indicated by the other character strings, the probabilities of all first word combinations formed by the reference word and the candidate words indicated by the other character strings can be calculated; based on the first probability of each candidate word indicated by each character string, the probabilities of all second word combinations formed by the candidate words indicated by the character strings can be obtained; finally, the word combination with the largest probability is selected from all first word combinations and all second word combinations as the target words and sentences.
以图9为例，参考词语“诺亚方舟”与字符串“hen”指示的三个候选词语、字符串“bang”指示的三个候选词语构成9种第一词语组合，基于第四概率、字符串“hen”指示的三个候选词语的第一概率、字符串“bang”指示的三个候选词语的第一概率可以计算这9种第一词语组合的概率。Taking Figure 9 as an example, the reference word "诺亚方舟" and the three candidate words indicated by the character string "hen" and the three candidate words indicated by the character string "bang" form 9 first word combinations; the probabilities of these 9 first word combinations can be calculated based on the fourth probability, the first probabilities of the three candidate words indicated by "hen" and the first probabilities of the three candidate words indicated by "bang".
而字符串“nuo”、“ya”、“fang”、“zhou”、“hen”和“bang”中的每个字符串都对应三个候选词语，共构成3*3*3*3*3*3种第二词语组合；根据候选词语的第一概率可以计算每种第二词语组合的概率。Since each of the character strings "nuo", "ya", "fang", "zhou", "hen" and "bang" corresponds to three candidate words, a total of 3*3*3*3*3*3 second word combinations are formed; the probability of each second word combination can be calculated according to the first probabilities of the candidate words.
最终,从9种第一词语组合和3*3*3*3*3*3种第二词语组合中选择概率最大的词语组合作为目标词句。Finally, the word combination with the highest probability is selected from the 9 first word combinations and the 3*3*3*3*3*3 second word combinations as the target words and sentences.
可以理解的是,第一词语组合包含于第二词语组合内;由于第一词语组合中包含参考词语,而参考词语包含于参考词典中,所以可以优先选择包含参考词语的词语组合作为目标词句。It can be understood that the first word combination is included in the second word combination; since the first word combination contains reference words, and the reference words are included in the reference dictionary, so the word combination containing the reference words can be preferentially selected as the target word and sentence.
具体地，可以在步骤302中设定相应的第四概率的计算方法，使得得到的参考词语的第四概率大于参考词语中各个候选词语的第一概率的乘积，这样就会使得包含参考词语的词语组合的概率变大，从而可以被优先选择。Specifically, the calculation method of the fourth probability can be set in step 302 so that the obtained fourth probability of the reference word is greater than the product of the first probabilities of the candidate words in the reference word; in this way, the probability of a word combination containing the reference word becomes larger, so that it can be preferentially selected.
例如,将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率,则可以保证参考词语的第四概率大于参考词语中P个候选词语的第一概率的乘积。For example, if the geometric mean of the first probabilities of the P candidate words is used as the fourth probability of the reference word, it can be ensured that the fourth probability of the reference word is greater than the product of the first probabilities of the P candidate words in the reference word.
另外，当参考词语的第四概率大于参考词语中各个候选词语的第一概率的乘积时，在利用第一概率计算第二词语组合的概率时，便可以不计算第一词语组合的概率，仅利用第一概率计算第二词语组合中除第一词语组合外的其他第二词语组合的概率。In addition, when the fourth probability of the reference word is greater than the product of the first probabilities of the candidate words in the reference word, then when the first probabilities are used to calculate the probabilities of the second word combinations, the probabilities of the first word combinations need not be calculated; only the probabilities of the second word combinations other than the first word combinations are calculated using the first probabilities.
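To make the choice between first and second word combinations concrete, here is a brute-force sketch, feasible only for short inputs; in practice a Viterbi-style search would be used. The function name, arguments and the idea of enumerating every combination are illustrative assumptions, not the disclosed implementation.

```python
from itertools import product

def best_with_reference(pinyins, first_prob, ref_span, ref_word, ref_prob):
    # pinyins:    input strings, e.g. ["nuo", "ya", "fang", "zhou", "hen", "bang"]
    # first_prob: first_prob[i] maps each candidate word of string i to its first probability
    # ref_span:   (start, end) positions covered by the reference word, e.g. (0, 4)
    # ref_word:   the reference word from the dictionary, e.g. "诺亚方舟"
    # ref_prob:   its fourth probability (e.g. the geometric mean sketched above)
    start, end = ref_span

    def prob_product(ps):
        out = 1.0
        for p in ps:
            out *= p
        return out

    candidates = []   # (sentence, probability)

    # "second" word combinations: one candidate per string at every position
    for choice in product(*(first_prob[i].items() for i in range(len(pinyins)))):
        candidates.append(("".join(w for w, _ in choice),
                           prob_product(p for _, p in choice)))

    # "first" word combinations: the reference word replaces positions start..end-1
    outside = [first_prob[i].items() for i in range(start)] + \
              [first_prob[i].items() for i in range(end, len(pinyins))]
    for choice in product(*outside):
        prefix, suffix = choice[:start], choice[start:]
        sentence = "".join(w for w, _ in prefix) + ref_word + "".join(w for w, _ in suffix)
        candidates.append((sentence, ref_prob * prob_product(p for _, p in choice)))

    return max(candidates, key=lambda c: c[1])[0]
```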
在该实施例中,通过增加参考词库弥补编码器和概率模型的不足,从而可以提高目标词句的准确率。In this embodiment, the insufficiency of the encoder and the probability model is made up for by adding a reference lexicon, so that the accuracy of the target words and sentences can be improved.
为了进一步提高目标词句的准确率,可以将编码器、参考词库以及Ngram模型三者结合,以生成目标词句。In order to further improve the accuracy of the target words and sentences, the encoder, the reference lexicon and the Ngram model can be combined to generate the target words and sentences.
具体地,作为一种可实现的方式,步骤303包括:Specifically, as an implementable manner, step 303 includes:
通过Ngram模型，获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率以及参考词语的第五概率；Through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word, are obtained;
基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。Based on the first probability, the fourth probability, the fifth probability and the Viterbi algorithm of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, the target words and sentences are generated.
需要说明的是，本申请实施例将参考词语中的所有候选词语看成一个整体，这样，就不需要通过Ngram模型计算参考词语内部的候选词语之间的条件概率，仅需通过Ngram模型计算参考词语的第五概率即可；在计算参考词语的第五概率的过程中，可以计算参考词语中排在第一位的候选词语的第五概率，并将排在第一位的候选词语的第五概率作为参考词语的第五概率。It should be noted that the embodiment of the present application regards all the candidate words in the reference word as a whole, so that the conditional probabilities between the candidate words inside the reference word do not need to be calculated through the Ngram model; only the fifth probability of the reference word needs to be calculated through the Ngram model. In the process of calculating the fifth probability of the reference word, the fifth probability of the candidate word ranked first in the reference word can be calculated and used as the fifth probability of the reference word.
下面通过具体的示例对上述过程进行说明。The above process will be described below through a specific example.
例如，仍以图9为例，参考词语为“诺亚方舟”；通过步骤302可以计算“诺亚方舟”的第四概率，通过步骤203可以计算字符串“hen”指示的三个候选词语的第一概率、字符串“bang”指示的三个候选词语的第一概率；接下来，通过Ngram模型计算参考词语中排在第一位的候选词语“诺”的第五概率，并将“诺”的第五概率作为参考词语“诺亚方舟”的第五概率，通过Ngram模型计算字符串“hen”指示的三个候选词语的第五概率、字符串“bang”指示的三个候选词语的第五概率；最终，基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率，第四概率、第五概率以及Viterbi算法便可以得到概率最大的词语组合，并将概率最大的词语组合作为目标词句。For example, still taking Figure 9 as an example, the reference word is "诺亚方舟"; the fourth probability of "诺亚方舟" can be calculated through step 302, and the first probabilities of the three candidate words indicated by the character string "hen" and the three candidate words indicated by the character string "bang" can be calculated through step 203. Next, the fifth probability of "诺", the candidate word ranked first in the reference word, is calculated through the Ngram model and used as the fifth probability of the reference word "诺亚方舟", and the fifth probabilities of the three candidate words indicated by "hen" and of the three candidate words indicated by "bang" are also calculated through the Ngram model. Finally, based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm, the word combination with the largest probability can be obtained and used as the target words and sentences.
需要说明的是，由于参考词典提供了参考词语，所以在通过Ngram模型计算参考词语后面的候选词语的概率的过程中，若需要用到参考字符串所指示的候选词语，则可以仅考虑参考词语中的候选词语。It should be noted that since the reference dictionary provides the reference word, in the process of calculating, through the Ngram model, the probabilities of the candidate words following the reference word, if the candidate words indicated by the reference character strings are needed, only the candidate words in the reference word need to be considered.
具体地,作为一种可实现的方式,目标字符串为字符串序列中排在P个参考字符串之后的字符串。Specifically, as an achievable manner, the target character string is a character string that is ranked after the P reference character strings in the character string sequence.
目标字符串指示的每个候选词语的第五概率是，在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率，其中，Q为正整数，具体是基于不同的Ngram模型确定的。The fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given that Q candidate words occur, where Q is a positive integer that is determined by the specific Ngram model used.
Q个候选词语包括字符串序列中，排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当Q个字符串包含参考字符串时，Q个候选词语包含参考字符串指示的参考词语中的候选词语。The Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the corresponding candidate words in the reference word indicated by that reference character string.
以图9为例，在计算候选词语“痕”的第五概率的过程中，若Q=1，则“痕”的第五概率表示在候选词语“舟”出现的情况下的条件概率；在计算候选词语“榜”的第五概率的过程中，若Q=2，则“榜”的第五概率表示在候选词语“舟”以及字符串“hen”指示的一个候选词语（例如“恨”）出现的情况下的条件概率。Taking Figure 9 as an example, in the process of calculating the fifth probability of the candidate word "痕", if Q=1, the fifth probability of "痕" represents the conditional probability given that the candidate word "舟" occurs; in the process of calculating the fifth probability of the candidate word "榜", if Q=2, the fifth probability of "榜" represents the conditional probability given that the candidate word "舟" and one candidate word indicated by the character string "hen" (for example "恨") occur.
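The restriction described above, fixing the conditioning words that fall inside the reference word, can be sketched as follows. The helper name and data layout are hypothetical; the sketch only illustrates which Ngram contexts remain to be looked up.

```python
from itertools import product

def allowed_ngram_contexts(pos, q, per_string_candidates, ref_span, ref_chars):
    # pos:  index of the target string (the one whose fifth probability is wanted)
    # q:    number of conditioning positions used by the Ngram model
    # per_string_candidates[i]: candidate words of string i (used outside the reference span)
    # ref_span / ref_chars: positions covered by the reference word and its characters,
    #                       e.g. (0, 4) and ["诺", "亚", "方", "舟"]
    start, end = ref_span
    slots = []
    for i in range(pos - q, pos):
        if start <= i < end:
            slots.append([ref_chars[i - start]])       # fixed by the reference word
        else:
            slots.append(per_string_candidates[i])      # all candidates of that string
    return list(product(*slots))

# With ref_span = (0, 4) and Q = 2, the contexts for the string "hen" (pos = 4)
# collapse to [("方", "舟")] instead of 3 * 3 = 9 candidate pairs.
```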
请参阅图13，本申请实施例还提供了一种词句生成装置，包括：第一获取单元401，用于获取字符串序列，字符串序列包括M个字符串，每个字符串指示一个或多个候选词语，其中，M为正整数；第一编码单元402，用于根据字符串序列，通过编码器，得到M个第一字符串向量，每个第一字符串向量对应M个字符串中的一个字符串；第二获取单元403，用于基于M个第一字符串向量，获取M个字符串指示的每个候选词语的第一概率；生成单元404，用于基于第一概率，生成目标词句，目标词句包括M个目标词语，每个目标词语为每个字符串指示的一个或多个候选词语中的一个。Referring to FIG. 13, an embodiment of the present application further provides a word and sentence generation apparatus, including: a first acquisition unit 401, configured to acquire a character string sequence, where the character string sequence includes M character strings and each character string indicates one or more candidate words, M being a positive integer; a first encoding unit 402, configured to obtain M first character string vectors through an encoder according to the character string sequence, where each first character string vector corresponds to one of the M character strings; a second acquisition unit 403, configured to acquire, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit 404, configured to generate target words and sentences based on the first probability, where the target words and sentences include M target words and each target word is one of the one or more candidate words indicated by each character string.
作为一种可实现的方式，第一编码单元402，用于根据字符串序列获取M个第一位置向量和M个第二字符串向量，每个第一位置向量表示一个字符串在字符串序列中的位置，每个第二字符串向量表示一个字符串；根据M个第一位置向量和M个第二字符串向量，通过编码器，得到多个第一字符串向量。As an implementable manner, the first encoding unit 402 is configured to acquire M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string, and to obtain the multiple first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
作为一种可实现的方式,编码器是基于转换任务训练得到的,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
作为一种可实现的方式，第二获取单元403，用于基于M个第一字符串向量，通过概率模型，获取M个字符串指示的每个候选词语的第一概率，概率模型是基于转换任务训练得到的，转换任务是将样本字符串序列转换成样本词句的任务。As an implementable manner, the second acquisition unit 403 is configured to acquire, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, where the probability model is trained based on a conversion task and the conversion task is the task of converting a sample character string sequence into sample words and sentences.
作为一种可实现的方式,生成单元404,用于根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句。As an achievable manner, the generating unit 404 is configured to obtain the third probability of each candidate word indicated by the M character strings through the Ngram model according to the character string sequence; based on the first probability, the third probability and the Viterbi Viterbi algorithm to generate target words and sentences.
作为一种可实现的方式，生成单元404，用于从参考词典中获取参考词语，参考词语包括P个参考字符串指示的P个候选词语，每个参考字符串指示一个候选词语，P个参考字符串包含于字符串序列中，且在字符串序列中的位置连续，其中，P为大于1的整数；基于P个候选词语各自的第一概率，计算参考词语的第四概率；基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率，生成目标词句。As an implementable manner, the generation unit 404 is configured to acquire a reference word from a reference dictionary, where the reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, and the P reference character strings are included in the character string sequence with consecutive positions therein, P being an integer greater than 1; to calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
作为一种可实现的方式，生成单元404，用于通过Ngram模型，获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率，以及参考词语的第五概率；基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率，第四概率、第五概率以及Viterbi算法，生成目标词句。As an implementable manner, the generation unit 404 is configured to acquire, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, as well as the fifth probability of the reference word, and to generate the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
作为一种可实现的方式，目标字符串为字符串序列中排在P个参考字符串之后的字符串；目标字符串指示的每个候选词语的第五概率是，在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率，Q为正整数；Q个候选词语包括字符串序列中，排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当Q个字符串包含参考字符串时，Q个候选词语包含参考字符串指示的参考词语中的候选词语。As an implementable manner, the target character string is a character string ranked after the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given that Q candidate words occur, Q being a positive integer; the Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate words in the reference word indicated by that reference character string.
作为一种可实现的方式,该装置还包括提示单元405,用于将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。As a practicable manner, the device further includes a prompting unit 405, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例应用阶段的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the application stage of the embodiment of the present application.
请参阅图14，本申请实施例还提供了一种模型训练装置，包括：第三获取单元501，用于获取样本字符串序列，样本字符串序列包括K个样本字符串，每个样本字符串指示一个或多个样本候选词语，其中，K为正整数；第二编码单元502，用于根据样本字符串序列，通过编码器，得到K个第一样本字符串向量，每个样本字符串向量对应一个样本字符串；第四获取单元503，用于基于K个第一样本字符串向量，获取K个样本字符串指示的每个样本候选词语的第二概率；调整单元504，用于基于第二概率，对编码器进行调整。Referring to FIG. 14, an embodiment of the present application further provides a model training apparatus, including: a third acquisition unit 501, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings and each sample character string indicates one or more sample candidate words, K being a positive integer; a second encoding unit 502, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, where each sample character string vector corresponds to one sample character string; a fourth acquisition unit 503, configured to acquire, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit 504, configured to adjust the encoder based on the second probability.
作为一种可实现的方式，第二编码单元502，用于根据样本字符串序列获取K个第二位置向量和K个第二样本字符串向量，每个第二位置向量表示一个样本字符串在样本字符串序列中的位置，每个第二样本字符串向量表示一个样本字符串；根据K个第二位置向量和K个第二样本字符串向量，通过编码器，得到K个第一样本字符串向量。As an implementable manner, the second encoding unit 502 is configured to acquire K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string, and to obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
作为一种可实现的方式，每个样本字符串指示的样本候选词语中包含一个目标样本词语；调整单元504，用于调整编码器的参数，以使得目标样本词语的第二概率增加，和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an implementable manner, the sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
作为一种可实现的方式，第四获取单元503，用于基于K个第一样本字符串向量，通过概率模型，获取K个样本字符串指示的每个样本候选词语的第二概率；调整单元504，还用于基于第二概率，对概率模型进行调整。As an implementable manner, the fourth acquisition unit 503 is configured to acquire, based on the K first sample character string vectors and through a probability model, the second probability of each sample candidate word indicated by the K sample character strings; the adjustment unit 504 is further configured to adjust the probability model based on the second probability.
作为一种可实现的方式，每个样本字符串指示的样本候选词语中包含一个目标样本词语；调整单元504，用于调整概率模型的参数，以使得目标样本词语的第二概率增加，和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an implementable manner, the sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
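As a hedged illustration of how such an adjustment could be realized in practice (one common choice, not necessarily the patent's), a cross-entropy style loss over the second probabilities raises the probability of the target sample word and lowers that of the other sample candidate words; the encoder and probability head are assumed to follow the earlier sketches.

```python
import torch.nn as nn

def training_step(encoder, prob_model, optimizer, sample_string_ids, target_word_ids):
    # sample_string_ids: (batch, K) ids of the K sample strings
    # target_word_ids:   (batch, K) id of the target sample word for each sample string
    # prob_model is assumed to output second probabilities (a softmax over the
    # word vocabulary), as in the earlier sketch.
    probs = prob_model(encoder(sample_string_ids))             # (batch, K, vocab)
    loss = nn.functional.nll_loss(
        probs.clamp_min(1e-9).log().reshape(-1, probs.size(-1)),
        target_word_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # adjusts encoder and prob_model
    return loss.item()
```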
作为一种可实现的方式,第三获取单元501,用于基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an implementable manner, the third acquiring unit 501 is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例训练阶段的描述。Wherein, for the specific implementation, related descriptions and technical effects of the above units, please refer to the description of the training stage in the embodiment of the present application.
请参阅图15,图15是本申请实施例提供的计算机设备的一种结构示意图,该计算机设备可以是终端设备,也可以是服务器,具体用于实现图13对应实施例中词句生成装置的功能或图14对应实施例中模型训练装置的功能;计算机设备1800可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1822(例如,一个或一个以上处理器)和存储器1832,一个或一个以上存储应用程序1842或数据1844的存储介质1830(例如一个或一个以上海量存储设备)。其中,存储器1832和存储介质1830可以是短暂存储或持久存储。存储在存储介质1830的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对计算机设备中的一系列指令操作。更进一步地,中央处理器1822可以设置为与存储介质1830通信,在计算机设备1800上执行存储介质1830中的一系列指令操作。Please refer to Figure 15. Figure 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device can be a terminal device or a server, and is specifically used to implement the function of the word-sentence generation device in the embodiment corresponding to Figure 13 Or Fig. 14 corresponds to the function of the model training device in the embodiment; the computer equipment 1800 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 1822 (for example, one or more than one processor) and memory 1832, and one or more storage media 1830 (such as one or more mass storage devices) for storing application programs 1842 or data 1844. Wherein, the memory 1832 and the storage medium 1830 may be temporary storage or persistent storage. The program stored in the storage medium 1830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device. Furthermore, the central processing unit 1822 may be configured to communicate with the storage medium 1830 , and execute a series of instruction operations in the storage medium 1830 on the computer device 1800 .
计算机设备1800还可以包括一个或一个以上电源1826,一个或一个以上有线或无线 网络接口1850,一个或一个以上输入输出接口1858,和/或,一个或一个以上操作系统1841,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。 Computer device 1800 can also include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input and output interfaces 1858, and/or, one or more operating systems 1841, such as Windows Server™, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
本申请实施例中，中央处理器1822，可以用于执行图13对应实施例中词句生成装置执行的词句生成方法。具体的，中央处理器1822，可以用于：In the embodiment of the present application, the central processing unit 1822 may be used to execute the word and sentence generation method performed by the word and sentence generation apparatus in the embodiment corresponding to FIG. 13. Specifically, the central processing unit 1822 may be used to:
获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语,其中,M为正整数;Obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, where M is a positive integer;
根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;According to the string sequence, M first string vectors are obtained through an encoder, and each first string vector corresponds to one of the M strings;
基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率;Based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained;
基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。Based on the first probability, target words and sentences are generated, and the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
本申请实施例中,中央处理器1822,可以用于执行图14对应实施例中模型训练装置执行的模型训练方法。具体的,中央处理器1822,可以用于:In the embodiment of the present application, the central processing unit 1822 may be used to execute the model training method performed by the model training device in the embodiment corresponding to FIG. 14 . Specifically, the central processing unit 1822 can be used for:
获取样本字符串序列,样本字符串序列包括K个样本字符串,每个样本字符串指示一个或多个样本候选词语,其中,K为正整数;Obtain a sequence of sample strings, the sequence of sample strings includes K sample strings, each sample string indicates one or more sample candidate words, where K is a positive integer;
根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个样本字符串向量对应一个样本字符串;According to the sample character string sequence, K first sample character string vectors are obtained through an encoder, and each sample character string vector corresponds to a sample character string;
基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings;
基于第二概率,对编码器进行调整。Based on the second probability, the encoder is adjusted.
本申请实施例还提供一种芯片,包括一个或多个处理器。所述处理器中的部分或全部用于读取并执行存储器中存储的计算机程序,以执行前述各实施例的方法。The embodiment of the present application also provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the methods of the foregoing embodiments.
可选地，该芯片还包括存储器，该存储器与该处理器通过电路或电线连接。进一步可选地，该芯片还包括通信接口，处理器与该通信接口连接。通信接口用于接收需要处理的数据和/或信息，处理器从该通信接口获取该数据和/或信息，并对该数据和/或信息进行处理，并通过该通信接口输出处理结果。该通信接口可以是输入输出接口。Optionally, the chip further includes a memory, and the memory is connected to the processor through a circuit or wires. Further optionally, the chip also includes a communication interface, and the processor is connected to the communication interface. The communication interface is used to receive data and/or information to be processed; the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
在一些实现方式中,所述一个或多个处理器中还可以有部分处理器是通过专用硬件的方式来实现以上方法中的部分步骤,例如涉及神经网络模型的处理可以由专用神经网络处理器或图形处理器来实现。In some implementations, some of the one or more processors may implement some of the steps in the above method through dedicated hardware, for example, the processing related to the neural network model may be performed by a dedicated neural network processor or graphics processor to achieve.
本申请实施例提供的方法可以由一个芯片实现,也可以由多个芯片协同实现。The method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
本申请实施例还提供了一种计算机存储介质,该计算机存储介质用于储存为上述计算机设备所用的计算机软件指令,其包括用于执行为计算机设备所设计的程序。The embodiment of the present application also provides a computer storage medium, which is used for storing computer software instructions used by the above-mentioned computer equipment, which includes a program for executing a program designed for the computer equipment.
该计算机设备可以如前述图13对应实施例中词句生成装置或图14对应实施例中模型训练装置。The computer device may be the word-sentence generating device in the embodiment corresponding to FIG. 13 or the model training device in the embodiment corresponding to FIG. 14 .
本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现前述各个实施例所示的方法中的流程。The embodiment of the present application also provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the procedures in the methods shown in the foregoing embodiments.
所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统、装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, for the specific working process of the system, apparatus and units described above, reference may be made to the corresponding process in the foregoing method embodiments, and details are not repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device and method can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or May be integrated into another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or part of the contribution to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disc, etc., which can store program codes. .

Claims (23)

  1. 一种词句生成方法,其特征在于,包括:A method for generating words and sentences, characterized in that, comprising:
    获取字符串序列,所述字符串序列包括M个字符串,每个所述字符串指示一个或多个候选词语,其中,M为正整数;Obtain a character string sequence, the character string sequence includes M character strings, each of which indicates one or more candidate words, where M is a positive integer;
    根据所述字符串序列,通过编码器,得到M个第一字符串向量,每个所述第一字符串向量对应所述M个字符串中的一个字符串;According to the string sequence, M first string vectors are obtained through an encoder, and each of the first string vectors corresponds to one of the M strings;
    基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率;Based on the M first character string vectors, obtaining a first probability of each candidate word indicated by the M character strings;
    基于所述第一概率,生成目标词句,所述目标词句包括M个目标词语,每个所述目标词语为所述每个字符串指示的一个或多个候选词语中的一个。Based on the first probability, generate target words and sentences, where the target words and sentences include M target words, each of which is one of the one or more candidate words indicated by each character string.
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述字符串序列,通过编码器,得到M个第一字符串向量包括:The method according to claim 1, wherein said obtaining M first character string vectors through an encoder according to said character string sequence comprises:
    根据所述字符串序列获取M个第一位置向量和M个第二字符串向量，每个所述第一位置向量表示一个所述字符串在所述字符串序列中的位置，每个所述第二字符串向量表示一个所述字符串；Acquiring M first position vectors and M second character string vectors according to the character string sequence, wherein each of the first position vectors represents a position of one of the character strings in the character string sequence, and each of the second character string vectors represents one of the character strings;
    根据所述M个第一位置向量和所述M个第二字符串向量,通过编码器,得到所述多个第一字符串向量。Obtain the multiple first string vectors through an encoder according to the M first position vectors and the M second string vectors.
  3. 根据权利要求1或2所述的方法,其特征在于,所述编码器是基于转换任务训练得到的,所述转换任务是将样本字符串序列转换成样本词句的任务。The method according to claim 1 or 2, wherein the encoder is trained based on a conversion task, and the conversion task is a task of converting sample character string sequences into sample words and sentences.
  4. 根据权利要求1至3中任意一项所述方法,其特征在于,所述基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率包括:The method according to any one of claims 1 to 3, wherein said obtaining the first probability of each candidate word indicated by said M character strings based on said M first character string vectors comprises:
    基于所述M个第一字符串向量，通过概率模型，获取所述M个字符串指示的每个候选词语的第一概率，所述概率模型是基于转换任务训练得到的，所述转换任务是将样本字符串序列转换成样本词句的任务。Obtaining, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, wherein the probability model is trained based on a conversion task, and the conversion task is the task of converting a sample character string sequence into sample words and sentences.
  5. 根据权利要求1至4中任意一项所述方法,其特征在于,所述基于所述第一概率,生成目标词句包括:According to the method described in any one of claims 1 to 4, it is characterized in that said generating target words and sentences based on said first probability comprises:
    根据所述字符串序列,通过Ngram模型,获取所述M个字符串指示的每个候选词语的第三概率;According to the character string sequence, through the Ngram model, obtain the third probability of each candidate word indicated by the M character strings;
    基于所述第一概率,所述第三概率以及维特比Viterbi算法,生成目标词句。Based on the first probability, the third probability and the Viterbi algorithm, target words and sentences are generated.
  6. 根据权利要求1至4中任意一项所述方法,其特征在于,所述基于所述第一概率,生成目标词句包括:According to the method described in any one of claims 1 to 4, it is characterized in that said generating target words and sentences based on said first probability comprises:
    从参考词典中获取参考词语，所述参考词语包括P个参考字符串指示的P个候选词语，每个所述参考字符串指示一个所述候选词语，所述P个参考字符串包含于所述字符串序列中，且在所述字符串序列中的位置连续，其中，P为大于1的整数；Obtaining a reference word from a reference dictionary, wherein the reference word includes P candidate words indicated by P reference character strings, each of the reference character strings indicates one of the candidate words, the P reference character strings are included in the character string sequence and have consecutive positions in the character string sequence, and P is an integer greater than 1;
    基于所述P个候选词语各自的第一概率,计算所述参考词语的第四概率;Based on the respective first probabilities of the P candidate words, calculating a fourth probability of the reference word;
    基于所述第四概率以及所述字符串序列中除所述P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。Based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, a target word and sentence is generated.
  7. 根据权利要求6所述的方法，其特征在于，所述基于所述第四概率以及所述字符串序列中除所述P个参考字符串外的其他字符串指示的每个候选词语的第一概率，生成目标词句包括：The method according to claim 6, characterized in that the generating the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings comprises:
    通过Ngram模型,获取所述字符串序列中除所述P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及所述参考词语的第五概率;Through the Ngram model, obtain the fifth probability of each candidate word indicated by other character strings except the P reference character strings in the character string sequence, and the fifth probability of the reference word;
    基于所述字符串序列中除所述P个参考字符串外其他字符串指示的每个候选词语的第一概率,所述第四概率、所述第五概率以及Viterbi算法,生成目标词句。Based on the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm, a target word and sentence is generated.
  8. 根据权利要求7所述的方法,其特征在于,目标字符串为所述字符串序列中排在所述P个参考字符串之后的字符串;The method according to claim 7, wherein the target character string is a character string after the P reference character strings in the character string sequence;
    所述目标字符串指示的每个候选词语的第五概率是,在Q个候选词语出现的情况下所述目标字符串指示的候选词语出现的条件概率,Q为正整数;The fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs when Q candidate words appear, and Q is a positive integer;
    所述Q个候选词语包括所述字符串序列中，排在所述目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当所述Q个字符串包含所述参考字符串时，所述Q个候选词语包含所述参考字符串指示的所述参考词语中的候选词语。The Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include the reference character string, the Q candidate words include candidate words in the reference word indicated by the reference character string.
  9. 根据权利要求1至8中任意一项所述方法,其特征在于,在所述基于所述第一概率,生成目标词句之后,所述方法还包括:将所述目标词句作为首选词句进行提示,所述首选词句为输入法提示的多个词句中排在第一位的词句。According to the method according to any one of claims 1 to 8, it is characterized in that, after said generating target words and sentences based on said first probability, said method further comprises: prompting said target words and sentences as preferred words and sentences, The preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  10. 根据权利要求1至9中任意一项所述方法,其特征在于,所述字符串包括一个拼音或多个拼音。The method according to any one of claims 1 to 9, wherein the character string includes one pinyin or multiple pinyins.
  11. 一种模型训练方法,其特征在于,包括:A model training method, characterized in that, comprising:
    获取样本字符串序列,所述样本字符串序列包括K个样本字符串,每个所述样本字符串指示一个或多个样本候选词语,其中,K为正整数;Acquiring a sequence of sample character strings, the sequence of sample character strings includes K sample character strings, each of which indicates one or more sample candidate words, where K is a positive integer;
    根据所述样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个所述样本字符串向量对应一个所述样本字符串;Obtain K first sample string vectors through an encoder according to the sample string sequence, each of the sample string vectors corresponds to one of the sample strings;
    基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, acquiring a second probability of each sample candidate word indicated by the K sample character strings;
    基于所述第二概率,对所述编码器进行调整。Based on the second probability, the encoder is adjusted.
  12. 根据权利要求11所述的方法,其特征在于,所述根据样本字符串序列,通过编码器,得到K个第一样本字符串向量包括:The method according to claim 11, wherein said obtaining K first sample character string vectors through an encoder according to the sample character string sequence comprises:
    根据所述样本字符串序列获取K个第二位置向量和K个第二样本字符串向量,每个所述第二位置向量表示一个所述样本字符串在所述样本字符串序列中的位置,每个所述第二样本字符串向量表示一个所述样本字符串;Acquiring K second position vectors and K second sample character string vectors according to the sample character string sequence, each of the second position vectors represents a position of the sample character string in the sample character string sequence, Each of the second sample string vectors represents one of the sample strings;
    根据所述K个第二位置向量和所述K个第二样本字符串向量,通过编码器,得到所述K个第一样本字符串向量。According to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors are obtained through an encoder.
  13. 根据权利要求11或12所述的方法,其特征在于,每个所述样本字符串指示的样本候选词语中包含一个目标样本词语;The method according to claim 11 or 12, wherein each sample candidate word indicated by the sample string contains a target sample word;
    所述基于所述第二概率,对所述编码器进行调整包括:The adjusting the encoder based on the second probability includes:
    调整所述编码器的参数,以使得所述目标样本词语的第二概率增加,和/或以使得除所述目标样本词语外的其他样本候选词语的第二概率降低。Adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of other sample candidate words except the target sample word decrease.
  14. 根据权利要求11至13中任意一项所述方法，其特征在于，所述基于所述K个第一样本字符串向量，获取所述K个样本字符串指示的每个样本候选词语的第二概率包括：The method according to any one of claims 11 to 13, characterized in that the obtaining, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings comprises:
    基于所述K个第一样本字符串向量,通过概率模型,获取所述K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model;
    在所述基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率之后,所述方法还包括:After obtaining the second probability of each sample candidate word indicated by the K sample strings based on the K first sample string vectors, the method further includes:
    基于所述第二概率,对所述概率模型进行调整。Based on the second probability, the probability model is adjusted.
  15. 根据权利要求14所述的方法,其特征在于,每个所述样本字符串指示的样本候选词语中包含一个目标样本词语;The method according to claim 14, wherein each sample candidate word indicated by the sample string contains a target sample word;
    所述基于所述第二概率,对所述概率模型进行调整包括:The adjusting the probability model based on the second probability includes:
    调整所述概率模型的参数,以使得所述目标样本词语的第二概率增加,和/或以使得除所述目标样本词语外的其他样本候选词语的第二概率降低。Adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
  16. 根据权利要求11至15中任意一项所述方法,其特征在于,所述获取样本字符串序列包括:The method according to any one of claims 11 to 15, wherein said acquiring a sample character string sequence comprises:
    基于K个目标样本词语获取所述样本字符串序列中的K个样本字符串。K sample character strings in the sequence of sample character strings are obtained based on the K target sample words.
  17. 根据权利要求11至16中任意一项所述方法,其特征在于,所述样本字符串包括一个拼音或多个拼音。The method according to any one of claims 11 to 16, wherein the sample character string includes one pinyin or multiple pinyins.
  18. 一种词句生成装置,其特征在于,包括:A word and sentence generating device is characterized in that, comprising:
    第一获取单元,用于获取字符串序列,所述字符串序列包括M个字符串,每个所述字 符串指示一个或多个候选词语,其中,M为正整数;The first obtaining unit is used to obtain a sequence of character strings, the sequence of character strings includes M character strings, and each of the character strings indicates one or more candidate words, wherein M is a positive integer;
    第一编码单元,用于根据所述字符串序列,通过编码器,得到M个第一字符串向量,每个所述第一字符串向量对应所述M个字符串中的一个字符串;The first encoding unit is configured to obtain M first character string vectors through an encoder according to the character string sequence, and each of the first character string vectors corresponds to one of the M character strings;
    第二获取单元,用于基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率;A second acquiring unit, configured to acquire the first probability of each candidate word indicated by the M character strings based on the M first character string vectors;
    生成单元,用于基于所述第一概率,生成目标词句,所述目标词句包括M个目标词语,每个所述目标词语为所述每个字符串指示的一个或多个候选词语中的一个。A generation unit, configured to generate target words and sentences based on the first probability, the target words and sentences include M target words, and each of the target words is one of the one or more candidate words indicated by each character string .
  19. 一种模型训练装置,其特征在于,包括:A model training device, characterized in that it comprises:
    第三获取单元,用于获取样本字符串序列,所述样本字符串序列包括K个样本字符串,每个所述样本字符串指示一个或多个样本候选词语,其中,K为正整数;A third acquiring unit, configured to acquire a sequence of sample character strings, the sequence of sample character strings includes K sample character strings, each of which indicates one or more sample candidate words, where K is a positive integer;
    第二编码单元,用于根据所述样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个所述样本字符串向量对应一个所述样本字符串;The second encoding unit is configured to obtain K first sample string vectors through an encoder according to the sample string sequence, and each of the sample string vectors corresponds to one of the sample strings;
    第四获取单元,用于基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率;A fourth acquiring unit, configured to acquire the second probability of each sample candidate word indicated by the K sample character strings based on the K first sample character string vectors;
    调整单元,用于基于所述第二概率,对所述编码器进行调整。An adjusting unit, configured to adjust the encoder based on the second probability.
  20. 一种计算机设备,其特征在于,包括:一个或多个处理器和存储器;其中,所述存储器中存储有计算机可读指令;A computer device, characterized by comprising: one or more processors and a memory; wherein computer-readable instructions are stored in the memory;
    所述一个或多个处理器读取所述计算机可读指令，以使所述计算机设备实现如权利要求1至10中任一项所述的方法。The one or more processors read the computer-readable instructions to cause the computer device to implement the method according to any one of claims 1 to 10.
  21. 一种训练设备,其特征在于,包括:一个或多个处理器和存储器;其中,所述存储器中存储有计算机可读指令;A training device, characterized in that it comprises: one or more processors and a memory; wherein computer-readable instructions are stored in the memory;
    所述一个或多个处理器读取所述计算机可读指令，以使所述训练设备实现如权利要求11至17中任一项所述的方法。The one or more processors read the computer-readable instructions to cause the training device to implement the method according to any one of claims 11 to 17.
  22. 一种计算机可读存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算机上运行时,使得所述计算机执行如权利要求1至17中任一项所述的方法。A computer-readable storage medium, characterized in that it includes computer-readable instructions, and when the computer-readable instructions are run on a computer, the computer executes the method according to any one of claims 1 to 17 .
  23. 一种计算机程序产品,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算机上运行时,使得所述计算机执行如权利要求1至17中任一项所述的方法。A computer program product, characterized by comprising computer-readable instructions, which, when the computer-readable instructions are run on a computer, cause the computer to execute the method according to any one of claims 1 to 17.
PCT/CN2022/104334 2021-07-08 2022-07-07 Word or sentence generation method, model training method and related device WO2023280265A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110775982.1A CN113655893A (en) 2021-07-08 2021-07-08 Word and sentence generation method, model training method and related equipment
CN202110775982.1 2021-07-08

Publications (1)

Publication Number Publication Date
WO2023280265A1 (en)

Family

ID=78489258

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104334 WO2023280265A1 (en) 2021-07-08 2022-07-07 Word or sentence generation method, model training method and related device

Country Status (2)

Country Link
CN (1) CN113655893A (en)
WO (1) WO2023280265A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655893A (en) * 2021-07-08 2021-11-16 华为技术有限公司 Word and sentence generation method, model training method and related equipment
CN116306612A (en) * 2021-12-21 2023-06-23 华为技术有限公司 Word and sentence generation method and related equipment
CN115114915B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Phrase identification method, device, equipment and medium
CN117408650B (en) * 2023-12-15 2024-03-08 辽宁省网联数字科技产业有限公司 Digital bidding document making and evaluating system based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678560A (en) * 2017-08-31 2018-02-09 科大讯飞股份有限公司 The candidate result generation method and device of input method, storage medium, electronic equipment
CN109739370A (en) * 2019-01-10 2019-05-10 北京帝派智能科技有限公司 A kind of language model training method, method for inputting pinyin and device
CN110286778A (en) * 2019-06-27 2019-09-27 北京金山安全软件有限公司 Chinese deep learning input method and device and electronic equipment
CN110874145A (en) * 2018-08-30 2020-03-10 北京搜狗科技发展有限公司 Input method and device and electronic equipment
US20200335096A1 (en) * 2018-04-19 2020-10-22 Boe Technology Group Co., Ltd. Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog
CN111967248A (en) * 2020-07-09 2020-11-20 深圳价值在线信息科技股份有限公司 Pinyin identification method and device, terminal equipment and computer readable storage medium
CN113655893A (en) * 2021-07-08 2021-11-16 华为技术有限公司 Word and sentence generation method, model training method and related equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071342A (en) * 2007-06-01 2007-11-14 腾讯科技(深圳)有限公司 Method for providing candidate whole sentence in input method and word input system
CN110569505B (en) * 2019-09-04 2023-07-28 平顶山学院 Text input method and device
CN110673748B (en) * 2019-09-27 2023-04-28 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method
CN112506359B (en) * 2020-12-21 2023-07-21 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method and electronic equipment

Also Published As

Publication number Publication date
CN113655893A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
WO2023280265A1 (en) Word or sentence generation method, model training method and related device
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11928439B2 (en) Translation method, target information determining method, related apparatus, and storage medium
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
JP6916264B2 (en) Real-time speech recognition methods based on disconnection attention, devices, equipment and computer readable storage media
CN106910497B (en) Chinese word pronunciation prediction method and device
US10719668B2 (en) System for machine translation
WO2022121166A1 (en) Method, apparatus and device for predicting heteronym pronunciation, and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
WO2019154210A1 (en) Machine translation method and device, and computer-readable storage medium
Tran et al. A hierarchical neural model for learning sequences of dialogue acts
WO2022121251A1 (en) Method and apparatus for training text processing model, computer device and storage medium
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
KR101627428B1 (en) Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
US10152298B1 (en) Confidence estimation based on frequency
CN111581970B (en) Text recognition method, device and storage medium for network context
CN113053367B (en) Speech recognition method, speech recognition model training method and device
JP7337979B2 (en) Model training method and apparatus, text prediction method and apparatus, electronic device, computer readable storage medium, and computer program
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN112818118A (en) Reverse translation-based Chinese humor classification model
Hsueh et al. A Task-oriented Chatbot Based on LSTM and Reinforcement Learning
CN113823265A (en) Voice recognition method and device and computer equipment
CN113362809B (en) Voice recognition method and device and electronic equipment
CN112466282B (en) Speech recognition system and method oriented to aerospace professional field
CN115129819A (en) Text abstract model production method and device, equipment and medium thereof

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22837010

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE