WO2023280265A1 - Word or sentence generation method, model training method and related device - Google Patents


Info

Publication number
WO2023280265A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
probability
character string
words
word
Prior art date
Application number
PCT/CN2022/104334
Other languages
French (fr)
Chinese (zh)
Inventor
肖镜辉
刘群
吴海腾
谢武锋
熊元峰
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023280265A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods
    • G06F 3/0237 Character input methods using prediction or retrieval techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods

Definitions

  • the present application relates to the technical field of input methods, in particular to a method for generating words and sentences, a method for training models and related equipment.
  • the input method editor is an essential application on client devices and is widely used on desktop computers, notebooks, mobile phones, tablets, smart TVs, in-vehicle computers and other devices; the user's daily activities, such as searching for places, finding restaurants, chatting and making friends, and travel planning, are largely carried out through input behaviors, so the data of the input method editor can be used to accurately characterize users. Input method editors therefore have great strategic significance in the Internet field.
  • during input, the input method editor generates words and sentences (a word or a sentence) and prompts them for the user to choose.
  • the accuracy of the generated words and sentences directly affects the accuracy and user experience of the input method editor; therefore, a method that can accurately generate words and sentences is needed.
  • Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment.
  • the method can improve the accuracy of generated words and sentences.
  • the first aspect of the embodiment of the present application provides a method for generating words and sentences, which can be applied to terminal devices or cloud servers, and specifically includes: obtaining a character string sequence, where the character string sequence includes M character strings and each character string indicates one or more candidate words; a character string can be understood as a combination of characters, which is a carrier of language information, carries pronunciation information, and is used to generate words or sentences; the form of a character string differs for different languages; taking Chinese as an example, a character string can include one pinyin or multiple pinyins, and M is a positive integer; and obtaining, through an encoder and according to the character string sequence, M first character string vectors, where each first character string vector corresponds to one of the M character strings; the encoder can be understood as a deep learning network model, and the network structure of the encoder is not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder may adopt the network structure of the encoder part of the Transformer network, or the network structure of a series of other networks derived from the encoder part of the Transformer network.
  • the character string sequence is encoded by the encoder to obtain the first character string vectors; each first character string vector is a representation of a character string that fuses the information of the entire character string sequence, not just the character string itself, that is, the first character string vector contains more information; therefore, calculating the first probability of candidate words based on the first character string vectors and generating the target words and sentences based on the first probability can improve the accuracy of the generated target words and sentences, thereby improving the accuracy of the input method.
  • obtaining M first character string vectors through the encoder according to the character string sequence includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and obtaining, through the encoder and according to the M first position vectors and the M second character string vectors, the M first character string vectors.
  • the Bert model needs to encode a word based on the position vector of the word, the vector of the word itself, a vector used to distinguish whether the word is in the first sentence or the second sentence, and vectors related to the separator "SEP" and the tag "CLS".
  • in contrast, in the embodiment of the present application, the first character string vector can be obtained by the encoder according to only two vectors, namely the first position vector and the second character string vector of the character string; therefore, the encoder in the embodiment of the present application needs to process fewer vectors and has higher encoding efficiency, thereby improving the response speed of the input method.
  • the encoder is trained based on the conversion task, where the conversion task is the task of converting a sequence of sample strings into sample words and sentences.
  • the encoder is used to convert the string into the first string vector, and then the first string vector is used to obtain the target words and sentences.
  • in the application phase, the function of the encoder is similar to its function during training based on the conversion task; therefore, using the encoder trained based on the conversion task to encode the character string sequence can improve the encoding accuracy of the encoder, thereby improving the accuracy of the input method.
  • obtaining the first probability of each candidate word indicated by the M character strings includes: obtaining, through a probability model and based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings.
  • the probability model is trained based on the conversion task.
  • the probability model and the encoder can be regarded as a whole, that is, a deep learning model: the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half; among them, the conversion task is the task of converting the sample character string sequence into sample words and sentences.
  • Obtaining the first probability of candidate words through the probability model can improve the accuracy of the first probability; and, similar to the encoder, the function of the probability model in the application phase is similar to its function during training based on the conversion task, so using the probability model trained based on the conversion task to calculate the first probability can improve the accuracy of the first probability, thereby improving the accuracy of the input method.
  • generating the target words and sentences includes: obtaining, through the Ngram model and according to the character string sequence, the third probability of each candidate word indicated by the M character strings, wherein, for any candidate word, the third probability of the candidate word represents the conditional probability of the candidate word occurring given that one or more preceding candidate words have occurred; and generating the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • the Viterbi algorithm is a dynamic programming algorithm, which is used to find the Viterbi path that is most likely to produce a sequence of observed events.
  • the Viterbi path can also be called the optimal path, and the Viterbi algorithm can also be called a finite state transducer (Finite State Transducers, FST) algorithm.
  • the first probability of a candidate word can be understood as the conditional probability of the candidate word given the character string sequence, and the third probability of the candidate word can be understood as the conditional probability of the current candidate word given other candidate words; so, in the process of generating the target words and sentences, both the first probability of candidate words and the third probability of candidate words calculated by the Ngram model are considered, which is conducive to generating target words and sentences with higher accuracy.
  • generating the target words and sentences includes: obtaining reference words from a reference dictionary, where the reference dictionary may include at least one of the following types of thesaurus: a basic thesaurus, a phrase thesaurus, a user's personal thesaurus, a hotspot thesaurus, and various domain thesauruses; the number of reference words may be one or more, and a reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are included in the character string sequence and their positions in the character string sequence are consecutive, where P is an integer greater than 1; calculating, based on the first probabilities of the P candidate words, the fourth probability of the reference word, where the fourth probability indicates the possibility of the user selecting the reference word when inputting the P reference character strings; there are many ways to calculate the fourth probability of a reference word, for example, the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word; and generating the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • the reference dictionary can provide words from various scenarios, new words or hot words as reference words to assist in the generation of the target words and sentences, which can make up for the shortcomings of the encoder and the probability model and improve the accuracy of the target words and sentences.
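  • a minimal sketch of the geometric-mean option mentioned above for the fourth probability of a reference word; the function name and example probability values are purely illustrative, not taken from the embodiment:

```python
import math

def fourth_probability(first_probs):
    """Fourth probability of a reference word: geometric mean of the first
    probabilities of the P candidate words that make up the reference word."""
    assert len(first_probs) > 1          # P is an integer greater than 1
    log_sum = sum(math.log(p) for p in first_probs)
    return math.exp(log_sum / len(first_probs))

# e.g. a reference word made of two candidate words with first probabilities 0.20 and 0.45
print(fourth_probability([0.20, 0.45]))  # ≈ 0.3
```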
  • generating the target words and sentences includes: obtaining, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and generating the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
  • the embodiment of the present application regards all the candidate words in a reference word as a whole, so that it is not necessary to calculate the conditional probabilities between the candidate words inside the reference word through the Ngram model, and only the fifth probability of the reference word needs to be calculated through the Ngram model; in the process of calculating the fifth probability of the reference word, the fifth probability of the first candidate word in the reference word can be calculated and used as the fifth probability of the reference word.
  • the target character string is a character string after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability of the candidate word indicated by the target character string occurring given that Q candidate words have occurred, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word, within the reference word, indicated by that reference character string.
  • the method further includes: prompting the target words and sentences as the preferred words and sentences, where the preferred words and sentences are the first words and sentences among the multiple words and sentences prompted by the input method.
  • generally, the terminal device will prompt multiple words and sentences; prompting the target words and sentences as the preferred words and sentences allows the target words and sentences, which the user is most likely to choose, to be preferentially prompted to the user, so as to improve the user's input efficiency.
  • the character string includes one pinyin or multiple pinyins.
  • this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
  • the second aspect of the embodiment of the present application provides a model training method, including: obtaining a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; obtaining, through an encoder and according to the sample character string sequence, K first sample character string vectors, where each first sample character string vector corresponds to one sample character string; obtaining, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and adjusting the encoder based on the second probability.
  • the character string, the encoder, the character string sequence and the second probability in the second aspect can be understood with reference to the descriptions of the character string, the encoder, the character string sequence and the first probability in the first aspect.
  • the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors; each first sample character string vector is a representation of a sample character string that fuses the information of the entire sample character string sequence, rather than representing only the sample character string itself, that is, the first sample character string vector contains more information; so, calculating the second probability of the target sample word based on the first sample character string vectors and adjusting the encoder based on the second probability can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
  • obtaining K first sample character string vectors through the encoder according to the sample character string sequence includes: obtaining K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and obtaining, through the encoder and according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors.
  • in the embodiment of the present application, the first sample character string vector can be obtained through the encoder according to only the second position vector and the second sample character string vector; the Bert model, in contrast, needs not only the position vector of the word and the vector of the word itself, but also the vector used to distinguish whether the word is in the first sentence or the second sentence and the vectors related to the separator "SEP" and the tag "CLS"; therefore, the encoder in the embodiment of the present application needs to process fewer vectors and has higher encoding efficiency, which can improve the training efficiency.
  • the sample candidate words indicated by each sample character string contain a target sample word, where the target sample word is equivalent to a sample label; correspondingly, adjusting the encoder based on the second probability includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • for example, if the sample character string sequence is "nuoyafangzhouhenbang", then for the sample character string "nuo", the corresponding sample candidate words include "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; if "诺" is the target sample word, then by adjusting the parameters of the encoder, the second probability of "诺" can be increased while the second probabilities of "糯" and "懦" are decreased.
  • the target sample words are preset, and during the training process, by adjusting the parameters of the encoder, the second probability of the target sample word is increased and/or the second probabilities of the sample candidate words other than the target sample word are decreased, so that the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words, thereby realizing the training of the encoder.
  • obtaining the second probability of each sample candidate word indicated by the K sample character strings includes: obtaining, through a probability model and based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; correspondingly, after the second probability of each sample candidate word indicated by the K sample character strings is obtained based on the K first sample character string vectors, the method further includes: adjusting the probability model based on the second probability.
  • Obtaining the second probability of the sample candidate words through the probability model can improve the accuracy of the second probability; and adjusting the probability model based on the second probability can improve the accuracy of the second probability output by the probability model.
  • the sample candidate words indicated by each sample character string contain a target sample word; adjusting the probability model based on the second probability includes: adjusting the parameters of the probability model so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the target sample words are preset, and during the training process, by adjusting the parameters of the probability model, the second probability of the target sample word is increased and/or the second probabilities of the sample candidate words other than the target sample word are decreased, so that the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words, thereby realizing the training of the probability model.
  • obtaining the sample character string sequence includes: obtaining K sample character strings in the sample character string sequence based on K target sample words.
  • Obtaining sample character strings based on target sample words can improve the efficiency of obtaining sample character strings.
  • the sample character string includes one pinyin or multiple pinyins.
  • this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
  • the third aspect of the embodiment of the present application provides a word and sentence generation device, including: a first acquisition unit, configured to acquire a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer; a first encoding unit, configured to obtain M first character string vectors through an encoder according to the character string sequence, where each first character string vector corresponds to one of the M character strings; a second acquisition unit, configured to obtain, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit, configured to generate target words and sentences based on the first probability, where the target words and sentences include M target words, and each target word is one of the one or more candidate words indicated by a corresponding character string.
  • the first encoding unit is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain, through the encoder and according to the M first position vectors and the M second character string vectors, the M first character string vectors.
  • the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
  • the second acquisition unit is configured to obtain, through a probability model and based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings, and the probability model is trained based on the conversion task; the conversion task is the task of converting the sample character string sequence into sample words and sentences.
  • the generation unit is configured to obtain, through the Ngram model and according to the character string sequence, the third probability of each candidate word indicated by the M character strings; and to generate the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • the generation unit is configured to obtain reference words from the reference dictionary, where a reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are included in the character string sequence and their positions in the character string sequence are consecutive, and P is an integer greater than 1; to calculate the fourth probability of the reference word based on the first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • the generation unit is configured to obtain, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings and the fifth probability of the reference word; and to generate the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
  • the target character string is a character string after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability of the candidate word indicated by the target character string occurring given that Q candidate words have occurred, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word, within the reference word, indicated by that reference character string.
  • the device further includes a prompting unit, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  • the character string includes one pinyin or multiple pinyins.
  • the fourth aspect of the embodiment of the present application provides a model training device, including: a third acquisition unit, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; a second encoding unit, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, where each first sample character string vector corresponds to a sample character string; a fourth acquisition unit, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit, configured to adjust the encoder based on the second probability.
  • the second encoding unit is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain, through the encoder and according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors.
  • the sample candidate words indicated by each sample character string contain a target sample word; the adjustment unit is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the fourth acquisition unit is configured to obtain, through a probability model and based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; the adjustment unit is also configured to adjust the probability model based on the second probability.
  • the sample candidate words indicated by each sample character string contain a target sample word; the adjustment unit is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases, and/or the second probabilities of the sample candidate words other than the target sample word decrease.
  • the third acquiring unit is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
  • the sample character string includes one pinyin or multiple pinyins.
  • the fifth aspect of the embodiment of the present application provides a computer device, including: one or more processors and a memory, where computer-readable instructions are stored in the memory; the one or more processors read the computer-readable instructions to cause the computer device to implement the method in any implementation manner of the first aspect.
  • the sixth aspect of the embodiment of the present application provides a training device, including: one or more processors and a memory, where computer-readable instructions are stored in the memory; the one or more processors read the computer-readable instructions to cause the training device to implement the method in any implementation manner of the second aspect.
  • the seventh aspect of the embodiment of the present application provides a computer-readable storage medium including computer-readable instructions; when the computer-readable instructions are run on a computer, the computer is caused to execute the method in any implementation manner of the first aspect or the second aspect.
  • the eighth aspect of the embodiment of the present application provides a chip including one or more processors; some or all of the processors are configured to read and execute the computer program stored in the memory, so as to execute the method in any possible implementation manner of the first aspect or the second aspect above.
  • optionally, the chip includes a memory, and the processor is connected to the memory through a circuit or wires; further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors can also implement some steps in the above method through dedicated hardware.
  • for example, the processing related to the neural network model can be performed by a dedicated neural network processor or a graphics processor.
  • the method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
  • the ninth aspect of the embodiments of the present application provides a computer program product, where the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the method in any one of the above first aspect or second aspect.
  • FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application
  • Fig. 2 is the schematic diagram of word sequence in the embodiment of the present application.
  • Fig. 3 is the schematic diagram of pre-training language model
  • FIG. 4 is a schematic diagram of the system architecture of the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of the model training method provided by the embodiment of the present application.
  • Fig. 6 is a comparative schematic diagram of the original input of the encoder and the Bert model in the embodiment of the present application;
  • Fig. 7 is a comparative schematic diagram of the direct input of the encoder and the Bert model in the embodiment of the present application;
  • FIG. 8 is a schematic diagram of an embodiment of a method for generating words and sentences provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an embodiment of candidate words in the embodiment of the present application.
  • Fig. 10 is a schematic diagram of the combination of the first probability and the third probability in the embodiment of the present application.
  • FIG. 11 is a schematic diagram of an embodiment of generating target words and sentences in the embodiment of the present application.
  • FIG. 12 is a schematic diagram of the use of the reference dictionary in the embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a device for generating words and sentences provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a model training device provided in an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment.
  • the method can improve the accuracy of generated words and sentences, thereby improving the accuracy of input methods and user experience.
  • the embodiment of the present application can be applied to the input scenario shown in FIG. 1 .
  • the user can input a character string on the terminal device.
  • the input method editor (Input Method Editor, IME) deployed inside the terminal device receives the character string entered by the user, generates corresponding words and sentences based on the character string, and then prompts the words and sentences to the user.
  • a character string can be understood as a combination of characters, which is a carrier of language information and is used to generate words and sentences; the words and sentences may be one word or multiple words, and a single word may also constitute the words and sentences.
  • the above-mentioned input scenario can be an input scenario of multiple languages such as Chinese, Japanese and Korean; the form of the character string differs for different languages; taking Chinese as an example, the character string can include one pinyin or multiple pinyins; specifically, as shown in Figure 1, when the character string nuoyafangzhou is input, the words and sentences suggested by the input method editor are "Noah's Ark is great", "Noah's Ark" and "Noah".
  • the terminal device may be a desktop computer, a notebook computer, a tablet computer, a smart phone, or a smart TV.
  • the terminal device may also be any other device that can deploy an input method editor such as a vehicle computer.
  • the suggested words and sentences include "Noah's Ark is great"; it can be seen that the suggested words and sentences are relatively accurate, which can obviously improve the user's input efficiency and user experience.
  • an embodiment of the present application provides a method for generating words and sentences, which uses an encoder to encode a character string (such as pinyin) input by the user into a character string vector, and then generates a target based on the character string vector phrases to improve the accuracy of the generated phrases.
  • Input method preferred word: when the user enters a character string, the input method editor provides the user with a candidate list, which is used to prompt words and sentences for the user; the first entry in the candidate list is called the input method preferred word.
  • Transformer network structure: a deep neural network structure including an input layer, a self-attention layer, a feed-forward layer, a normalization layer and other substructures.
  • Bert model: a model with a Transformer network structure; on the basis of the Transformer network structure, it proposes a "pre-training + fine-tuning" learning paradigm and designs two pre-training tasks, Masked Language Model and Next Sentence Prediction.
  • Ngram model: a model widely used in Chinese input method tasks.
  • Zero probability problem: in the process of using the Ngram model, in some cases the calculated probability value is zero, and zero-valued probabilities cause many problems in engineering implementation; for example, because of zero probabilities, it is impossible to compare probabilities by size, and only random results can be returned.
  • Smoothing algorithm: an algorithm designed to solve the zero probability problem of the Ngram model; when it judges that there is a zero probability risk, the smoothing algorithm usually uses a stable but inaccurate low-order Ngram model probability to fit the unstable but accurate high-order Ngram model probability.
  • Viterbi algorithm: a dynamic programming algorithm for finding the Viterbi path, that is, the sequence of hidden states that is most likely to produce the sequence of observed events, especially in the context of Markov information sources and hidden Markov models; it is often used in speech recognition, keyword recognition, computational linguistics and bioinformatics; the Viterbi algorithm can also be called the Finite State Transducers (FST) algorithm.
  • the Ngram model is introduced in detail below.
  • the Ngram model makes the Markov assumption that the probability of the current word is only related to a limited number of N words.
  • when N takes different values, a series of specific Ngram models are obtained.
  • the smoothing algorithm can be simply understood as follows: when the probability given by the Ngram model is 0, the product of a certain weight and the probability given by the (N-1)-gram model is used as the probability of the N-gram model.
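  • a minimal sketch of this kind of back-off smoothing for a bigram model; the function name, count structures and the back-off weight value are illustrative assumptions, not details specified by the embodiment:

```python
def backoff_bigram_prob(w_prev, w, bigram_counts, unigram_counts, total_words, alpha=0.4):
    """P(w | w_prev) with simple back-off: if the bigram was never seen,
    fall back to alpha * unigram probability instead of returning 0."""
    if bigram_counts.get((w_prev, w), 0) > 0:
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
    # zero-probability risk: back off to the lower-order (unigram) model
    return alpha * unigram_counts.get(w, 0) / total_words
```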
  • the Ngram model is described below with a specific example.
  • the bottom line represents pinyin nodes
  • the upper four lines of nodes are Chinese characters corresponding to pinyin nodes. These Chinese characters constitute various possibilities for user input.
  • the probability of each Chinese character node can be calculated by using the Ngram model; since the probability of a Chinese character node is actually the conditional probability of that node occurring given the previous N Chinese character nodes, this probability can also be regarded as the path transition probability between Chinese character nodes.
  • the Ngram model can be used to calculate the probabilities P (Ya
  • The following describes the pretrained language model (PLM) and the Bert model.
  • the pre-trained language model is an important general model that has emerged in recent years in the field of natural language processing (NLP) and is an important technical means of artificial intelligence (AI).
  • the pre-trained language model mainly includes three aspects: network structure, learning paradigm and (pre-)training tasks.
  • the network structure of the pre-trained language model adopts the network structure of the encoder part of the Transformer network.
  • the encoder part includes an input layer, a self-attention layer, a feed-forward layer, and a normalization layer.
  • the Bert model is based on the encoder part of the Transformer network and uses the "pre-training + fine-tuning" learning paradigm, that is, it first learns a basic model using pre-training tasks on a large amount of unlabeled corpus, and then fine-tunes the basic model on the target task.
  • the pre-training tasks mainly refer to the Masked Language Model task and the Next Sentence Prediction task.
  • the system architecture of the embodiment of the present application includes a training phase and an application phase, which will be described below using Chinese as an example.
  • the Chinese character corpus is passed through the word segmentation device to obtain the word segmentation data.
  • the word segmentation data is used to train the Ngram model; the corpus is converted from Chinese characters to pinyin through a phonetic converter to obtain the pinyin corpus.
  • the encoder is trained to encode pinyin into vectors; since the encoder also uses the encoder part of the Transformer network, it is similar to the existing Bert model but is used for encoding pinyin, so the encoder can also be called a Pinyin Bert model.
  • the Pinyin Bert model is combined with the Ngram model, and then combined with various external resource thesauruses, such as a basic thesaurus, a phrase thesaurus, a user thesaurus and various domain thesauruses (Figure 4 shows domain thesaurus 1, domain thesaurus 2 and domain thesaurus 3), to obtain an input engine, which is used to prompt corresponding words and sentences in response to the pinyin input by the user.
  • the model training method provided by the embodiment of the present application will be introduced from the training stage first.
  • the embodiment of the present application provides an embodiment of a model training method, which can be applied to multiple languages such as Chinese, Japanese, and Korean. Since the process of model training requires a large amount of computation, this Embodiments are typically performed by a server.
  • this embodiment includes:
  • Step 101 acquire a sample character string sequence.
  • the sample character string sequence includes K sample character strings, where K is a positive integer.
  • a character string can be understood as a combination of characters, which is a carrier of language information and is used to generate words and sentences; the words and sentences may be one word or multiple words, and a single word may also constitute the words and sentences.
  • the above-mentioned input scenario can be an input scenario of multiple languages such as Chinese, Japanese and Korean; the form of the character string differs for different languages; taking Chinese as an example, the character string can include one pinyin or multiple pinyins, and in this case the character string can also be called a pinyin string, for example, the character string can be "nuoyafangzhou".
  • a sample character string refers to a character string used as a sample and used for training.
  • Each sample character string indicates one or more sample candidate words, and the sample candidate words may be one character or multiple characters.
  • for example, when the sample character string is "nuo", the corresponding sample candidate words can be "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; when the sample character string is "ya", the corresponding sample candidate words can be "亚" (Asia), "压" (pressure), "呀" (ah) and so on.
  • step 101 includes: acquiring K sample character strings in the sample character string sequence based on the K target sample words.
  • the target sample word can be converted from Chinese characters to pinyin by a phonetic converter to obtain the sample character string.
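  • as a sketch of this Chinese-character-to-pinyin step, one publicly available phonetic converter that could play this role is the third-party pypinyin library; using it here, and the example words, are illustrative assumptions, not details specified by the embodiment:

```python
from pypinyin import lazy_pinyin   # third-party pinyin converter, used only as an example

target_sample_words = "诺亚方舟很棒"          # target sample words (Chinese characters)
sample_strings = lazy_pinyin(target_sample_words)
print(sample_strings)                        # ['nuo', 'ya', 'fang', 'zhou', 'hen', 'bang']
```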
  • Step 102 Obtain K first sample character string vectors through an encoder according to the sample character string sequence, and each first sample character string vector corresponds to a sample character string.
  • the encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can adopt the network structure of the encoder part of the Transformer network , or adopt the network structure of a series of other networks obtained from the encoder part of the Transformer network.
  • although the network structure of the encoder in the embodiment of this application is similar to that of the Bert model, in that both use the network structure of the encoder part of the Transformer network, the two actually differ considerably; the following illustrates this through multiple comparisons.
  • as shown in Figure 6, the original input of the encoder in the embodiment of the present application is different from that of the Bert model.
  • the model on the left represents the Bert model: its original input is two Chinese sentences, "Noah's Ark" and "Great", separated by the separator "SEP", and the original input also includes the label "CLS" used for text classification; the model on the right represents the encoder in the embodiment of the present application: its original input is no longer two Chinese sentences but the sample character string sequence "nuo ya fang zhou hen bang", which does not require the separator "SEP", and, since the encoder does not need to classify text, the original input of the encoder does not need the token "CLS" either.
  • specifically, step 102 includes: obtaining K second position vectors and K second sample character string vectors according to the sample character string sequence, and obtaining the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
  • each second position vector represents the position of a sample character string in the sample character string sequence; taking the sample character string sequence "nuo ya fang zhou hen bang" as an example, the second position vector corresponding to the sample character string "fang" indicates the position of "fang" in the sample character string sequence "nuo ya fang zhou hen bang".
  • Each second sample character string vector represents a sample character string, wherein the second sample character string vector can be obtained through random initialization, or can be obtained through pre-training using an algorithm such as Word2Vector.
  • the second sample character string vector is different from the first sample character string vector: the second sample character string vector is generated based only on one sample character string, so it contains only the information of that sample character string itself, whereas the first sample character string vector is generated by the encoder, which fuses the information of multiple sample character strings in the process of generating it; therefore, the first sample character string vector contains not only the information of the sample character string itself but also the information of other sample character strings.
  • as shown in Figure 7, the left side of Figure 7 represents the direct input of the Bert model (that is, what the original input is converted into), which specifically includes three embedding layers; corresponding to the original input shown in Figure 6, these three embedding layers, from bottom to top, are the position embedding layer, the segment embedding layer and the token embedding layer, where the position embedding is used to distinguish the different positions of a token in the sequence; the segment embedding is used to distinguish whether the token belongs to the first Chinese sentence ("Noah's Ark") or the second Chinese sentence ("Great"), in preparation for the Next Sentence Prediction task; and the token embedding represents the semantics of the token.
  • the token is a Chinese character in a Chinese sentence.
  • for example, the token can be the Chinese character "诺" (nuo); the token can also be "SEP" or "CLS".
  • the right side of Figure 7 shows the direct input of the encoder in the embodiment of the present application, which specifically includes the position embedding layer and the token embedding layer but does not include a segment embedding layer, where the position embedding is used to distinguish the different positions of a token in the sequence, and the token embedding represents the semantics of the token.
  • the token is a pinyin or multiple pinyins, for example, the token can be "nuo" or "ya”.
  • E0 in the position embedding layer represents the position vector of "nuo"
  • Enuo in the token embedding layer represents the character vector of "nuo”.
  • the length of each direct input of the encoder in the embodiment of the present application is smaller than the length of each direct input of the Bert model.
  • specifically, the length of the original input of the Bert model should cover most documents or sentences and is usually set to 512 tokens; correspondingly, the length of the direct input of the Bert model is also 512 tokens (Figure 7 shows only 9 tokens); the encoder in the embodiment of the present application, however, is ultimately intended for the input method, that is, for receiving the user's input on the terminal device, and generally speaking the user's input is relatively short.
  • the length of the original input of the encoder in the embodiment of the present application does not need to be too long, and is usually set to 16 or 32 tokens (only 6 tokens are shown in FIG. 7 ), correspondingly, the length of the direct input of the encoder in the embodiment of the present application is also 16 or 32 tokens.
  • the length of the direct input of the encoder is small, so the number of parameters input to the encoder is small; and, taking the character string as pinyin as an example, the total number of pinyin is much smaller than the total number of Chinese characters, so the number of tokens that the encoder needs to process The total number is small; this can reduce the workload in the training process and improve the training efficiency.
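  • a minimal sketch of an encoder of this kind, built only from a position embedding, a pinyin token embedding and the encoder part of a Transformer network, with a linear layer standing in for the probability model; the use of PyTorch, and all layer sizes and names, are assumptions of this illustration, not details specified by the embodiment:

```python
import torch
import torch.nn as nn

class PinyinEncoder(nn.Module):
    """Encodes a pinyin (character string) sequence into first character string vectors
    and maps them to per-position probabilities over candidate Chinese characters."""
    def __init__(self, num_pinyin, num_chars, d_model=128, max_len=32, n_layers=4, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(num_pinyin, d_model)    # second character string vector
        self.pos_emb = nn.Embedding(max_len, d_model)         # position vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.prob_model = nn.Linear(d_model, num_chars)        # "probability model" head

    def forward(self, pinyin_ids):                             # (batch, seq_len) pinyin indices
        positions = torch.arange(pinyin_ids.size(1), device=pinyin_ids.device)
        x = self.token_emb(pinyin_ids) + self.pos_emb(positions)   # no segment / "CLS" / "SEP" input
        string_vectors = self.encoder(x)                       # first character string vectors
        return torch.log_softmax(self.prob_model(string_vectors), dim=-1)
```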
  • Step 103 based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings.
  • the second probability of the sample candidate word represents the probability of obtaining the sample candidate word according to the first sample character string vector.
  • step 103 may also include:
  • the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model.
  • the K first sample character string vectors may be input into the probability model, and the probability model will output the second probability.
  • the probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half of the deep learning model.
  • Step 104 adjust the encoder based on the second probability.
  • step 104 includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • for example, if the sample character string sequence is "nuoyafangzhouhenbang", then for the sample character string "nuo", the corresponding sample candidate words include "诺" (nuo), "糯" (waxy), "懦" (cowardly) and so on; if "诺" is the target sample word, then the parameters of the encoder can be adjusted so that the second probability of "诺" increases while the second probabilities of "糯" and "懦" decrease.
  • the target sample word is equivalent to the sample label.
  • during training, the second probability of the target sample word is increased as much as possible, while the second probabilities of the sample candidate words other than the target sample word are reduced as much as possible; ideally, after adjusting the parameters of the encoder, the second probability of the target sample word is greater than the second probabilities of the other sample candidate words.
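  • one common way to realize this kind of adjustment is a cross-entropy loss over the candidate words at each position, which raises the probability of the target sample word and lowers the others; the sketch below reuses the PinyinEncoder sketched earlier and assumes an Adam optimizer and illustrative vocabulary sizes, none of which are specified by the embodiment:

```python
import torch
import torch.nn.functional as F

model = PinyinEncoder(num_pinyin=500, num_chars=8000)   # class from the earlier sketch; sizes illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(pinyin_ids, target_char_ids):
    """pinyin_ids: (batch, K) sample string indices; target_char_ids: (batch, K) target sample words."""
    log_probs = model(pinyin_ids)                              # (batch, K, num_chars) second probabilities
    loss = F.nll_loss(log_probs.view(-1, log_probs.size(-1)),  # cross-entropy: pushes up the target
                      target_char_ids.view(-1))                # word's probability, pushes down the rest
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                           # adjusts encoder and probability-model parameters
    return loss.item()
```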
  • Step 105 adjust the probability model based on the second probability.
  • step 105 includes: adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
  • the process of adjusting the parameters of the probability model is similar to the process of adjusting the parameters of the encoder. For details, refer to the related description of step 104 for understanding.
  • step 105 is optional, specifically, step 105 is performed when step 103 is realized by a probability model.
  • step 102 to step 105 are repeatedly executed until a condition is met, at which point training stops; the embodiment of the present application does not specifically limit the content of the condition; for example, the condition may be that the value of the loss function is less than a threshold, where the value of the loss function may be calculated according to the second probability, or the condition may be that the number of repeated executions reaches a preset number of times.
  • in the embodiment of the present application, the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors, each of which is a representation of a sample character string that fuses the information of the entire sample character string sequence rather than representing only the sample character string itself, that is, the first sample character string vector contains more information; so the second probability of the target sample word is calculated based on the first sample character string vectors, and adjusting the encoder and the probability model based on the second probability can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
  • the Ngram model may also be used in the process of generating words and sentences using the method for generating words and sentences provided by the embodiment of the present application; therefore, the training process of the Ngram model is described below.
  • the training process of the Ngram model can be understood as the process of calculating the conditional probability between words.
  • the Chinese corpus is first converted into a sequence of Chinese words through a tokenizer, and then the conditional probabilities between words are counted using statistical methods; for example, after tokenization, the Chinese word sequence "Huawei/company/recently/released/latest/flagship phone" is obtained.
  • in the formula, P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}), where C(w_{n-1}) is the total number of occurrences of the word w_{n-1} in the whole corpus and C(w_{n-1}, w_n) is the number of times the two words w_{n-1} and w_n occur together in the whole corpus; correspondingly, the probabilities of higher-order Ngram models can be calculated in the same way from the corresponding counts.
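  • a minimal sketch of this counting step on a tokenized corpus; the toy corpus and function names are purely illustrative, not data from the embodiment:

```python
from collections import Counter

tokenized_corpus = [
    ["华为", "公司", "近日", "发布", "最新", "旗舰手机"],   # toy word sequences after the tokenizer
    ["华为", "公司", "发布", "旗舰手机"],
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in tokenized_corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(w_prev, w):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev), the 2-gram conditional probability."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("公司", "发布"))   # 0.5: "发布" follows "公司" in one of the two sentences
```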
  • the embodiment of the present application provides an embodiment of a method for generating words and sentences, which can be applied to input method systems in multiple languages such as Chinese, Japanese, and Korean; the input method system can be deployed in terminal devices , can also be deployed in the cloud server; when the input method system is deployed in the cloud server, this embodiment is executed by the cloud server, and the cloud server sends the generated target words to the terminal device for display on the terminal device.
  • this embodiment includes:
  • Step 201 obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, wherein, M is a positive integer.
  • step 201 may include: obtaining a character string sequence according to user input.
  • step 201 can be understood by referring to the relevant description of step 101 for details.
  • generally, a character string indicates multiple candidate words; in the special case where only one candidate word corresponds to the character string, the character string indicates one candidate word.
  • Step 202 Obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings.
  • the encoder is trained based on a conversion task, wherein the conversion task is a task of converting sample character string sequences into sample words and sentences.
  • the training process based on the conversion task can be understood as the training process of the encoder in the training phase.
  • for details, please refer to the relevant description of the training phase above for understanding.
  • specifically, step 202 includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and obtaining the M first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
  • Step 202 is similar to step 102 and can be understood with reference to the relevant description of step 102, except that the number M of first character string vectors in step 202 may be different from the number K of first sample character string vectors.
  • Step 203 based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained.
  • step 203 includes:
  • the first probability of each candidate word indicated by the M character strings is obtained through a probability model, and the probability model is trained based on the conversion task.
  • the conversion task is the task of converting sample character string sequences into sample words and sentences.
  • the training process based on the conversion task can be understood as the training process of the probability model in the training phase.
  • for details, please refer to the relevant description of the training phase above for understanding.
  • Step 203 is similar to step 103 and can be understood with reference to the relevant description of step 103, except that the number M of first character string vectors in step 203 may be different from the number K of first sample character string vectors.
  • Step 204 based on the first probability, generate target words and sentences, the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
  • For each character string, a candidate word can be selected, based on the first probability, from all candidate words corresponding to that character string; thus, for M character strings, M candidate words can be selected, and these M candidate words form the target words and sentences.
  • For example, for each character string, the candidate word with the highest first probability is selected from all the candidate words corresponding to that character string to generate the target words and sentences.
  • For example, each of the strings "nuo", "ya", "fang", "zhou", "hen" and "bang" indicates three candidate words; for the string "nuo", the candidate word "诺" with the highest first probability is chosen.
  • Similarly, the candidate words with the highest first probability indicated by "ya", "fang", "zhou", "hen" and "bang" are chosen, namely "亚", "方", "舟", "很" and "棒"; based on this, the target sentence "诺亚方舟很棒" ("Noah's Ark is great") can be generated.
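A tiny sketch of this greedy selection, using made-up first probabilities for the example above:

```python
# First probabilities per string (hypothetical values), as produced by step 203.
first_probs = {
    "nuo":  {"诺": 0.7, "糯": 0.2, "懦": 0.1},
    "ya":   {"亚": 0.6, "压": 0.3, "呀": 0.1},
    "fang": {"方": 0.8, "房": 0.1, "放": 0.1},
    "zhou": {"舟": 0.5, "州": 0.3, "周": 0.2},
    "hen":  {"很": 0.9, "恨": 0.05, "狠": 0.05},
    "bang": {"棒": 0.6, "帮": 0.3, "绑": 0.1},
}

# Greedy generation: pick the highest-first-probability candidate for every string.
target = "".join(max(cands, key=cands.get) for cands in first_probs.values())
print(target)  # 诺亚方舟很棒
```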
  • Step 205 prompting the target word and sentence as the preferred word and sentence, which is the first word and sentence among the multiple words and sentences prompted by the input method.
  • In the input scenario, the terminal device will prompt multiple words and sentences, and the embodiment of the present application uses the target words and sentences as the preferred words and sentences for prompting; taking Figure 1 as an example, the terminal device prompts three words and sentences, among which the preferred one is "Noah's Ark is great".
  • the encoder and the Ngram model can be combined to generate target words and sentences based on the first probability output by the encoder and using the Ngram model, so as to improve the accuracy of the generated target words and sentences.
  • The embodiment of the present application can be regarded as converting the pinyin sequence y_1, y_2, ..., y_n into the corresponding word sequence w_1, w_2, ..., w_n (which can also be understood as words and sentences), that is, selecting from all word sequences the word sequence with the largest conditional probability P(w_1, w_2, ..., w_n | y_1, y_2, ..., y_n).
  • The above conditional probability P(w_1, w_2, ..., w_n | y_1, y_2, ..., y_n) can be converted into the product of the per-word conditional probabilities, and each per-word term P(w_i | y_1, y_2, ..., y_n, w_1, w_2, ..., w_{i-1}) can be further decomposed into two factors, as follows:
  • P(w_i | y_1, y_2, ..., y_n) is the first probability calculated above;
  • P(w_i | w_1, w_2, ..., w_{i-1}) is the probability calculated by the Ngram model.
  • Here the Markov assumption of the Ngram model is adopted, so the probability P(w_i | w_1, w_2, ..., w_{i-1}) is approximated by P(w_i | w_{i-N+1}, ..., w_{i-1}).
  • the first probability calculated above can be combined with the conditional probability calculated by the Ngram model to obtain a more accurate probability of words, thereby prompting more accurate target words and sentences.
  • step 204 includes: according to the character string sequence, obtaining, through the Ngram model, the third probability of each candidate word indicated by the M character strings; and generating the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • The third probability of a candidate word is the conditional probability that the candidate word occurs given the previous N candidate words, where the value of N can be set according to actual needs; for example, N can be 1 or 2.
  • Specifically, the first probability and the third probability corresponding to a candidate word can be multiplied to obtain a combined probability (which is also a conditional probability), and the target words and sentences are generated based on the combined probabilities and the Viterbi algorithm.
  • For example, the combined probability of the Chinese character "方" indicated by the string "fang" can be obtained by multiplying its first probability by its third probability obtained from the Ngram model.
  • the combination probability of all Chinese characters can be obtained, and then the Viterbi algorithm can be used to obtain a path with the highest probability, that is, the target sentence.
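The following sketch shows one way the combined probabilities and the Viterbi algorithm could be used together, assuming a bigram Ngram model (the third probability conditions on one previous word) and made-up probability tables; in practice these values would come from the encoder/probability model and the Ngram model.

```python
def viterbi(strings, first_probs, bigram_probs, start="<s>"):
    """Find the candidate sequence maximizing the product of combined probabilities.

    first_probs[s][w]       : first probability of candidate w for string s (from the encoder).
    bigram_probs[(prev, w)] : third probability P(w | prev) from the Ngram model.
    """
    # best[w] = (score of the best path ending in w, that path)
    best = {start: (1.0, [])}
    for s in strings:
        new_best = {}
        for w, p1 in first_probs[s].items():
            for prev, (score, path) in best.items():
                p3 = bigram_probs.get((prev, w), 1e-6)   # small floor for unseen bigrams
                cand_score = score * p1 * p3             # combined probability
                if w not in new_best or cand_score > new_best[w][0]:
                    new_best[w] = (cand_score, path + [w])
        best = new_best
    return max(best.values())[1]

# Hypothetical tables for the partial example "hen bang"
first_probs = {"hen": {"很": 0.6, "恨": 0.4}, "bang": {"棒": 0.5, "帮": 0.5}}
bigram_probs = {("<s>", "很"): 0.3, ("<s>", "恨"): 0.3, ("很", "棒"): 0.4, ("恨", "帮"): 0.1}
print(viterbi(["hen", "bang"], first_probs, bigram_probs))  # ['很', '棒']
```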
  • the dictionary can also be called a thesaurus
  • The thesaurus can include at least one of the following types: a basic thesaurus, a phrase thesaurus, a user personal thesaurus, a hotspot thesaurus, and various domain thesauruses; a domain thesaurus may be, for example, a thesaurus in the field of operating systems, a thesaurus in the field of artificial intelligence technology, and the like.
  • step 204 includes:
  • Step 301 obtain reference words from a reference dictionary.
  • the reference words include P candidate words indicated by P reference character strings, each reference character string indicates a candidate word, the P reference character strings are included in the character string sequence, and the positions in the character string sequence are continuous, wherein, P is an integer greater than 1.
  • the embodiment of the present application does not specifically limit the number of reference words, and the number of reference words may be one or multiple.
  • For example, the character string sequence is "nuoyafangzhouhenbang"; as shown in FIG. 12, the reference word obtained from the reference dictionary may be "Noah's Ark" indicated by the reference character string "nuoyafangzhou".
  • Step 302 Calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words.
  • the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word.
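For instance, the geometric mean can be computed as below (a sketch with made-up first probabilities for the four candidate words of "Noah's Ark"):

```python
import math

def fourth_probability(first_probs_of_reference):
    """Geometric mean of the first probabilities of the P candidate words in the reference word."""
    p = len(first_probs_of_reference)
    return math.prod(first_probs_of_reference) ** (1.0 / p)

# Hypothetical first probabilities of 诺, 亚, 方, 舟 for the reference word "Noah's Ark"
print(fourth_probability([0.7, 0.6, 0.8, 0.5]))  # ≈ 0.64, larger than the product 0.168
```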
  • Step 303 Generate target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings.
  • Specifically, based on the fourth probability and the first probabilities, the probabilities of all first word combinations formed by the reference word and the candidate words indicated by the other character strings can be calculated;
  • based on the first probability of each candidate word, the probabilities of all second word combinations formed by the candidate words indicated by each character string can be obtained; finally, the word combination with the highest probability is selected from all first word combinations and all second word combinations as the target words and sentences.
  • the reference word "Noah's Ark" and the three candidate words indicated by the character string "hen” and the three candidate words indicated by the character string “bang” form nine first word combinations, based on the fourth probability,
  • the first probabilities of the three candidate words indicated by the character string "hen” and the first probabilities of the three candidate words indicated by the character string "bang” can calculate the probabilities of the nine first word combinations.
  • In addition, each of the six character strings corresponds to three candidate words, forming a total of 3*3*3*3*3*3 = 729 second word combinations; the probability of each second word combination can be calculated according to the first probabilities of the candidate words.
  • Finally, the word combination with the highest probability is selected from the 9 first word combinations and the 729 second word combinations as the target words and sentences.
  • It can be seen that the first word combinations are included in the second word combinations; since the first word combinations contain the reference word, and the reference word comes from the reference dictionary, word combinations containing the reference word can be preferentially selected as the target words and sentences.
  • For this purpose, the calculation method of the fourth probability can be set in step 302 so that the fourth probability of the obtained reference word is greater than the product of the first probabilities of the candidate words in the reference word; in this way, the probability of a word combination containing the reference word becomes larger, and it can therefore be preferentially selected.
  • the fourth probability of the reference word is greater than the product of the first probabilities of the P candidate words in the reference word.
  • In addition, since the first word combinations are included in the second word combinations, when using the first probabilities to calculate the probabilities of the second word combinations, the probabilities of the first word combinations do not need to be recalculated; the first probabilities are only used to calculate the probabilities of the second word combinations other than the first word combinations.
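The following sketch enumerates the second word combinations and then scores the first word combinations with the fourth probability, overwriting the corresponding entries so they are not computed twice; all probability values are made up, and a real implementation would use the Viterbi algorithm rather than full enumeration.

```python
import math
from itertools import product as cartesian

# Hypothetical first probabilities for all six strings of "nuoyafangzhouhenbang"
first_probs = {
    "nuo": {"诺": 0.7, "糯": 0.3}, "ya": {"亚": 0.6, "压": 0.4},
    "fang": {"方": 0.8, "房": 0.2}, "zhou": {"舟": 0.5, "周": 0.5},
    "hen": {"很": 0.6, "恨": 0.4}, "bang": {"棒": 0.5, "帮": 0.5},
}
ref_word = ("诺", "亚", "方", "舟")                       # reference word from the reference dictionary
fourth_prob = math.prod([0.7, 0.6, 0.8, 0.5]) ** 0.25    # geometric mean ≈ 0.64

combos = {}
# Second word combinations: one candidate per string, probability = product of first probabilities.
for words in cartesian(*(d.keys() for d in first_probs.values())):
    combos[words] = math.prod(d[w] for d, w in zip(first_probs.values(), words))
# First word combinations: the reference word as a whole, scored with the fourth probability;
# these overwrite the corresponding second combinations, which need not be kept.
for tail in cartesian(first_probs["hen"], first_probs["bang"]):
    words = ref_word + tail
    combos[words] = fourth_prob * first_probs["hen"][tail[0]] * first_probs["bang"][tail[1]]

best = max(combos, key=combos.get)
print(best)  # ('诺', '亚', '方', '舟', '很', '棒') – the combination containing the reference word wins
```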
  • the insufficiency of the encoder and the probability model is made up for by adding a reference lexicon, so that the accuracy of the target words and sentences can be improved.
  • the encoder, the reference lexicon and the Ngram model can be combined to generate the target words and sentences.
  • step 303 includes: obtaining, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and generating the target words and sentences based on the first probability of each candidate word indicated by those other character strings, the fourth probability, the fifth probabilities and the Viterbi algorithm.
  • The embodiment of the present application regards all candidate words in the reference word as a whole, so that the conditional probabilities between the candidate words inside the reference word do not need to be calculated through the Ngram model, and only the fifth probability of the reference word needs to be calculated through the Ngram model;
  • in the process of calculating the fifth probability of the reference word, the fifth probability of the candidate word ranked first in the reference word can be calculated and used as the fifth probability of the reference word.
  • the reference word is "Noah's Ark"
  • the fourth probability of "Noah's Ark” can be calculated by step 302
  • the three candidate words indicated by the character string "hen” can be calculated by step 203
  • the first probability the first probability of the three candidate words indicated by the string "bang”
  • the fifth probability of " is used as the fifth probability of the reference word "Noah's Ark”
  • the fifth probability of the three candidate words indicated by the string "hen” and the fifth probability of the three candidate words indicated by the string "bang” are calculated through the Ngram model.
  • the fifth probability finally, based on the first probability of each candidate word indicated by other strings in the string sequence except P reference strings, the fourth probability, the fifth probability and the Viterbi algorithm can obtain the most probable word combination , and the word combination with the highest probability is used as the target word.
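A compact sketch of this combination: the lattice below has a single edge for the reference word "诺亚方舟" spanning its four strings (scored with the fourth probability) and ordinary per-string edges for "hen" and "bang" (scored with their first probabilities), while the fifth probabilities come from a bigram Ngram table; all numbers are made up for illustration.

```python
# Lattice edges: (start position, end position, word, emission probability).
edges = [
    (0, 4, "诺亚方舟", 0.64),                      # reference word, scored with its fourth probability
    (4, 5, "很", 0.6), (4, 5, "恨", 0.4),          # first probabilities of the candidates of "hen"
    (5, 6, "棒", 0.5), (5, 6, "帮", 0.5),          # first probabilities of the candidates of "bang"
]
# Fifth probabilities from a hypothetical bigram Ngram model; the entry for "诺亚方舟"
# stands for the fifth probability of its first candidate word "诺".
ngram = {("<s>", "诺亚方舟"): 0.3, ("诺亚方舟", "很"): 0.5, ("诺亚方舟", "恨"): 0.1,
         ("很", "棒"): 0.4, ("恨", "帮"): 0.1}

n = 6
# Viterbi over the lattice: state = (position reached, last word chosen).
states = {(0, "<s>"): (1.0, [])}
for pos in range(n):
    for start, end, word, emit in edges:
        if start != pos:
            continue
        for (p, prev), (score, path) in list(states.items()):
            if p != pos:
                continue
            s = score * emit * ngram.get((prev, word), 1e-6)
            if (end, word) not in states or s > states[(end, word)][0]:
                states[(end, word)] = (s, path + [word])

best_score, best_path = max((v for (p, _), v in states.items() if p == n), key=lambda v: v[0])
print(best_path)  # ['诺亚方舟', '很', '棒']
```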
  • Since the reference dictionary provides the reference word, when the Ngram model is used to calculate the probability of a candidate word that follows the reference word, if candidate words indicated by the reference character strings are needed as context, only the candidate words contained in the reference word need to be considered.
  • the target character string is a character string that is ranked after the P reference character strings in the character string sequence.
  • The fifth probability of each candidate word indicated by the target string is the conditional probability that the candidate word indicated by the target string occurs given the occurrence of Q candidate words, where Q is a positive integer whose specific value depends on the Ngram model used.
  • The Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings contain a reference character string, the Q candidate words contain the candidate word in the reference word indicated by that reference character string.
  • the fifth probability of "hen” represents the conditional probability under the occurrence of the candidate word " ⁇ ";
  • the fifth probability of "mark” indicates that a candidate word (such as hate) indicated by the candidate word "zhou” and the character string "hen” appears The conditional probability for the case of .
  • The embodiment of the present application also provides a device for generating words and sentences, including: a first acquisition unit 401, configured to acquire a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer; a first encoding unit 402, configured to obtain, according to the character string sequence and through an encoder, M first character string vectors, each of which corresponds to one of the M character strings; a second acquisition unit 403, configured to obtain, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit 404, configured to generate, based on the first probability, target words and sentences, where the target words and sentences include M target words and each target word is one of the one or more candidate words indicated by each character string.
  • The first encoding unit 402 is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain, according to the M first position vectors and the M second character string vectors, the M first character string vectors through the encoder.
  • the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
  • The second acquisition unit 403 is configured to obtain, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, where the probability model is trained based on the conversion task;
  • the conversion task is the task of converting sample character string sequences into sample words and sentences.
  • The generation unit 404 is configured to obtain, according to the character string sequence and through the Ngram model, the third probability of each candidate word indicated by the M character strings, and to generate the target words and sentences based on the first probability, the third probability and the Viterbi algorithm.
  • The generation unit 404 is configured to acquire a reference word from the reference dictionary, where the reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, the P reference character strings are contained in the character string sequence, and their positions in the character string sequence are continuous, P being an integer greater than 1; to calculate, based on the respective first probabilities of the P candidate words, the fourth probability of the reference word; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
  • The generation unit 404 is configured to obtain, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word; and to generate the target words and sentences based on the first probability of each candidate word indicated by those other character strings, the fourth probability, the fifth probabilities and the Viterbi algorithm.
  • The target character string is a character string that is ranked after the P reference character strings in the character string sequence;
  • the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given the occurrence of Q candidate words, where Q is a positive integer;
  • the Q candidate words include one candidate word indicated by each of the Q consecutive character strings before the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate word in the reference word indicated by that reference character string.
  • the device further includes a prompting unit 405, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  • the character string includes one pinyin or multiple pinyins.
  • The embodiment of the present application also provides a model training device, including: a third acquisition unit 501, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer; a second encoding unit 502, configured to obtain, according to the sample character string sequence and through an encoder, K first sample character string vectors, each of which corresponds to one sample character string; a fourth acquisition unit 503, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit 504, configured to adjust the encoder based on the second probability.
  • The second encoding unit 502 is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain, according to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors through the encoder.
  • The sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
  • the fourth obtaining unit 503 is configured to obtain the second probability of each sample candidate word indicated by the K sample strings through a probability model based on the K first sample string vectors; adjust Unit 504 is further configured to adjust the probability model based on the second probability.
  • The sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases, and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
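As a sketch of how such an adjustment of both the encoder and the probability model could be carried out, the snippet below uses a cross-entropy loss, which raises the second probability of each target sample word and lowers those of the other sample candidate words; PyTorch, the vocabulary sizes and the toy ids are assumptions rather than the patent's concrete training setup.

```python
import torch
import torch.nn as nn

d_model, vocab_words, vocab_strings = 128, 5000, 1000    # assumed sizes
string_emb = nn.Embedding(vocab_strings, d_model)        # second sample string vectors
pos_emb = nn.Embedding(32, d_model)                      # second position vectors
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
prob_head = nn.Linear(d_model, vocab_words)              # probability model (softmax over words)
params = [*string_emb.parameters(), *pos_emb.parameters(),
          *encoder.parameters(), *prob_head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

sample_string_ids = torch.tensor([[0, 1, 2, 3, 4, 5]])       # K = 6 sample strings (toy ids)
target_word_ids = torch.tensor([[10, 11, 12, 13, 14, 15]])   # one target sample word per string

x = string_emb(sample_string_ids) + pos_emb(torch.arange(6))
logits = prob_head(encoder(x))                               # (1, 6, vocab_words) second-probability logits
# Cross-entropy increases the second probability of each target sample word
# and decreases the second probabilities of the other sample candidate words.
loss = nn.functional.cross_entropy(logits.view(-1, vocab_words), target_word_ids.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```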
  • the third acquiring unit 501 is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
  • the sample character string includes one pinyin or multiple pinyins.
  • Figure 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • The computer device can be a terminal device or a server, and is specifically used to implement the function of the word and sentence generation device in the embodiment corresponding to Figure 13, or the function of the model training device in the embodiment corresponding to Figure 14.
  • the computer equipment 1800 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 1822 (for example, one or more than one processor) and memory 1832, and one or more storage media 1830 (such as one or more mass storage devices) for storing application programs 1842 or data 1844.
  • the memory 1832 and the storage medium 1830 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device. Furthermore, the central processing unit 1822 may be configured to communicate with the storage medium 1830 , and execute a series of instruction operations in the storage medium 1830 on the computer device 1800 .
  • Computer device 1800 can also include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input and output interfaces 1858, and/or, one or more operating systems 1841, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • The central processing unit 1822 may be used to execute the word and sentence generation method performed by the word and sentence generation device in the embodiment corresponding to FIG. 13.
  • the central processing unit 1822 can be used for:
  • obtain a character string sequence, where the character string sequence includes M character strings, each character string indicates one or more candidate words, and M is a positive integer;
  • according to the character string sequence, obtain M first character string vectors through an encoder, where each first character string vector corresponds to one of the M character strings;
  • based on the M first character string vectors, obtain the first probability of each candidate word indicated by the M character strings;
  • based on the first probability, generate target words and sentences, where the target words and sentences include M target words, and each target word is one of the one or more candidate words indicated by each character string.
  • the central processing unit 1822 may be used to execute the model training method performed by the model training device in the embodiment corresponding to FIG. 14 .
  • the central processing unit 1822 can be used for:
  • obtain a sample character string sequence, where the sample character string sequence includes K sample character strings, each sample character string indicates one or more sample candidate words, and K is a positive integer;
  • according to the sample character string sequence, obtain K first sample character string vectors through an encoder, where each first sample character string vector corresponds to one sample character string;
  • based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings;
  • based on the second probability, adjust the encoder.
  • the embodiment of the present application also provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the methods of the foregoing embodiments.
  • Optionally, the chip includes a memory, and the processor is connected to the memory through a circuit or wires. Further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors may implement some of the steps in the above method through dedicated hardware, for example, the processing related to the neural network model may be performed by a dedicated neural network processor or graphics processor to achieve.
  • the method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
  • The embodiment of the present application also provides a computer storage medium, which is used for storing the computer software instructions used by the above-mentioned computer device, including a program designed to be executed by the computer device.
  • the computer device may be the word-sentence generating device in the embodiment corresponding to FIG. 13 or the model training device in the embodiment corresponding to FIG. 14 .
  • the embodiment of the present application also provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the procedures in the methods shown in the foregoing embodiments.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the embodiments of the present application are a word or sentence generation method, a model training method and a related device in the field of artificial intelligence, which can be used for word or sentence recommendation in an input method. The method comprises: acquiring a character string sequence, wherein the character string sequence comprises M character strings, and each character string indicates one or more candidate words; encoding each character string into a character string vector by means of an encoder, and then, on the basis of the character string vector, acquiring a first probability of each candidate word indicated by the character string; and finally, on the basis of the first probability, generating a target word or sentence, wherein the target word or sentence comprises M target words, and each target word is one of one or more candidate words indicated by each character string. By means of the embodiments of the present application, the accuracy of a generated target word or sentence can be improved, thereby improving the recommendation accuracy of an input method.

Description

一种词句生成方法、模型训练方法及相关设备A word and sentence generation method, model training method and related equipment
本申请要求于2021年07月08日提交中国专利局、申请号为202110775982.1、发明名称为“一种词句生成方法、模型训练方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202110775982.1 and the title of the invention "a method for generating words and sentences, a method for training models and related equipment" submitted to the China Patent Office on July 08, 2021, the entire contents of which are incorporated by reference incorporated in this application.
技术领域technical field
本申请涉及输入法技术领域,尤其涉及一种词句生成方法、模型训练方法及相关设备。The present application relates to the technical field of input methods, in particular to a method for generating words and sentences, a method for training models and related equipment.
背景技术Background technique
输入法编辑器是客户端必备的应用程序,广泛的应用于台式机、笔记本、手机、平板、智能电视、车载电脑等设备中;并且,用户的日常活动,如:搜索地点、查找餐馆、聊天交友、出行规划等,很大程度上会转化为用户的输入行为,所以利用输入法编辑器的数据能够对用户进行精准的刻画。因此,输入法编辑器在互联网领域,具有重大的战略意义。The input method editor is a necessary application program for the client, and is widely used in desktop computers, notebooks, mobile phones, tablets, smart TVs, car computers and other devices; and the user's daily activities, such as: searching for places, finding restaurants, Chatting and making friends, travel planning, etc., will largely be transformed into user input behaviors, so the data of the input method editor can be used to accurately describe users. Therefore, input method editors have great strategic significance in the Internet field.
在输入场景下,用户在设备上输入字符(例如拼音)后,输入法编辑器会生成词句(词语或句子)并提示该词句以供用户选择,生成的词句的准确率直接影响输入法编辑器的准确率以及用户的体验;为此,需要一种能够准确生成词句的方法。In the input scenario, after the user enters characters (such as pinyin) on the device, the input method editor will generate words (words or sentences) and prompt the words and sentences for the user to choose. The accuracy of the generated words and sentences directly affects the input method editor. The accuracy rate and user experience; for this, a method that can accurately generate words and sentences is needed.
发明内容Contents of the invention
本申请实施例提供了一种词句生成方法、模型训练方法及相关设备,该方法能够提高生成的词句的准确率。Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment. The method can improve the accuracy of generated words and sentences.
本申请实施例第一方面提供了一种词句生成方法,该方法可以应用于终端设备,也可以应用于云端服务器,具体包括:获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语;其中,字符串可以理解为字符的组合,是一种语言信息的载体,承载发音信息,用于生成词语或句子;对应不同种类的语言,字符串的形式不同,以中文为例,字符串可以包括一个拼音或多个拼音,M为正整数;根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;编码器可以理解为一个深度学习网络模型,编码器的网络结构有多种,本申请实施例对此不做具体限定;具体地,编码器的网络结构可以采用Transformer网络的编码器部分的网络结构,或采用由Transformer网络的编码器部分得到的一系列其他网络的网络结构;基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率,候选词语的第一概率可以理解为,在用户输入字符串的情况下,用户从该字符串指示的所有候选词语中选择当前候选词语的概率;基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个,具体地,目标词句可以是一个词语,也可以是一个句子。The first aspect of the embodiment of the present application provides a method for generating words and sentences, which can be applied to terminal devices or cloud servers, and specifically includes: obtaining a character string sequence, the character string sequence includes M character strings, each character A string indicates one or more candidate words; among them, a string can be understood as a combination of characters, which is a carrier of language information, carries pronunciation information, and is used to generate words or sentences; corresponding to different types of languages, the form of a string is different , taking Chinese as an example, the string can include one pinyin or multiple pinyin, and M is a positive integer; according to the string sequence, through the encoder, M first string vectors are obtained, and each first string vector corresponds to M A character string in the character string; the encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can be Transformer The network structure of the encoder part of the network, or the network structure of a series of other networks obtained by the encoder part of the Transformer network; based on the M first character string vectors, obtain the first word of each candidate word indicated by the M character strings One probability, the first probability of a candidate word can be understood as, in the case of a user inputting a character string, the user selects the probability of the current candidate word from all the candidate words indicated by the character string; based on the first probability, a target word and sentence is generated, and the target The words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string. Specifically, the target words and sentences may be a word or a sentence.
通过编码器对字符串序列进行编码,以得到第一字符串向量,该第一字符串向量是融合了整个字符串序列的信息后对字符串的表示,而不仅仅表示字符串本身,即第一字符串向量包含了较多的信息;所以基于第一字符串向量计算目标词语的第一概率,并基于第一概率生成目标词句,能够提高生成的目标词句的准确率,从而提高输入法的准确度。The string sequence is encoded by the encoder to obtain the first string vector, which is the representation of the string after fusing the information of the entire string sequence, not just the string itself, that is, the first A character string vector contains more information; so calculating the first probability of the target word based on the first character string vector and generating the target word and sentence based on the first probability can improve the accuracy of the generated target word and sentence, thereby improving the input method. Accuracy.
作为一种可实现的方式,根据字符串序列,通过编码器,得到M个第一字符串向量包 括:根据字符串序列获取M个第一位置向量和M个第二字符串向量,每个第一位置向量表示一个字符串在字符串序列中的位置,每个第二字符串向量表示一个字符串;根据M个第一位置向量和M个第二字符串向量,通过编码器,得到多个第一字符串向量。As an achievable way, obtaining M first character string vectors through the encoder according to the character string sequence includes: obtaining M first position vectors and M second character string vectors according to the character string sequence, each A position vector represents the position of a character string in the character string sequence, and each second character string vector represents a character string; according to M first position vectors and M second character string vectors, through an encoder, multiple The first string vector.
Bert模型需要根据词语的位置向量、词语的向量、用于区分词语位于第一个句子还是第二个句子的向量,以及与分割符“SEP”和标记“CLS”相关的向量,才能编码得到词语的向量,而在本申请实施例中,仅根据字符串的第一位置向量和第二字符串向量这两种向量,即可通过编码器得到第一字符串向量;因此,本申请实施例中的编码器需要处理的向量更少,编码效率较高,从而提高输入法的反应速度。The Bert model needs to encode the words based on the position vector of the word, the vector of the word, the vector used to distinguish whether the word is in the first sentence or the second sentence, and the vector related to the separator "SEP" and the tag "CLS". , and in the embodiment of the present application, the first character string vector can be obtained by the encoder only according to the two vectors of the first position vector and the second character string vector of the character string; therefore, in the embodiment of the present application The encoder needs to process fewer vectors, and the encoding efficiency is higher, thereby improving the response speed of the input method.
作为一种可实现的方式,编码器是基于转换任务训练得到的,其中,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, where the conversion task is the task of converting a sequence of sample strings into sample words and sentences.
在应用阶段,利用编码器将字符串转换成第一字符串向量,再利用第一字符串向量得到目标词句,由此可见,在应用阶段,编码器的功能与基于转换任务训练的过程中编码器的功能类似;因此,将基于转换任务训练得到的编码器,用于编码字符串序列,能够提高编码器的编码准确率,从而提高输入法的准确度。In the application phase, the encoder is used to convert the string into the first string vector, and then the first string vector is used to obtain the target words and sentences. It can be seen that in the application phase, the function of the encoder is the same as that of encoding in the process of training based on conversion tasks. The function of the encoder is similar; therefore, the encoder trained based on the conversion task is used to encode the string sequence, which can improve the encoding accuracy of the encoder, thereby improving the accuracy of the input method.
作为一种可实现的方式,基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率包括:基于M个第一字符串向量,通过概率模型,获取M个字符串指示的每个候选词语的第一概率,概率模型是基于转换任务训练得到的,其中,概率模型和编码器可以看成一个整体,即一个深度学习模型,而编码器可以看成是这个深度学习模型的前半部分,概率模型可以看成是这个深度学习模型的后半部分;其中,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, based on the M first character string vectors, obtaining the first probability of each candidate word indicated by the M character strings includes: based on the M first character string vectors, through a probability model, obtaining M The first probability of each candidate word indicated by the string. The probability model is trained based on the conversion task. The probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as this The first half of the deep learning model, the probability model can be regarded as the second half of the deep learning model; among them, the conversion task is the task of converting the sequence of sample strings into sample words and sentences.
通过概率模型获取候选词语的第一概率,能够提高第一概率的准确性;并且,与编码器类似,在应用阶段,概率模型的功能与基于转换任务训练的过程中概率模型的功能类似,因此,将基于转换任务训练得到的概率模型,用于计算第一概率,可以提高第一概率的准确性,从而提高输入法的准确度。Obtaining the first probability of candidate words through the probability model can improve the accuracy of the first probability; and, similar to the encoder, in the application phase, the function of the probability model is similar to that of the probability model in the process of training based on the conversion task, so , using the probability model trained based on the conversion task to calculate the first probability, which can improve the accuracy of the first probability, thereby improving the accuracy of the input method.
作为一种可实现的方式,基于第一概率,生成目标词句包括:根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率,其中,对于任意一个候选词语来说,该候选词语的第三概率表示在前面一个或多个候选词语出现的情况下,该候选词语出现的条件概率;基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句,维特比Viterbi算法是一种动态规划算法,用于寻找最有可能产生观测事件序列的维特比路径,该维特比路径也可以称为最优路径,其中,Viterbi算法也可以称为有穷状态转换器(Finite State Transducers,FST)算法。As an achievable way, based on the first probability, generating the target word and sentence includes: according to the character string sequence, through the Ngram model, obtaining the third probability of each candidate word indicated by M character strings, wherein, for any candidate word For example, the third probability of the candidate word represents the conditional probability of the occurrence of the candidate word when one or more candidate words appear in front; based on the first probability, the third probability and the Viterbi algorithm, the target word and sentence is generated, The Viterbi algorithm is a dynamic programming algorithm, which is used to find the Viterbi path that is most likely to produce a sequence of observed events. The Viterbi path can also be called the optimal path, and the Viterbi algorithm can also be called a finite state transition. Transducer (Finite State Transducers, FST) algorithm.
候选词语的第一概率可以理解为候选词语在字符串序列出现的情况下的条件概率,而候选词语的第三概率可以理解为当前候选词语在其他候选词语出现的情况下的条件概率,所以在生成目标词句的过程中,既考虑候选词语的第一概率,又考虑通过Ngram模型计算得到的候选词语的第三概率,有利于生成准确率较高的目标词句。The first probability of a candidate word can be understood as the conditional probability of the candidate word in the presence of a string sequence, and the third probability of the candidate word can be understood as the conditional probability of the current candidate word in the presence of other candidate words, so in In the process of generating target words and sentences, both the first probability of candidate words and the third probability of candidate words calculated by the Ngram model are considered, which is conducive to generating target words and sentences with higher accuracy.
作为一种可实现的方式,基于第一概率,生成目标词句包括:从参考词典中获取参考词语,参考词典可以包括以下至少一种类型的词库:基础词库、短语词库、用户个人词库、 热点词库、各种领域词库,参考词语的数量可以为一个,也可以为多个,参考词语包括P个参考字符串指示的P个候选词语,每个参考字符串指示一个候选词语,P个参考字符串包含于字符串序列中,且在字符串序列中的位置连续,其中,P为大于1的整数;基于P个候选词语各自的第一概率,计算参考词语的第四概率,第四概率表示用户在输入P个参考字符串的情况下,选择参考词语的可能性;计算参考词语的第四概率方法有多种,例如,可以将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率;基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。As an achievable manner, based on the first probability, generating target words and sentences includes: obtaining reference words from a reference dictionary, and the reference dictionary may include at least one of the following types of thesaurus: basic thesaurus, phrase thesaurus, user personal words Library, hotspot thesaurus, various field thesaurus, the quantity of reference word can be one, also can be multiple, and reference word includes P candidate words indicated by P reference character strings, and each reference character string indicates a candidate word , P reference strings are included in the string sequence, and the positions in the string sequence are continuous, where P is an integer greater than 1; based on the first probabilities of the P candidate words, calculate the fourth probability of the reference word , the fourth probability indicates the possibility of the user selecting a reference word when inputting P reference character strings; there are many ways to calculate the fourth probability of a reference word, for example, the geometry of the first probability of P candidate words The average value is used as the fourth probability of the reference word; based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, a target word and sentence is generated.
由于编码器和概率模型的训练和下发往往周期比较长,不能及时反映用户输入趋势的变化、用户输入场景的变化,且难以应对网络出现的新词和热词,而参考词典可以提供多种场景下的词语、新出现的词语或热词等作为参考词语,以协助生成目标词句,从而可以弥补编码器和概率模型的不足,提高目标词句的准确率。Since the training and distribution of encoders and probability models often takes a long period, it cannot reflect changes in user input trends and user input scenarios in a timely manner, and it is difficult to cope with new words and hot words that appear on the Internet. The reference dictionary can provide a variety of Words in the scene, new words or hot words are used as reference words to assist in the generation of target words and sentences, which can make up for the shortcomings of encoders and probability models and improve the accuracy of target words and sentences.
作为一种可实现的方式,基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句包括:通过Ngram模型,获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及参考词语的第五概率;基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。As an achievable way, based on the fourth probability and the first probability of each candidate word indicated by other strings in the string sequence except the P reference strings, generating the target word and sentence includes: through the Ngram model, obtaining the character The fifth probability of each candidate word indicated by other strings except P reference strings in the string sequence, and the fifth probability of the reference word; based on each The first probability, the fourth probability, the fifth probability and the Viterbi algorithm of a candidate word to generate the target word and sentence.
其中,本申请实施例将参考词语中的所有候选词语看成一个整体,这样,就不需要通过Ngram模型计算参考词语内部的候选词语之间的条件概率,仅需通过Ngram模型计算参考词语的第五概率即可;在计算参考词语的第五概率的过程中,可以计算参考词语中排在第一位的候选词语的第五概率,并将排在第一位的候选词语的第五概率作为参考词语的第五概率。Wherein, the embodiment of the present application regards all candidate words in the reference words as a whole, so that it is not necessary to calculate the conditional probability between the candidate words inside the reference words through the Ngram model, and only need to calculate the first position of the reference words through the Ngram model. Five probabilities are enough; in the process of calculating the fifth probability of the reference word, the fifth probability of the first candidate word in the reference word can be calculated, and the fifth probability of the first candidate word can be used as The fifth probability of the reference word.
在该实现方式中,不仅利用了Ngram模型,还利用了参考词典,基于前文对Ngram模型和参考词典的相关说明可知,该实现方式能够融合参考词典和Ngram模型的优点,从而进一步提升目标词句的准确率。In this implementation, not only the Ngram model is used, but also the reference dictionary is used. Based on the previous descriptions of the Ngram model and the reference dictionary, it can be known that this implementation can combine the advantages of the reference dictionary and the Ngram model, thereby further improving the accuracy of the target words and sentences. Accuracy.
作为一种可实现的方式,目标字符串为字符串序列中排在P个参考字符串之后的字符串;目标字符串指示的每个候选词语的第五概率是,在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率,Q为正整数;Q个候选词语包括字符串序列中,排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语,且当Q个字符串包含参考字符串时,Q个候选词语包含参考字符串指示的参考词语中的候选词语。As an achievable way, the target character string is the character string after the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is, among the Q candidate words that appear The conditional probability of the occurrence of the candidate word indicated by the target character string in the case, Q is a positive integer; the Q candidate words include one of each character string indicated by each character string in the Q consecutive character strings before the target character string in the character string sequence candidate words, and when the Q character strings include the reference character string, the Q candidate words include candidate words in the reference words indicated by the reference character string.
作为一种可实现的方式,在基于第一概率,生成目标词句之后,方法还包括:As an achievable manner, after generating the target word and sentence based on the first probability, the method further includes:
将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。Prompting the target word and sentence as the preferred word and sentence, the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
在输入场景中,终端设备会提示多个词句,本申请实施例将目标词句作为首选词句进行提示,从而可以将用户选择的可能性最大的目标词句优先提示给用户,以提高用户的输入效率。In the input scene, the terminal device will prompt multiple words and sentences. In the embodiment of the present application, the target words and sentences are prompted as the preferred words and sentences, so that the target words and sentences with the highest possibility of the user's choice can be preferentially prompted to the user, so as to improve the user's input efficiency.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
基于字符串包括一个或多个拼音,该实现方式为本申请实施例的方法提供了具体的中文应用场景。Based on the fact that the character string includes one or more Pinyin, this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
本申请实施例第二方面提供了一种模型训练方法,包括:获取样本字符串序列,样本字符串序列包括K个样本字符串,每个样本字符串指示一个或多个样本候选词语,其中,K为正整数;根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个样本字符串向量对应一个样本字符串;基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率;基于第二概率,对编码器进行调整。The second aspect of the embodiment of the present application provides a model training method, including: obtaining a sample string sequence, the sample string sequence includes K sample strings, and each sample string indicates one or more sample candidate words, wherein, K is a positive integer; according to the sequence of sample strings, through the encoder, K first sample string vectors are obtained, and each sample string vector corresponds to a sample string; based on the K first sample string vectors, obtain The second probability of each sample candidate word indicated by the K sample character strings; based on the second probability, the encoder is adjusted.
由于第一方面对字符串、编码器、字符串序列以及第一概率等进行了说明,所以可参照第一方面的相关说明,对第二方面中的字符串、编码器、字符串序列以及第二概率进行理解。Since the first aspect describes the character string, the encoder, the character string sequence, and the first probability, etc., the character string, the encoder, the character string sequence, and the second Two probability to understand.
通过编码器对样本字符串序列进行编码,以得到第一样本字符串向量,该第一样本字符串向量是融合了整个样本字符串序列的信息后对样本字符串的表示,而不仅仅表示样本字符串本身,即第一样本字符串向量包含了较多的信息;所以基于第一样本字符串向量计算目标样本词语的第二概率,并基于第二概率对编码器进行调整,能够提高训练出的编码器和概率模型的准确度,从而提高输入法的准确度。The sample string sequence is encoded by the encoder to obtain the first sample string vector, which is a representation of the sample string after fusing the information of the entire sample string sequence, not just Indicates the sample string itself, that is, the first sample string vector contains more information; so the second probability of the target sample word is calculated based on the first sample string vector, and the encoder is adjusted based on the second probability, It can improve the accuracy of the trained encoder and probability model, thereby improving the accuracy of the input method.
作为一种可实现的方式,根据样本字符串序列,通过编码器,得到K个第一样本字符串向量包括:根据样本字符串序列获取K个第二位置向量和K个第二样本字符串向量,每个第二位置向量表示一个样本字符串在样本字符串序列中的位置,每个第二样本字符串向量表示一个样本字符串;根据K个第二位置向量和K个第二样本字符串向量,通过编码器,得到K个第一样本字符串向量。As an achievable way, obtaining K first sample character string vectors through an encoder according to the sample character string sequence includes: obtaining K second position vectors and K second sample character strings according to the sample character string sequence Vector, each second position vector represents the position of a sample character string in the sample character string sequence, and each second sample character string vector represents a sample character string; according to K second position vectors and K second sample characters The string vectors are passed through the encoder to obtain K first sample string vectors.
在本申请实施例中,根据样本字符串的第二位置向量和第二样本字符串向量,即可通过编码器得到第一样本字符串向量;而Bert模型除了需要词语的位置向量、词语的向量外,还需要用于区分词语位于第一个句子还是第二个句子的向量,以及与分割符“SEP”和标记“CLS”相关的向量;因此,本申请实施例中的编码器需要处理的向量更少,编码效率更高,从而可以提高训练效率。In the embodiment of the present application, according to the second position vector of the sample character string and the second sample character string vector, the first sample character string vector can be obtained through the encoder; and the Bert model needs the position vector of the word, the word In addition to the vector, the vector used to distinguish whether the word is in the first sentence or the second sentence, and the vector related to the separator "SEP" and the tag "CLS" are also needed; therefore, the encoder in the embodiment of the application needs to process The number of vectors is less and the encoding efficiency is higher, which can improve the training efficiency.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语,其中,目标样本词语相当于样本标签;相应地,基于第二概率,对编码器进行调整包括:调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable way, the sample candidate words indicated by each sample string contain a target sample word, where the target sample word is equivalent to the sample label; correspondingly, based on the second probability, adjusting the encoder includes: The parameters of the encoder are adjusted so that the second probability of the target sample word increases, and/or so that the second probabilities of other sample candidate words except the target sample word decrease.
例如,样本字符串序列为“nuoyafangzhouhenbang”,对于其中的样本字符串“nuo”来说,对应的样本候选词语包括“诺”、“糯”、“懦”等,令“诺”为目标样本词语,则可以通过调整编码器的参数,使得“诺”的第二概率增加,“糯”和“懦”第二概率降低。For example, the sample string sequence is "nuoyafangzhouhenbang", for the sample string "nuo", the corresponding sample candidate words include "nuo", "waxy", "cowardly", etc., let "nuo" be the target sample word , then by adjusting the parameters of the encoder, the second probability of "nuo" can be increased, and the second probability of "waxy" and "cowardly" can be reduced.
在该实现方式中,预先设定目标样本词语,在训练过程中,通过调整编码器的参数,使得目标样本词语的第二概率增加和/或除目标样本词语外的其他样本候选词语的第二概率降低,进而使得目标样本词语的第二概率大于其他样本候选词语的第二概率,从而实现对编码器的训练。In this implementation, the target sample words are preset, and during the training process, by adjusting the parameters of the encoder, the second probability of the target sample words increases and/or the second probability of other sample candidate words except the target sample words increases. The probability is reduced, so that the second probability of the target sample word is greater than the second probability of other sample candidate words, thereby realizing the training of the encoder.
作为一种可实现的方式,基于K个第一样本字符串向量,获取K个样本字符串指示的 每个样本候选词语的第二概率包括:基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率;相应地,在基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率之后,方法还包括:基于第二概率,对概率模型进行调整。As an achievable manner, based on the K first sample character string vectors, obtaining the second probability of each sample candidate word indicated by the K sample character strings includes: based on the K first sample character string vectors, by The probability model obtains the second probability of each sample candidate word indicated by K sample strings; correspondingly, based on K first sample string vectors, obtains the second probability of each sample candidate word indicated by K sample strings After the second probability, the method further includes: adjusting the probability model based on the second probability.
通过概率模型获取样本候选词语的第二概率,能够提高第二概率的准确性;而基于第二概率对概率模型进行调整,可以提高概率模型输出的第二概率的准确性。Obtaining the second probability of the sample candidate words through the probability model can improve the accuracy of the second probability; and adjusting the probability model based on the second probability can improve the accuracy of the second probability output by the probability model.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语;基于第二概率,对概率模型进行调整包括:调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable way, the sample candidate words indicated by each sample character string contain a target sample word; based on the second probability, adjusting the probability model includes: adjusting the parameters of the probability model so that the second probability of the target sample word The second probability is increased, and/or the second probability of other sample candidate words except the target sample word is decreased.
在该实现方式中,预先设定目标样本词语,在训练过程中,通过调整概率模型的参数,使得目标样本词语的第二概率增加和/或除目标样本词语外的其他样本候选词语的第二概率降低,进而使得目标样本词语的第二概率大于其他样本候选词语的第二概率,从而实现对概率模型的训练。In this implementation, the target sample words are preset, and during the training process, by adjusting the parameters of the probability model, the second probability of the target sample words increases and/or the second probability of other sample candidate words except the target sample words increases. The probability is reduced, so that the second probability of the target sample word is greater than the second probability of other sample candidate words, thereby realizing the training of the probability model.
作为一种可实现的方式,获取样本字符串序列包括:基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an achievable manner, obtaining the sample character string sequence includes: obtaining K sample character strings in the sample character string sequence based on K target sample words.
基于目标样本词语获取样本字符串,可以提高样本字符串的获取效率。Obtaining sample character strings based on target sample words can improve the efficiency of obtaining sample character strings.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
基于字符串包括一个或多个拼音,该实现方式为本申请实施例的方法提供了具体的中文应用场景。Based on the fact that the character string includes one or more Pinyin, this implementation provides a specific Chinese application scenario for the method in the embodiment of the present application.
本申请实施例第三方面提供了一种词句生成装置,包括:第一获取单元,用于获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语,其中,M为正整数;第一编码单元,用于根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;第二获取单元,用于基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率;生成单元,用于基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。The third aspect of the embodiment of the present application provides a word and sentence generation device, including: a first acquisition unit, configured to acquire a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, Wherein, M is a positive integer; the first encoding unit is used to obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings; The second acquisition unit is used to obtain the first probability of each candidate word indicated by M character strings based on M first character string vectors; the generation unit is used to generate target words and sentences based on the first probability, and the target words and sentences include M target words, each target word is one of one or more candidate words indicated by each character string.
As an implementable manner, the first encoding unit is configured to obtain M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string; and to obtain a plurality of first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
作为一种可实现的方式,编码器是基于转换任务训练得到的,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
As an implementable manner, the second acquisition unit is configured to obtain, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, the probability model being trained on a conversion task, i.e. the task of converting a sample character string sequence into sample words and sentences.
作为一种可实现的方式,生成单元,用于根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;基于第一概率,第三概率以及维特比Viterbi 算法,生成目标词句。As an achievable way, the generation unit is used to obtain the third probability of each candidate word indicated by the M strings through the Ngram model according to the string sequence; based on the first probability, the third probability and Viterbi Algorithm to generate target words and sentences.
As an implementable manner, the generation unit is configured to obtain a reference word from a reference dictionary, the reference word including P candidate words indicated by P reference character strings, each reference character string indicating one candidate word, the P reference character strings being contained in the character string sequence at consecutive positions, where P is an integer greater than 1; to calculate a fourth probability of the reference word based on the respective first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
作为一种可实现的方式,生成单元,用于通过Ngram模型,获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及参考词语的第五概率;基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。As an achievable way, the generation unit is used to obtain the fifth probability of each candidate word indicated by other strings in the string sequence except P reference strings, and the fifth probability of the reference word through the Ngram model ; Based on the first probability, the fourth probability, the fifth probability and the Viterbi algorithm of each candidate word indicated by other character strings except P reference character strings in the character string sequence, generate target words and sentences.
As an implementable manner, the target character string is the character string that follows the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string appears given that Q candidate words appear, where Q is a positive integer; the Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings contain a reference character string, the Q candidate words include the candidate words of the reference word indicated by that reference character string.
作为一种可实现的方式,该装置还包括提示单元,用于将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。As an achievable manner, the device further includes a prompting unit, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例第一方面的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the first aspect of the embodiment of the present application.
The fourth aspect of the embodiments of the present application provides a model training apparatus, including: a third acquisition unit, configured to acquire a sample character string sequence, the sample character string sequence including K sample character strings, each sample character string indicating one or more sample candidate words, where K is a positive integer; a second encoding unit, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, each first sample character string vector corresponding to one sample character string; a fourth acquisition unit, configured to obtain, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit, configured to adjust the encoder based on the second probability.
As an implementable manner, the second encoding unit is configured to obtain K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string; and to obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语;调整单元,用于调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word; the adjustment unit is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases, and/or In order to reduce the second probability of other sample candidate words except the target sample word.
作为一种可实现的方式,第四获取单元,用于基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率;调整单元,还用于基于第二概率,对概率模型进行调整。As an achievable way, the fourth acquisition unit is used to obtain the second probability of each sample candidate word indicated by the K sample strings through a probability model based on the K first sample string vectors; the adjustment unit , is also used to adjust the probability model based on the second probability.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词 语;调整单元,用于调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word; the adjustment unit is configured to adjust the parameters of the probability model, so that the second probability of the target sample word increases, and/or In order to reduce the second probability of other sample candidate words except the target sample word.
作为一种可实现的方式,第三获取单元,用于基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an implementable manner, the third acquiring unit is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例第二方面的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the second aspect of the embodiment of the present application.
The fifth aspect of the embodiments of the present application provides a computer device, including one or more processors and a memory, where the memory stores computer-readable instructions; the one or more processors read the computer-readable instructions so that the computer device implements the method of any implementation manner of the first aspect.
The sixth aspect of the embodiments of the present application provides a training device, including one or more processors and a memory, where the memory stores computer-readable instructions; the one or more processors read the computer-readable instructions so that the training device implements the method of any implementation manner of the second aspect.
The seventh aspect of the embodiments of the present application provides a computer-readable storage medium, including computer-readable instructions which, when run on a computer, cause the computer to execute the method of any implementation manner of the first aspect or the second aspect.
本申请实施例第八方面提供了一种芯片,包括一个或多个处理器。处理器中的部分或全部用于读取并执行存储器中存储的计算机程序,以执行上述第一方面或第二方面任意可能的实现方式中的方法。The eighth aspect of the embodiment of the present application provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the method in any possible implementation manner of the first aspect or the second aspect above.
Optionally, the chip includes a memory, and the memory is connected to the processor through a circuit or wires. Further optionally, the chip also includes a communication interface to which the processor is connected. The communication interface is used to receive data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes them, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
在一些实现方式中,一个或多个处理器中还可以有部分处理器是通过专用硬件的方式来实现以上方法中的部分步骤,例如涉及神经网络模型的处理可以由专用神经网络处理器或图形处理器来实现。In some implementations, some of the one or more processors can also implement some steps in the above method through dedicated hardware. For example, the processing related to the neural network model can be performed by a dedicated neural network processor or graphics processor to achieve.
本申请实施例提供的方法可以由一个芯片实现,也可以由多个芯片协同实现。The method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
本申请实施例第九方面提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现上述第一方面或第二方面中任意一种实现方式的方法。The ninth aspect of the embodiments of the present application provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement any one of the above first or second aspects. Methods.
附图说明Description of drawings
图1为本申请实施例的应用场景示意图;FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present application;
图2为本申请实施例中词序列的示意图;Fig. 2 is the schematic diagram of word sequence in the embodiment of the present application;
图3为预训练语言模型的示意图;Fig. 3 is the schematic diagram of pre-training language model;
图4为本申请实施例的系统架构示意图;FIG. 4 is a schematic diagram of the system architecture of the embodiment of the present application;
图5为本申请实施例提供的模型训练方法的一个实施例的示意图;FIG. 5 is a schematic diagram of an embodiment of the model training method provided by the embodiment of the present application;
图6为本申请实施例中的编码器和Bert模型的原始输入的对比示意图;Fig. 6 is a comparative schematic diagram of the original input of the encoder and the Bert model in the embodiment of the present application;
图7为本申请实施例中的编码器和Bert模型的直接输入的对比示意图;Fig. 7 is a comparative schematic diagram of the direct input of the encoder and the Bert model in the embodiment of the present application;
图8为本申请实施例提供的词句生成方法的一个实施例的示意图;FIG. 8 is a schematic diagram of an embodiment of a method for generating words and sentences provided by an embodiment of the present application;
图9为本申请实施例中候选词语的实施例示意图;FIG. 9 is a schematic diagram of an embodiment of candidate words in the embodiment of the present application;
图10为本申请实施例中第一概率和第三概率结合的示意图;Fig. 10 is a schematic diagram of the combination of the first probability and the third probability in the embodiment of the present application;
图11为本申请实施例中生成目标词句的实施例示意图;FIG. 11 is a schematic diagram of an embodiment of generating target words and sentences in the embodiment of the present application;
图12为本申请实施例参考词典的使用示意图;FIG. 12 is a schematic diagram of the use of the reference dictionary in the embodiment of the present application;
图13为本申请实施例提供的一种词句生成装置的结构示意图;FIG. 13 is a schematic structural diagram of a device for generating words and sentences provided by an embodiment of the present application;
图14为本申请实施例提供的一种模型训练装置的结构示意图;FIG. 14 is a schematic structural diagram of a model training device provided in an embodiment of the present application;
图15为本申请实施例提供的计算机设备的一种结构示意图。FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
具体实施方式detailed description
本申请实施例提供了一种词句生成方法、模型训练方法及相关设备,该方法能够提升生成的词句的准确率,进而提升输入法的准确率以及用户的体验。Embodiments of the present application provide a method for generating words and sentences, a method for training models, and related equipment. The method can improve the accuracy of generated words and sentences, thereby improving the accuracy of input methods and user experience.
本申请实施例可以应用于图1所示的输入场景中。The embodiment of the present application can be applied to the input scenario shown in FIG. 1 .
In this input scenario, the user can input a character string on a terminal device; correspondingly, the input method editor (IME) deployed inside the terminal device receives the character string input by the user, generates corresponding words and sentences according to the character string, and then prompts the words and sentences to the user.
A character string can be understood as a combination of characters and is a carrier of language information used to generate words and sentences; the words and sentences may be one word or multiple words, and a single character can also be a word.
The above input scenario may be an input scenario in any of multiple languages such as Chinese and Japanese; for different languages, the character string takes different forms. Taking Chinese as an example, a character string may include one pinyin or multiple pinyins. Specifically, as shown in FIG. 1, when the character string "nuoyafangzhou" is input, the words and sentences prompted by the input method editor are 诺亚方舟很棒 ("Noah's Ark is great"), 诺亚方舟 and 诺亚.
在本申请实施例中,终端设备可以为台式电脑、笔记本电脑、平板电脑、智能手机、智能电视,除此之外,终端设备还可以为车载电脑等其他任意可以部署输入法编辑器的设备。In this embodiment of the application, the terminal device may be a desktop computer, a notebook computer, a tablet computer, a smart phone, or a smart TV. In addition, the terminal device may also be any other device that can deploy an input method editor such as a vehicle computer.
It can be understood that, in the example shown in FIG. 1, the prompted words and sentences include 诺亚方舟很棒 ("Noah's Ark is great"). It can be seen that the prompted words and sentences are fairly accurate, which clearly improves the user's input efficiency and user experience.
然而,随着移动互联网的发展,一方面,用户所采用的语言越来越丰富,网络新词层出不穷;另一方面,输入法应用的场景也越来越广泛、越来越多样化。因此,输入法编辑器提示词句的难度大大增加。However, with the development of the mobile Internet, on the one hand, the language used by users is becoming more and more abundant, and new words on the Internet emerge in an endless stream; on the other hand, the application scenarios of input methods are becoming more and more extensive and diverse. Therefore, the difficulty of prompting words and sentences of the input method editor is greatly increased.
为了能够准确地向用户提示词句,本申请实施例提供了一种词句生成方法,该方法是利用编码器将用户输入的字符串(例如拼音)编码为字符串向量,然后基于字符串向量生成目标词句,以提高生成的词句的准确性。In order to be able to accurately prompt words and sentences to the user, an embodiment of the present application provides a method for generating words and sentences, which uses an encoder to encode a character string (such as pinyin) input by the user into a character string vector, and then generates a target based on the character string vector phrases to improve the accuracy of the generated phrases.
为了便于理解,下面先对本申请实施例提及的专业术语进行解释。For ease of understanding, the technical terms mentioned in the embodiments of the present application are firstly explained below.
输入法首选词:当用户输入字符串的时候,输入法编辑器会提供给用户一个候选列表, 该候选列表用于向用户提示词句,排在候选列表第一位的被称为输入法的首选词。Input method preferred word: When the user enters a character string, the input method editor will provide the user with a candidate list, which is used to prompt the user for words and sentences, and the first in the candidate list is called the preferred input method word.
Transformer网络结构:一种深度神经网络结构,包含输入层、self-attention层、Feed-forward层、归一化层等子结构。Transformer network structure: a deep neural network structure, including input layer, self-attention layer, feed-forward layer, normalization layer and other substructures.
Bert模型:具有Transformer网络结构的一种模型,并且,在Transformer网络结构的基础上提出了“预训练+微调”的学习范式,设计了Masked Language Model和Next Sentence Prediction两个预训练任务。Bert model: A model with a Transformer network structure, and on the basis of the Transformer network structure, a "pre-training + fine-tuning" learning paradigm is proposed, and two pre-training tasks, Masked Language Model and Next Sentence Prediction, are designed.
Ngram模型:一种被广泛应用在汉语输入法任务中的模型。Ngram model: A model widely used in Chinese input method tasks.
Zero probability problem: during the use of the Ngram model, in some cases the probability value is computed as zero, and zero-valued probabilities cause many problems in engineering implementation; for example, because of zero probabilities, the probabilities cannot be compared in magnitude and a result can only be returned at random.
Smoothing algorithm: an algorithm designed to solve the zero probability problem of the Ngram model. When a zero-probability risk is detected, the smoothing algorithm usually uses the stable but less accurate low-order Ngram probability to approximate, in some way, the unstable but accurate high-order Ngram probability.
Viterbi algorithm: a dynamic programming algorithm for finding the Viterbi path, i.e. the hidden state sequence, that is most likely to produce the observed event sequence, especially in the context of Markov information sources and hidden Markov models; it is now commonly used in speech recognition, keyword recognition, computational linguistics and bioinformatics. The Viterbi algorithm may also be called the finite state transducer (Finite State Transducers, FST) algorithm.
下面分别对Ngram模型进行具体介绍。The Ngram model is introduced in detail below.
For a language sequence (for example, a sentence is a word sequence), the probability of the sequence P(w_1, w_2, ..., w_n) can be decomposed into a product of conditional probabilities:
P(w_1, w_2, ..., w_n) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × ... × P(w_n | w_1, ..., w_{n-1}),
where w_1, w_2, ..., w_n are the words in the sequence and P denotes probability.
However, it is difficult to obtain the value of the probability P(w_n | w_1, ..., w_{n-1}) accurately by statistical methods. Therefore, the Ngram model makes the Markov assumption that the probability of the current word is related only to a limited number of N words. When N takes different values, a series of specific Ngram models are obtained. For example, when N = 2, the probability of the current word is related only to the previous word, and the value of P(w_n | w_1, ..., w_{n-1}) degenerates to the value of P(w_n | w_{n-1}), that is,
P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1}), so that P(w_1, w_2, ..., w_n) ≈ P(w_1) × P(w_2 | w_1) × ... × P(w_n | w_{n-1}).
The Ngram model in this case is called a Bigram model; similarly, when N = 3 the Ngram model is called a Trigram model, and when N = 4 it is called a Fourgram model.
在使用过程中,Ngram模型存在一个问题。在应用场景中,某些词语的组合并没有在训练集合中出现,此时Ngram对这些词语组合估计出的概率值为0,在工程上会引发一系列问题。为了避免这种0概率的情况出现,产生了各种平滑算法。There is a problem with the Ngram model during use. In the application scenario, some word combinations do not appear in the training set. At this time, the probability value estimated by Ngram for these word combinations is 0, which will cause a series of problems in engineering. In order to avoid this zero probability situation, various smoothing algorithms have been developed.
平滑算法可以简单理解为,当Ngram模型的概率是0的时候,将一定的权重与(N-1)gram模型的概率的乘积作为(N)gram模型的概率。The smoothing algorithm can be simply understood as, when the probability of the Ngram model is 0, the product of a certain weight and the probability of the (N-1)gram model is used as the probability of the (N)gram model.
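The backoff idea described above can be pictured with the short Python sketch below; the toy corpus and the backoff weight alpha are purely illustrative assumptions, not part of the embodiments.

```python
from collections import Counter

# Toy segmented corpus (an illustrative assumption).
corpus = [["华为", "发布", "手机"], ["华为", "发布", "平板"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total = sum(unigram.values())

def p_unigram(w):
    # Low-order (unigram) probability: stable but less accurate.
    return unigram[w] / total if total else 0.0

def p_bigram_smoothed(prev, w, alpha=0.4):
    # High-order (bigram) probability with a simple backoff: when the bigram
    # count is zero, fall back to alpha * P(w) so the estimate is not zero.
    if bigram[(prev, w)] > 0 and unigram[prev] > 0:
        return bigram[(prev, w)] / unigram[prev]
    return alpha * p_unigram(w)

print(p_bigram_smoothed("华为", "发布"))  # seen bigram   -> 1.0
print(p_bigram_smoothed("手机", "华为"))  # unseen bigram -> 0.4 * 2/6 ≈ 0.133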
下面以具体的示例对Ngram模型进行说明。The Ngram model is described below with a specific example.
Specifically, suppose the word sequence is 诺亚的技术强 ("Noah's technology is strong"). The probability of the word sequence can be decomposed into a product of conditional probabilities, i.e. P(诺,亚,的,技,术,强) = P(诺) × P(亚|诺) × P(的|诺,亚) × P(技|诺,亚,的) × P(术|诺,亚,的,技) × P(强|诺,亚,的,技,术). With the N = 2 gram model, P(诺,亚,的,技,术,强) = P(诺|B) × P(亚|诺) × P(的|亚) × P(技|的) × P(术|技) × P(强|术); with the N = 3 gram model, P(诺,亚,的,技,术,强) = P(诺|A,B) × P(亚|B,诺) × P(的|诺,亚) × P(技|亚,的) × P(术|的,技) × P(强|技,术).
It should be noted that, because there is no other word before 诺, when N = 2 one placeholder word (denoted B in the example above) is automatically added as context during the Ngram computation; similarly, when N = 3, two placeholder words (denoted A and B above) are automatically added as context.
下面对Viterbi算法进行说明。The Viterbi algorithm is described below.
以拼音输入法为例,如图2所示,最下面一行表示拼音节点,上面四行的节点是与拼音节点对应的汉字,这些汉字组成了用户输入的各种可能性。利用Ngram模型可以计算各个汉字节点的概率,由于汉字节点的概率实际是在前面N个汉字节点出现的情况下的条件概率,因此该概率也可以看成是汉字节点之间的路径转移概率。Taking the pinyin input method as an example, as shown in Figure 2, the bottom line represents pinyin nodes, and the upper four lines of nodes are Chinese characters corresponding to pinyin nodes. These Chinese characters constitute various possibilities for user input. The probability of each Chinese character node can be calculated by using the Ngram model. Since the probability of the Chinese character node is actually the conditional probability of the occurrence of the previous N Chinese character nodes, this probability can also be regarded as the path transition probability between Chinese character nodes.
For example, when N = 2, the Ngram model can be used to compute the probabilities P(亚|诺), P(亚|懦), P(亚|糯) and P(亚|挪); these probabilities can also be called the path transition probability from 诺 to 亚, from 懦 to 亚, from 糯 to 亚, and from 挪 to 亚, respectively.
For each of the six pinyins "nuo", "ya", "de", "ji", "shu" and "qiang", there are four choices of Chinese characters, so the number of character combinations is 4×4×4×4×4×4. Using the Viterbi algorithm and the path transition probabilities between characters, the node path with the largest probability can be found; this node path can also be called the optimal path, which in FIG. 2 is 诺亚的技术强.
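For illustration, the following is a minimal sketch of a Viterbi search over such a pinyin lattice. The candidate lists, the hand-made transition table and the start marker "<s>" are assumptions standing in for the Ngram path transition probabilities, not data from the embodiments.

```python
import math

def viterbi(candidates_per_pinyin, transition_prob):
    """candidates_per_pinyin: one list of candidate characters per pinyin.
    transition_prob(prev, cur): path transition probability P(cur | prev),
    e.g. a (smoothed) bigram probability; "<s>" marks the sentence start."""
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {"<s>": (0.0, [])}
    for candidates in candidates_per_pinyin:
        new_best = {}
        for cur in candidates:
            score, path = max(
                ((prev_score + math.log(transition_prob(prev, cur) + 1e-12),
                  prev_path + [cur])
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy example: two pinyins with two candidates each and a hand-made bigram table.
table = {("<s>", "诺"): 0.6, ("<s>", "挪"): 0.4, ("诺", "亚"): 0.7, ("挪", "亚"): 0.2}
print(viterbi([["诺", "挪"], ["亚", "压"]],
              lambda p, c: table.get((p, c), 0.01)))  # -> ['诺', '亚']
```

The search keeps, for every candidate character of the current pinyin, only the best-scoring path ending in that character, which is what makes the lattice search tractable compared with enumerating all 4×4×...×4 combinations.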
下面对预训练语言模型(pertrained language model,PLM)和Bert模型进行说明。The following describes the pretrained language model (pertrained language model, PLM) and Bert model.
The pre-trained language model (pretrained language model, PLM) is an important general-purpose model that has emerged in recent years in the field of natural language processing (NLP), where NLP is the technology that enables computers to understand and process human natural language and is an important technical means for realizing artificial intelligence (AI).
如图3所示,预训练语言模型主要包含三个方面:网络结构,学习范式和(预)训练任务。As shown in Figure 3, the pre-trained language model mainly includes three aspects: network structure, learning paradigm and (pre-)training tasks.
预训练语言模型的网络结构采用了Transformer网络的编码器encoder部分的网络结构,编码器encoder部分包含输入层、self-attention层、Feed-forward层、归一化层。The network structure of the pre-trained language model adopts the network structure of the encoder part of the Transformer network. The encoder part includes an input layer, a self-attention layer, a feed-forward layer, and a normalization layer.
预训练语言模型的种类有很多,其中具有代表性的属于Bert模型。There are many types of pre-trained language models, among which the representative one belongs to the Bert model.
The Bert model builds on the encoder part and adopts the "pre-training + fine-tuning" learning paradigm, i.e. a basic model is first learned with pre-training tasks on a large amount of unlabeled corpus and is then fine-tuned on the target task to obtain the Bert model, where the pre-training tasks mainly refer to the Masked Language Model task and the Next Sentence Prediction task.
下面对本申请实施例的系统架构进行介绍。The system architecture of the embodiment of the present application is introduced below.
如图4所示,本申请实施例的系统架构包括训练阶段和应用阶段,下面以汉语为例对此进行说明。As shown in FIG. 4 , the system architecture of the embodiment of the present application includes a training phase and an application phase, which will be described below using Chinese as an example.
In the training phase, the Chinese character corpus is passed through a tokenizer to obtain a segmented corpus. Next, the Ngram model is trained on the segmented corpus. At the same time, the segmented corpus is converted from Chinese characters into pinyin by a character-to-pinyin converter to obtain a pinyin corpus. Then, the encoder, which is used to encode pinyin into vectors, is trained on the pinyin corpus; since the encoder also uses the encoder part of the Transformer network, which makes it similar to the existing Bert model, and it is used to encode pinyin, it can also be called a pinyin Bert model.
In the application phase, the pinyin Bert model is combined with the Ngram model and with various external resource libraries, such as a basic lexicon, a phrase lexicon, a user lexicon and various domain lexicons (FIG. 4 shows domain words 1, domain words 2 and domain words 3), to obtain an input engine, which is used to prompt corresponding words and sentences in response to the pinyin input by the user.
下面结合图5,先从训练阶段对本申请实施例提供的模型训练方法进行介绍。Referring to FIG. 5 , the model training method provided by the embodiment of the present application will be introduced from the training stage first.
具体地,本申请实施例提供了一种模型训练方法的一个实施例,该实施例可以应用于中文、日文、韩文等多种语言,由于模型训练的过程需要较大的运算量,因此,该实施例通常由服务器执行。Specifically, the embodiment of the present application provides an embodiment of a model training method, which can be applied to multiple languages such as Chinese, Japanese, and Korean. Since the process of model training requires a large amount of computation, this Embodiments are typically performed by a server.
如图5所示,该实施例包括:As shown in Figure 5, this embodiment includes:
步骤101,获取样本字符串序列。 Step 101, acquire a sample character string sequence.
样本字符串序列包括K个样本字符串,其中,K为正整数。The sample character string sequence includes K sample character strings, where K is a positive integer.
In the embodiments of the present application, a character string can be understood as a combination of characters and is a carrier of language information used to generate words and sentences; the words and sentences may be one word or multiple words, and a single character can also be a word.
The above input scenario may be an input scenario in any of multiple languages such as Chinese and Japanese; for different languages, the character string takes different forms. Taking Chinese as an example, a character string may include one pinyin or multiple pinyins, in which case the character string may also be called a pinyin string; for example, the character string may be "nuoyafangzhou".
样本字符串是指作为样本且用于训练的字符串。A sample character string refers to a character string used as a sample and used for training.
每个样本字符串指示一个或多个样本候选词语,该样本候选词语可以是一个字,也可以是多个字。Each sample character string indicates one or more sample candidate words, and the sample candidate words may be one character or multiple characters.
For example, when the sample character string is "nuo", the corresponding sample candidate words may be 诺, 糯, 懦, etc.; when the sample character string is "ya", the corresponding sample candidate words may be 亚, 压, 呀, etc.
获取样本字符串序列的方法有多种,本申请实施例对此不做具体限定。There are many methods for obtaining the sample character string sequence, which are not specifically limited in this embodiment of the present application.
示例性地,步骤101包括:基于K个目标样本词语获取样本字符串序列中的K个样本字符串。Exemplarily, step 101 includes: acquiring K sample character strings in the sample character string sequence based on the K target sample words.
例如,如图4所示,当样本字符串为拼音时,则可以通过字音转换器将目标样本词语从汉字转成拼音,以得到样本字符串。For example, as shown in FIG. 4 , when the sample character string is pinyin, the target sample word can be converted from Chinese characters to pinyin by a phonetic converter to obtain the sample character string.
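As a rough illustration of this step, the sketch below derives sample character strings from target sample words with a toy character-to-pinyin table; in practice a complete converter (or a library such as pypinyin) would be used, and the table here is purely an assumption.

```python
# Toy character-to-pinyin table; a real system would use a complete converter.
HANZI_TO_PINYIN = {"诺": "nuo", "亚": "ya", "方": "fang", "舟": "zhou"}

def words_to_sample_strings(target_sample_words):
    """Turn K target sample words into K sample character strings (pinyin)."""
    return ["".join(HANZI_TO_PINYIN[ch] for ch in word) for word in target_sample_words]

# Each target sample word yields one sample character string of one or more pinyins.
print(words_to_sample_strings(["诺亚", "方舟"]))  # -> ['nuoya', 'fangzhou']
```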
步骤102,根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个第一样本字符串向量对应一个样本字符串。Step 102: Obtain K first sample character string vectors through an encoder according to the sample character string sequence, and each first sample character string vector corresponds to a sample character string.
编码器可以理解为一个深度学习网络模型,编码器的网络结构有多种,本申请实施例对此不做具体限定;具体地,编码器的网络结构可以采用Transformer网络的编码器部分的网络结构,或采用由Transformer网络的编码器部分得到的一系列其他网络的网络结构。The encoder can be understood as a deep learning network model, and there are various network structures of the encoder, which are not specifically limited in the embodiment of the present application; specifically, the network structure of the encoder can adopt the network structure of the encoder part of the Transformer network , or adopt the network structure of a series of other networks obtained from the encoder part of the Transformer network.
Although the network structure of the encoder in the embodiments of the present application is similar to that of the Bert model and also adopts the network structure of the encoder part of the Transformer network, the two are in fact quite different; several comparisons are made below to explain the differences between the encoder in the embodiments of the present application and the Bert model.
For example, take the case where the sample character string is a pinyin string. As shown in FIG. 6, the model on the left is the Bert model, whose original input is two Chinese sentences, 诺亚方舟 ("Noah's Ark") and 很棒 ("great"), separated by the separator "SEP"; in addition, the original input also includes the token "CLS" used for text classification. The model on the right is the encoder in the embodiments of the present application, whose original input is no longer two Chinese sentences but the sample character string sequence "nuo ya fang zhou hen bang"; the separator "SEP" is not needed, and since the encoder does not need to classify text, the original input of the encoder does not need the token "CLS" either.
作为一种实现方式,步骤102包括:As an implementation, step 102 includes:
Obtain K second position vectors and K second sample character string vectors according to the sample character string sequence; obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
Each second position vector represents the position of a sample character string in the sample character string sequence. Taking the sample character string sequence "nuo ya fang zhou hen bang" as an example, the second position vector corresponding to the sample character string "fang" represents the position of "fang" in the sample character string sequence "nuo ya fang zhou hen bang".
每个第二样本字符串向量表示一个样本字符串,其中,第二样本字符串向量既可以通过随机初始化得到,也可以利用Word2Vector等算法进行预训练得到。Each second sample character string vector represents a sample character string, wherein the second sample character string vector can be obtained through random initialization, or can be obtained through pre-training using an algorithm such as Word2Vector.
It should be noted that the second sample character string vector is different from the first sample character string vector. The second sample character string vector is generated from a single sample character string only, so it contains only the information of that sample character string itself; the first sample character string vector is generated by the encoder, which combines the information of multiple sample character strings in the process, so the first sample character string vector contains not only the information of one sample character string itself but also the information of the other sample character strings.
下面以图6所示的样本字符串为拼音串为例,并结合图7,说明本申请实施例中的编码器与Bert模型的不同。Taking the sample character string shown in FIG. 6 as a pinyin string as an example, and referring to FIG. 7 , the difference between the encoder in the embodiment of the present application and the Bert model will be described below.
Specifically, as shown in FIG. 7, the left side of FIG. 7 shows the direct input of the Bert model (i.e. the input obtained by converting the original input), which contains three embedding layers; corresponding to the original input shown in FIG. 6, these three embedding layers are, from bottom to top, the position embedding layer, the segment embedding layer and the token embedding layer. The position embedding is used to distinguish the different positions of a token in the sequence; the segment embedding is used to distinguish whether a token belongs to the first input Chinese sentence (诺亚方舟) or to the second Chinese sentence (很棒), in preparation for the Next Sentence Prediction task; the token embedding represents the semantics of the token.
In the Bert model, a token is a Chinese character of the Chinese sentences; for example, a token can be the character 诺; a token can also be "SEP" or "CLS".
The right side of FIG. 7 shows the direct input of the encoder in the embodiments of the present application, which includes the position embedding layer and the token embedding layer but no segment embedding layer; the position embedding is used to distinguish the different positions of a token in the sequence, and the token embedding represents the semantics of the token.
在本申请实施例中的编码器中,token是一个拼音或多个拼音,例如,token可以是“nuo”,也可以是“ya”。In the encoder in the embodiment of the present application, the token is a pinyin or multiple pinyins, for example, the token can be "nuo" or "ya".
当token是“nuo”时,位置嵌入position embedding层中的E0则表示“nuo”的位 置向量,标记嵌入token embedding层中的Enuo则表示“nuo”的字符向量。When the token is "nuo", E0 in the position embedding layer represents the position vector of "nuo", and Enuo in the token embedding layer represents the character vector of "nuo".
除此之外,从图7中可以看出,本申请实施例中的编码器各直接输入的长度小于Bert模型各直接输入的长度。In addition, it can be seen from FIG. 7 that the length of each direct input of the encoder in the embodiment of the present application is smaller than the length of each direct input of the Bert model.
It should be noted that the ultimate goal of the Bert model is to perform various document- or sentence-level tasks, such as text classification, reading comprehension and question answering, so the length of the original input of the Bert model has to cover most documents or sentences and is usually set to 512 tokens; correspondingly, the length of the direct input of the Bert model is also 512 tokens (FIG. 7 shows only 9 tokens). The ultimate goal of the encoder in the embodiments of the present application, in contrast, is to serve an input method, i.e. to receive the user's input on a terminal device; in general the user's input is relatively short, so the length of the original input of the encoder does not need to be very long and is usually set to 16 or 32 tokens (FIG. 7 shows only 6 tokens); correspondingly, the length of the direct input of the encoder is also 16 or 32 tokens.
Because the direct input of the encoder is short, fewer parameters are fed into the encoder; moreover, taking pinyin character strings as an example, the total number of pinyins is much smaller than the total number of Chinese characters, so the total number of tokens the encoder needs to handle is small. This reduces the workload of the training process and improves training efficiency.
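The direct input described above can be pictured with the following PyTorch sketch, where the pinyin vocabulary, the maximum length of 32 tokens and the embedding dimension are assumptions used only to make the shapes concrete.

```python
import torch
import torch.nn as nn

PINYIN_VOCAB = ["[PAD]", "nuo", "ya", "fang", "zhou", "hen", "bang"]  # assumed vocabulary
MAX_LEN, D_MODEL = 32, 128                                            # assumed sizes

token_embedding = nn.Embedding(len(PINYIN_VOCAB), D_MODEL)   # one vector per pinyin token
position_embedding = nn.Embedding(MAX_LEN, D_MODEL)          # one vector per position

def direct_input(pinyin_tokens):
    # Direct input = token embedding + position embedding (no segment embedding).
    ids = torch.tensor([[PINYIN_VOCAB.index(t) for t in pinyin_tokens]])
    positions = torch.arange(ids.size(1)).unsqueeze(0)
    return token_embedding(ids) + position_embedding(positions)

x = direct_input(["nuo", "ya", "fang", "zhou", "hen", "bang"])
print(x.shape)  # torch.Size([1, 6, 128])
```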
步骤103,基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率。 Step 103, based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings.
其中,样本候选词语的第二概率表示,根据第一样本字符串向量得到样本候选词语的概率。Wherein, the second probability of the sample candidate word represents the probability of obtaining the sample candidate word according to the first sample character string vector.
计算第二概率的方法有多种,本申请实施例对此不做具体限定。There are multiple methods for calculating the second probability, which are not specifically limited in this embodiment of the present application.
作为一种可实现的方式,步骤103还可以包括:As an implementable manner, step 103 may also include:
基于K个第一样本字符串向量,通过概率模型,获取K个样本字符串指示的每个样本候选词语的第二概率。Based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model.
具体地,可以将K个第一样本字符串向量输入概率模型,概率模型则会输出该第二概率。Specifically, the K first sample character string vectors may be input into the probability model, and the probability model will output the second probability.
此时,概率模型和编码器可以看成一个整体,即一个深度学习模型,而编码器可以看成是这个深度学习模型的前半部分,概率模型可以看成是这个深度学习模型的后半部分。At this time, the probability model and the encoder can be regarded as a whole, that is, a deep learning model, and the encoder can be regarded as the first half of the deep learning model, and the probability model can be regarded as the second half of the deep learning model.
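Viewed this way, and building on the input sketch above, a minimal PyTorch sketch of the combined model could look as follows; the Transformer hyper-parameters and the candidate-word vocabulary size are assumptions, and the linear-plus-softmax head plays the role of the probability model.

```python
import torch
import torch.nn as nn

class PinyinToWordModel(nn.Module):
    def __init__(self, pinyin_vocab=500, word_vocab=10000, d_model=128, n_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(pinyin_vocab, d_model)
        self.position_embedding = nn.Embedding(64, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # "first half"
        self.prob_head = nn.Linear(d_model, word_vocab)                    # "second half"

    def forward(self, pinyin_ids):                        # (batch, seq_len) of pinyin ids
        pos = torch.arange(pinyin_ids.size(1), device=pinyin_ids.device).unsqueeze(0)
        x = self.token_embedding(pinyin_ids) + self.position_embedding(pos)
        h = self.encoder(x)                               # first (sample) string vectors
        return self.prob_head(h)                          # logits; softmax gives the probabilities

model = PinyinToWordModel()
logits = model(torch.randint(0, 500, (1, 6)))
print(logits.softmax(dim=-1).shape)  # per-position candidate probabilities: (1, 6, 10000)
```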
步骤104,基于第二概率,对编码器进行调整。 Step 104, adjust the encoder based on the second probability.
需要说明的是,基于第二概率对编码器进行调整的方法有多种,本申请实施例对此不做具体限定。It should be noted that there are many methods for adjusting the encoder based on the second probability, which are not specifically limited in this embodiment of the present application.
作为一种可实现的方式,每个样本字符串指示的样本候选词语中包含一个目标样本词语,相应地,步骤104包括:调整编码器的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an achievable manner, the sample candidate words indicated by each sample string contain a target sample word, and accordingly, step 104 includes: adjusting the parameters of the encoder so that the second probability of the target sample word increases, and /or to reduce the second probabilities of other sample candidate words except the target sample word.
For example, the sample character string sequence is "nuoyafangzhouhenbang". For the sample character string "nuo" in it, the corresponding sample candidate words include 诺, 糯, 懦, etc.; letting 诺 be the target sample word, the parameters of the encoder can be adjusted so that the second probability of 诺 increases while the second probabilities of 糯 and 懦 decrease.
In this embodiment, the target sample word acts as the sample label. By adjusting the parameters of the encoder, the second probability of the target sample word is increased as much as possible while the second probabilities of the other sample candidate words are decreased as much as possible; ideally, after adjusting the parameters of the encoder, the second probability of the target sample word becomes greater than the second probabilities of the other sample candidate words.
步骤105,基于第二概率,对概率模型进行调整。 Step 105, adjust the probability model based on the second probability.
示例性地,步骤105包括:调整概率模型的参数,以使得目标样本词语的第二概率增加,和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。Exemplarily, step 105 includes: adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
概率模型的参数的调整过程与编码器的参数的调整过程类似,具体参照步骤104的相关说明进行理解。The process of adjusting the parameters of the probability model is similar to the process of adjusting the parameters of the encoder. For details, refer to the related description of step 104 for understanding.
需要说明的是,步骤105是可选的,具体地,在通过概率模型实现步骤103的情况下执行步骤105。It should be noted that step 105 is optional, specifically, step 105 is performed when step 103 is realized by a probability model.
In addition, in the training phase, step 102 to step 105 are executed repeatedly until a stopping condition is met; the embodiments of the present application do not limit the specific condition. For example, the condition may be that the value of the loss function, which can be computed from the second probability, is smaller than a threshold, or that the number of repetitions reaches a preset number.
In the embodiments of the present application, the sample character string sequence is encoded by the encoder to obtain the first sample character string vectors. A first sample character string vector is a representation of a sample character string that fuses the information of the entire sample character string sequence rather than representing the sample character string alone, i.e. it contains more information; therefore, computing the second probability of the target sample word based on the first sample character string vectors and adjusting the encoder and the probability model based on the second probability can improve the accuracy of the trained encoder and probability model, and thus the accuracy of the input method.
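Steps 101 to 105 can be pictured with the training-loop sketch below (using a model such as the one sketched earlier); the loss choice (cross-entropy on the target sample words), the optimizer and the fixed number of passes are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn

def train(model, batches, epochs=3, lr=1e-4):
    """model(pinyin_ids) is assumed to return per-position logits over candidate words
    (encoder followed by the probability head); batches yields (pinyin_ids, target_ids)
    tensors of shape (batch, K), where target_ids are the preset target sample words."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Cross-entropy raises the second probability of the target sample word and,
    # through normalisation, lowers that of the other sample candidate words.
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # stop condition: fixed number of passes
        for pinyin_ids, target_ids in batches:    # steps 102-103: forward pass
            logits = model(pinyin_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()                       # steps 104-105: adjust encoder and head
            optimizer.step()
    return model
```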
上面对编码器和概率模型的训练过程进行了说明,除此之外,在采用本申请实施例提供的词句生成方法生成词句的过程中,还可能会用到Ngram模型;所以,下面对Ngram模型的训练过程进行说明。The above describes the training process of the encoder and the probability model. In addition, the Ngram model may also be used in the process of generating words and sentences using the method of generating words and sentences provided by the embodiment of the present application; therefore, the following The training process of the Ngram model will be described.
Ngram模型的训练过程可以理解为,计算词语间的条件概率的过程。The training process of the Ngram model can be understood as the process of calculating the conditional probability between words.
Specifically, taking the pinyin input method as an example, the Chinese corpus is first converted into a Chinese word sequence by a tokenizer, and then the conditional probabilities between words are estimated by counting. For example, for the Chinese corpus 华为公司近期发布最新旗舰手机, the tokenizer produces the Chinese word sequence 华为/公司/近期/发布/最新/旗舰手机.
If N = 2, the conditional probability between words is computed as
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}),
where C(w_{n-1}) is the total number of occurrences of the word w_{n-1} in the whole corpus and C(w_{n-1}, w_n) is the number of times the two words w_{n-1} and w_n occur together in the whole corpus; correspondingly, higher-order conditional probabilities are estimated in the same way from the corresponding counts, i.e.
P(w_n | w_{n-N+1}, ..., w_{n-1}) = C(w_{n-N+1}, ..., w_n) / C(w_{n-N+1}, ..., w_{n-1}).
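A compact sketch of this counting step is shown below; the segmented corpus consists only of the single example sequence from the description and is used purely for illustration.

```python
from collections import Counter

# Segmented corpus; here only the single example sequence from the description.
corpus = [["华为", "公司", "近期", "发布", "最新", "旗舰手机"]]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_cond(prev, w):
    # P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

print(p_cond("华为", "公司"))  # 1.0 in this one-sentence corpus
```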
下面从应用阶段对本申请实施例提供的词句生成方法进行介绍。The method for generating words and sentences provided by the embodiments of the present application will be introduced below from the application stage.
具体地,本申请实施例提供了一种词句生成方法的一个实施例,该实施例可以应用于中文、日文、韩文等多种语言的输入法系统中;该输入法系统可以部署在终端设备中,也可以部署在云服务器中;当输入法系统部署在云服务器中时,该实施例由云服务器执行,并由云服务器将生成目标词句发送至终端设备,以在终端设备上显示。Specifically, the embodiment of the present application provides an embodiment of a method for generating words and sentences, which can be applied to input method systems in multiple languages such as Chinese, Japanese, and Korean; the input method system can be deployed in terminal devices , can also be deployed in the cloud server; when the input method system is deployed in the cloud server, this embodiment is executed by the cloud server, and the cloud server sends the generated target words to the terminal device for display on the terminal device.
如图8所示,该实施例包括:As shown in Figure 8, this embodiment includes:
步骤201,获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多 个候选词语,其中,M为正整数。 Step 201, obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, wherein, M is a positive integer.
具体地,步骤201可以包括:根据用户的输入得到字符串序列。Specifically, step 201 may include: obtaining a character string sequence according to user input.
由于前文对字符串进行说明,故在此不做详述,具体可参照步骤101的相关说明对步骤201进行理解。Since the character string is described above, it is not described in detail here, and step 201 can be understood by referring to the relevant description of step 101 for details.
In order to prompt the user with more kinds of target words and sentences, in most cases a character string indicates multiple candidate words; in a few cases a character string indicates only one candidate word, for example when the character string is rare and only one word corresponds to it, in which case the character string indicates that single candidate word.
步骤202,根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串。Step 202: Obtain M first character string vectors through an encoder according to the character string sequence, and each first character string vector corresponds to one of the M character strings.
示例性地,编码器是基于转换任务训练得到的,其中,转换任务是将样本字符串序列转换成样本词句的任务。Exemplarily, the encoder is trained based on a conversion task, wherein the conversion task is a task of converting sample character string sequences into sample words and sentences.
需要说明的是,基于转换任务训练的过程可以理解为编码器在训练阶段的训练过程,具体可参阅前文训练阶段的相关说明进行理解。It should be noted that the training process based on the conversion task can be understood as the training process of the encoder in the training phase. For details, please refer to the relevant description of the training phase above for understanding.
作为一种可实现的方式,步骤202包括:As an implementable manner, step 202 includes:
根据字符串序列获取M个第一位置向量和M个第二字符串向量,每个第一位置向量表示一个字符串在字符串序列中的位置,每个第二字符串向量表示一个字符串;Acquiring M first position vectors and M second character string vectors according to the string sequence, each first position vector represents the position of a character string in the character string sequence, and each second character string vector represents a character string;
根据M个第一位置向量和M个第二字符串向量,通过编码器,得到多个第一字符串向量。According to the M first position vectors and the M second character string vectors, multiple first character string vectors are obtained through an encoder.
Step 202 is similar to step 102 and can be understood with reference to the description of step 102; the difference is that the number M of first character string vectors in step 202 may differ from the number K of first sample character string vectors.
步骤203,基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率。 Step 203, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained.
作为一种可实现的方式,步骤203包括:As an implementable manner, step 203 includes:
基于M个第一字符串向量,通过概率模型,获取M个字符串指示的每个候选词语的第一概率,概率模型是基于转换任务训练得到的。Based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained through a probability model, and the probability model is obtained based on conversion task training.
其中,转换任务是将样本字符串序列转换成样本词句的任务。Among them, the conversion task is the task of converting sample character string sequences into sample words and sentences.
需要说明的是,基于转换任务训练的过程可以理解为概率模型在训练阶段的训练过程,具体可参阅前文训练阶段的相关说明进行理解。It should be noted that the training process based on the conversion task can be understood as the training process of the probability model in the training phase. For details, please refer to the relevant description of the training phase above for understanding.
Step 203 is similar to step 103 and can be understood with reference to the description of step 103; the difference is that the number M of first character string vectors in step 203 may differ from the number K of first sample character string vectors.
步骤204,基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。 Step 204, based on the first probability, generate target words and sentences, the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
具体地,对于每个字符串,可以基于第一概率从该字符串对应的所有候选词语中选择出一个候选词语;这样,对于M个字符串,则可以选择出M个候选词语,这M个候选词语则可以组成目标词句。Specifically, for each character string, a candidate word can be selected from all candidate words corresponding to the character string based on the first probability; thus, for M character strings, M candidate words can be selected, and these M Candidate words can form target words and sentences.
通常情况下,会从该字符串对应的所有候选词语中选择第一概率最大的候选词语,以生成目标词句。Usually, the candidate word with the highest probability is selected from all the candidate words corresponding to the character string to generate the target word and sentence.
For example, as shown in FIG. 9, each of the character strings "nuo", "ya", "fang", "zhou", "hen" and "bang" indicates three candidate words. For the character string "nuo", the candidate 诺 with the largest first probability is selected; similarly, for the other character strings, the candidates with the largest first probability are 亚, 方, 舟, 很 and 棒, respectively. On this basis, the target sentence 诺亚方舟很棒 ("Noah's Ark is great") can be generated.
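The selection just described (taking, for each character string, the candidate with the largest first probability) can be sketched as follows; the candidate lists and probability values are illustrative assumptions standing in for the probability-model output of FIG. 9.

```python
# For each character string: its candidate words with their first probabilities
# (toy values standing in for the probability-model output).
first_prob = [
    {"诺": 0.7, "挪": 0.2, "糯": 0.1},    # "nuo"
    {"亚": 0.6, "压": 0.3, "呀": 0.1},    # "ya"
    {"方": 0.8, "房": 0.1, "防": 0.1},    # "fang"
    {"舟": 0.7, "周": 0.2, "州": 0.1},    # "zhou"
    {"很": 0.9, "狠": 0.05, "痕": 0.05},  # "hen"
    {"棒": 0.8, "帮": 0.1, "邦": 0.1},    # "bang"
]

# Pick the candidate with the largest first probability for every character string.
target_sentence = "".join(max(cands, key=cands.get) for cands in first_prob)
print(target_sentence)  # 诺亚方舟很棒
```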
步骤205,将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。 Step 205, prompting the target word and sentence as the preferred word and sentence, which is the first word and sentence among the multiple words and sentences prompted by the input method.
输入场景中，终端设备会提示多个词句，本申请实施例将目标词句作为首选词句进行提示；以图1为例，终端设备提示了三个词句，其中，首选词句为：诺亚方舟很棒。In an input scenario, the terminal device prompts multiple words and sentences, and the embodiment of the present application prompts the target words and sentences as the preferred words and sentences; taking Figure 1 as an example, the terminal device prompts three words and sentences, among which the preferred one is "诺亚方舟很棒".
需要说明的是,生成目标词句的方法有多种,除了前文中提及的方法,还存在其他多种方法,下面对此进行介绍。It should be noted that there are many methods for generating target words and sentences. In addition to the methods mentioned above, there are many other methods, which will be introduced below.
作为一种可实现的方式,可以将编码器和Ngram模型结合,以基于编码器输出的第一概率并利用Ngram模型生成目标词句,以提高生成的目标词句的准确性。As an achievable way, the encoder and the Ngram model can be combined to generate target words and sentences based on the first probability output by the encoder and using the Ngram model, so as to improve the accuracy of the generated target words and sentences.
首先,以字符串为拼音为例,对编码器和Ngram模型的结合进行理论分析。First, taking the character string as pinyin as an example, a theoretical analysis is made on the combination of the encoder and the Ngram model.
本申请实施例可以看成是将拼音序列y_1, y_2, …, y_n转成对应的词语序列w_1, w_2, …, w_n（也可以理解为词句），实际是从所有词语序列中选择条件概率P(w_1, w_2, …, w_n | y_1, y_2, …, y_n)最大的词语序列作为目标词句。The embodiment of the present application can be regarded as converting the pinyin sequence y_1, y_2, …, y_n into the corresponding word sequence w_1, w_2, …, w_n (which can also be understood as a sentence); in effect, the word sequence with the largest conditional probability P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) is selected from all word sequences as the target words and sentences.
根据贝叶斯原理,这个条件概率可以做如下的分解和转化:According to Bayesian principle, this conditional probability can be decomposed and transformed as follows:
P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) = P(w_1 | y_1, y_2, …, y_n) × P(w_2 | y_1, y_2, …, y_n, w_1) × P(w_3 | y_1, y_2, …, y_n, w_1, w_2) × … × P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) × … × P(w_n | y_1, y_2, …, y_n, w_1, w_2, …, w_{n-1});
上述公式是将条件概率P(w_1, w_2, …, w_n | y_1, y_2, …, y_n)转化成词语概率P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1})的连乘积的形式。其中，代表词语的条件概率P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1})，可以做进一步分解，如下所示：The above formula converts the conditional probability P(w_1, w_2, …, w_n | y_1, y_2, …, y_n) into the form of a continued product of word probabilities P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}). The conditional probability P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) representing a word can be further decomposed as follows:
P(w_i | y_1, y_2, …, y_n, w_1, w_2, …, w_{i-1}) = P(w_i | y_1, y_2, …, y_n) × P(w_i | w_1, w_2, …, w_{i-1}) = P(w_i | y_1, y_2, …, y_n) × P(w_i | w_{i-N}, …, w_{i-1});
其中，P(w_i | y_1, y_2, …, y_n)是前文中计算出来的第一概率，P(w_i | w_{i-N}, …, w_{i-1})是Ngram模型计算出来的概率。在上述公式的最后一步推导中，采用了Ngram模型的马尔科夫假设，将概率P(w_i | w_1, w_2, …, w_{i-1})简化为只和w_i的前N个词相关，即将概率P(w_i | w_1, w_2, …, w_{i-1})退化成P(w_i | w_{i-N}, …, w_{i-1})，具体可以表示为：Among them, P(w_i | y_1, y_2, …, y_n) is the first probability calculated above, and P(w_i | w_{i-N}, …, w_{i-1}) is the probability calculated by the Ngram model. In the last step of the above derivation, the Markov assumption of the Ngram model is adopted, and the probability P(w_i | w_1, w_2, …, w_{i-1}) is simplified to depend only on the N words preceding w_i, that is, the probability P(w_i | w_1, w_2, …, w_{i-1}) degenerates into P(w_i | w_{i-N}, …, w_{i-1}), which can be specifically expressed as:
[Formula PCTCN2022104334-appb-000005: the degenerated probability P(w_i | w_{i-N}, …, w_{i-1})]
基于上述分析可知,可以将前文中计算出来的第一概率与Ngram模型计算出来的条件概率结合,以得到更准确的词语的概率,从而可以提示更准确的目标词句。Based on the above analysis, it can be seen that the first probability calculated above can be combined with the conditional probability calculated by the Ngram model to obtain a more accurate probability of words, thereby prompting more accurate target words and sentences.
具体地,步骤204包括:Specifically, step 204 includes:
根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;According to the string sequence, through the Ngram model, the third probability of each candidate word indicated by the M strings is obtained;
基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句。Based on the first probability, the third probability and the Viterbi algorithm, the target words and sentences are generated.
基于前文对Ngram模型的相关说明可知，候选词语的第三概率实际上是在前N个候选词语出现的情况下的条件概率，其中，N的取值可以根据实际需要进行设定，例如，N可以取1，也可以取2。Based on the previous description of the Ngram model, the third probability of a candidate word is actually the conditional probability of that word given the preceding N candidate words, where the value of N can be set according to actual needs; for example, N can be 1 or 2.
基于前文的理论分析可知，对于每个候选词语，可以将该候选词语对应的第一概率和第三概率相乘，以得到组合概率（实际也为条件概率），并利用组合概率和维特比Viterbi算法，生成目标词句。Based on the foregoing theoretical analysis, for each candidate word, the first probability and the third probability corresponding to the candidate word can be multiplied to obtain a combined probability (which is actually also a conditional probability), and the combined probability and the Viterbi algorithm are then used to generate the target words and sentences.
下面结合图10对上述过程进行具体说明。The above process will be specifically described below in conjunction with FIG. 10 .
如图10所示，基于编码模型的输出可以计算得到第一概率，以汉字“方”为例，汉字“方”的第一概率=P(方|nuo,ya,fang,zhou,hen,bang)；基于Ngram模型可以得到第三概率，以汉字“方”为例，假设N=2，汉字“方”的第三概率=P(方|亚)。As shown in Figure 10, the first probability can be calculated based on the output of the encoding model. Taking the Chinese character "方" as an example, the first probability of "方" = P(方|nuo, ya, fang, zhou, hen, bang); the third probability can be obtained based on the Ngram model. Taking "方" as an example and assuming N=2, the third probability of "方" = P(方|亚).
基于此,将第一概率P(方|nuo,ya,fang,zhou,hen,bang)与第三概率P(方|亚)相乘,即可得到汉字“方”的组合概率。Based on this, the combination probability of the Chinese character "fang" can be obtained by multiplying the first probability P(方|nuo, ya, fang, zhou, hen, bang) and the third probability P(方|亚).
采用上述方法可以得到所有汉字的组合概率,再利用Viterbi算法,便可以得到一条概率最大路径,即目标词句。Using the above method, the combination probability of all Chinese characters can be obtained, and then the Viterbi algorithm can be used to obtain a path with the highest probability, that is, the target sentence.
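A compact sketch of this combination is given below. The function names, the bigram (N=2) Ngram interface and any probability values are assumptions for illustration only; log-probabilities are used for numerical stability, which selects exactly the same path as multiplying the combined probabilities directly.

```python
import math

def viterbi_decode(pinyins, first_prob, bigram_prob):
    # pinyins:     input strings, e.g. ["nuo", "ya", "fang", "zhou", "hen", "bang"]
    # first_prob:  first_prob[i][w] = P(w | y_1..y_n), the first probability at position i
    # bigram_prob: bigram_prob(prev, w) = P(w | prev), the Ngram third probability (N=2)
    # Each lattice edge is scored with the combined probability P1(w) * P3(w | prev);
    # the returned sentence is the maximum-probability path.
    # (all probabilities are assumed to be non-zero)
    best = {w: (math.log(p), [w]) for w, p in first_prob[0].items()}
    for i in range(1, len(pinyins)):
        new_best = {}
        for w, p in first_prob[i].items():
            score, path = max(
                (prev_score + math.log(p) + math.log(bigram_prob(prev, w)), prev_path + [w])
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score, path)
        best = new_best
    return "".join(max(best.values())[1])
```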
可以理解的是，编码器和概率模型的训练和下发往往周期比较长，不能及时反映用户输入趋势的变化、用户输入场景的变化，且难以应对网络出现的新词和热词。为此，在应用阶段，可以加入多种类型的词典以弥补编码器和概率模型的不足。It is understandable that the training and delivery of the encoder and the probability model often have a relatively long cycle, so they cannot reflect changes in user input trends and user input scenarios in a timely manner, and it is difficult for them to cope with new words and hot words appearing on the network. For this reason, in the application phase, various types of dictionaries can be added to make up for the shortcomings of the encoder and the probability model.
其中，该词典也可以称为词库，词库可以包括以下至少一种类型的词库：基础词库、短语词库、用户个人词库、热点词库、各种领域词库，领域词库可以为操作系统领域的词库、人工智能技术领域的词库等。The dictionary can also be called a thesaurus, and the thesaurus can include at least one of the following types: a basic thesaurus, a phrase thesaurus, a user's personal thesaurus, a hotspot thesaurus, and various domain thesauruses; a domain thesaurus may be a thesaurus in the field of operating systems, a thesaurus in the field of artificial intelligence technology, and the like.
相应地,作为一种可实现的方式,如图11所示,步骤204包括:Correspondingly, as an achievable manner, as shown in FIG. 11, step 204 includes:
步骤301,从参考词典中获取参考词语。 Step 301, obtain reference words from a reference dictionary.
参考词语包括P个参考字符串指示的P个候选词语,每个参考字符串指示一个候选词语,P个参考字符串包含于字符串序列中,且在字符串序列中的位置连续,其中,P为大于1的整数。The reference words include P candidate words indicated by P reference character strings, each reference character string indicates a candidate word, the P reference character strings are included in the character string sequence, and the positions in the character string sequence are continuous, wherein, P is an integer greater than 1.
本申请实施例对参考词语的数量不做具体限定,参考词语的数量可以为一个,也可以为多个。The embodiment of the present application does not specifically limit the number of reference words, and the number of reference words may be one or multiple.
下面通过具体的示例对参考词语进行说明。The reference words are described below through specific examples.
具体地,参考字符串为“nuoyafangzhouhenbang”;如图12所示,从参考词典中获取的参考词语可以为,参考字符串“nuoyafangzhou”指示的“诺亚方舟”。Specifically, the reference character string is "nuoyafangzhouhenbang"; as shown in FIG. 12 , the reference word obtained from the reference dictionary may be "Noah's Ark" indicated by the reference character string "nuoyafangzhou".
步骤302,基于P个候选词语各自的第一概率,计算参考词语的第四概率。Step 302: Calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words.
需要说明的是,计算第四概率的方法有多种,本申请实施例对此不做具体限定。It should be noted that there are multiple methods for calculating the fourth probability, which are not specifically limited in this embodiment of the present application.
示例性地,可以将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率。Exemplarily, the geometric mean of the first probabilities of the P candidate words may be used as the fourth probability of the reference word.
例如，仍以图12为例，参考词语“诺亚方舟”的第四概率为：P(诺亚方舟) = (P(诺) × P(亚) × P(方) × P(舟))^(1/4)（公式PCTCN2022104334-appb-000006、PCTCN2022104334-appb-000007）；其中，P(诺)、P(亚)、P(方)和P(舟)分别表示候选词语“诺”、“亚”、“方”和“舟”的第一概率。For example, still taking Figure 12 as an example, the fourth probability of the reference word "诺亚方舟" is P(诺亚方舟) = (P(诺) × P(亚) × P(方) × P(舟))^(1/4) (formulas PCTCN2022104334-appb-000006 and PCTCN2022104334-appb-000007), where P(诺), P(亚), P(方) and P(舟) represent the first probabilities of the candidate words "诺", "亚", "方" and "舟" respectively.
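A small sketch of this geometric-mean computation follows; the function name and the example probabilities are illustrative assumptions.

```python
def fourth_probability(first_probs):
    # Geometric mean of the first probabilities of the P candidate words that
    # make up the reference word (here P = len(first_probs)).
    product = 1.0
    for p in first_probs:
        product *= p
    return product ** (1.0 / len(first_probs))

# e.g. first probabilities of "诺", "亚", "方", "舟" (made-up numbers):
print(fourth_probability([0.7, 0.6, 0.5, 0.8]))   # ≈ 0.64, larger than the product 0.168
```

Because every first probability is below 1, the geometric mean is always at least as large as the plain product, which is exactly the property used below to let the reference word be preferentially selected.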
步骤303,基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。Step 303: Generate target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings.
具体地，可以基于第四概率以及其他字符串指示的每个候选词语的第一概率，计算参考词语与其他字符串指示的候选词语构成的所有第一词语组合的概率；基于各个字符串指示的每个候选词语的第一概率可以得到各个字符串指示的候选词语构成的所有第二词语组合的概率；最后，从所有第一词语组合和所有第二词语组合中选择概率最大的词语组合作为目标词句。Specifically, based on the fourth probability and the first probability of each candidate word indicated by the other character strings, the probabilities of all first word combinations formed by the reference word and the candidate words indicated by the other character strings can be calculated; based on the first probability of each candidate word indicated by each character string, the probabilities of all second word combinations formed by the candidate words indicated by the character strings can be obtained; finally, the word combination with the largest probability is selected from all first word combinations and all second word combinations as the target words and sentences.
以图9为例，参考词语“诺亚方舟”与字符串“hen”指示的三个候选词语、字符串“bang”指示的三个候选词语构成9种第一词语组合，基于第四概率、字符串“hen”指示的三个候选词语的第一概率、字符串“bang”指示的三个候选词语的第一概率可以计算这9种第一词语组合的概率。Taking Figure 9 as an example, the reference word "诺亚方舟" and the three candidate words indicated by the character string "hen" and the three candidate words indicated by the character string "bang" form 9 first word combinations; the probabilities of these 9 first word combinations can be calculated based on the fourth probability, the first probabilities of the three candidate words indicated by "hen" and the first probabilities of the three candidate words indicated by "bang".
而字符串“nuo”、“ya”、“fang”、“zhou”、“hen”和“bang”中的每个字符串都对应三个候选词语，共构成3*3*3*3*3*3种第二词语组合；根据候选词语的第一概率可以计算每种第二词语组合的概率。Since each of the character strings "nuo", "ya", "fang", "zhou", "hen" and "bang" corresponds to three candidate words, a total of 3*3*3*3*3*3 second word combinations are formed; the probability of each second word combination can be calculated according to the first probabilities of the candidate words.
最终,从9种第一词语组合和3*3*3*3*3*3种第二词语组合中选择概率最大的词语组合作为目标词句。Finally, the word combination with the highest probability is selected from the 9 first word combinations and the 3*3*3*3*3*3 second word combinations as the target words and sentences.
可以理解的是,第一词语组合包含于第二词语组合内;由于第一词语组合中包含参考词语,而参考词语包含于参考词典中,所以可以优先选择包含参考词语的词语组合作为目标词句。It can be understood that the first word combination is included in the second word combination; since the first word combination contains reference words, and the reference words are included in the reference dictionary, so the word combination containing the reference words can be preferentially selected as the target word and sentence.
具体地，可以在步骤302中设定相应的第四概率的计算方法，使得得到的参考词语的第四概率大于参考词语中各个候选词语的第一概率的乘积，这样就会使得包含参考词语的词语组合的概率变大，从而可以被优先选择。Specifically, the calculation method of the fourth probability can be set in step 302 so that the obtained fourth probability of the reference word is greater than the product of the first probabilities of the candidate words in the reference word; in this way, the probability of a word combination containing the reference word becomes larger, so that it can be preferentially selected.
例如,将P个候选词语的第一概率的几何平均值,作为参考词语的第四概率,则可以保证参考词语的第四概率大于参考词语中P个候选词语的第一概率的乘积。For example, if the geometric mean of the first probabilities of the P candidate words is used as the fourth probability of the reference word, it can be ensured that the fourth probability of the reference word is greater than the product of the first probabilities of the P candidate words in the reference word.
另外，当参考词语的第四概率大于参考词语中各个候选词语的第一概率的乘积时，在利用第一概率计算第二词语组合的概率时，便可以不计算第一词语组合的概率，仅利用第一概率计算第二词语组合中除第一词语组合外的其他第二词语组合的概率。In addition, when the fourth probability of the reference word is greater than the product of the first probabilities of the candidate words in the reference word, then when the first probabilities are used to calculate the probabilities of the second word combinations, the probabilities of the first word combinations need not be calculated; only the probabilities of the second word combinations other than the first word combinations are calculated using the first probabilities.
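To make the choice between first and second word combinations concrete, here is a brute-force sketch, feasible only for short inputs; in practice a Viterbi-style search would be used. The function name, arguments and the idea of enumerating every combination are illustrative assumptions, not the disclosed implementation.

```python
from itertools import product

def best_with_reference(pinyins, first_prob, ref_span, ref_word, ref_prob):
    # pinyins:    input strings, e.g. ["nuo", "ya", "fang", "zhou", "hen", "bang"]
    # first_prob: first_prob[i] maps each candidate word of string i to its first probability
    # ref_span:   (start, end) positions covered by the reference word, e.g. (0, 4)
    # ref_word:   the reference word from the dictionary, e.g. "诺亚方舟"
    # ref_prob:   its fourth probability (e.g. the geometric mean sketched above)
    start, end = ref_span

    def prob_product(ps):
        out = 1.0
        for p in ps:
            out *= p
        return out

    candidates = []   # (sentence, probability)

    # "second" word combinations: one candidate per string at every position
    for choice in product(*(first_prob[i].items() for i in range(len(pinyins)))):
        candidates.append(("".join(w for w, _ in choice),
                           prob_product(p for _, p in choice)))

    # "first" word combinations: the reference word replaces positions start..end-1
    outside = [first_prob[i].items() for i in range(start)] + \
              [first_prob[i].items() for i in range(end, len(pinyins))]
    for choice in product(*outside):
        prefix, suffix = choice[:start], choice[start:]
        sentence = "".join(w for w, _ in prefix) + ref_word + "".join(w for w, _ in suffix)
        candidates.append((sentence, ref_prob * prob_product(p for _, p in choice)))

    return max(candidates, key=lambda c: c[1])[0]
```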
在该实施例中,通过增加参考词库弥补编码器和概率模型的不足,从而可以提高目标词句的准确率。In this embodiment, the insufficiency of the encoder and the probability model is made up for by adding a reference lexicon, so that the accuracy of the target words and sentences can be improved.
为了进一步提高目标词句的准确率,可以将编码器、参考词库以及Ngram模型三者结合,以生成目标词句。In order to further improve the accuracy of the target words and sentences, the encoder, the reference lexicon and the Ngram model can be combined to generate the target words and sentences.
具体地,作为一种可实现的方式,步骤303包括:Specifically, as an implementable manner, step 303 includes:
通过Ngram模型，获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率以及参考词语的第五概率；Through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, and the fifth probability of the reference word, are obtained;
基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率,第四概率、第五概率以及Viterbi算法,生成目标词句。Based on the first probability, the fourth probability, the fifth probability and the Viterbi algorithm of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, the target words and sentences are generated.
需要说明的是，本申请实施例将参考词语中的所有候选词语看成一个整体，这样，就不需要通过Ngram模型计算参考词语内部的候选词语之间的条件概率，仅需通过Ngram模型计算参考词语的第五概率即可；在计算参考词语的第五概率的过程中，可以计算参考词语中排在第一位的候选词语的第五概率，并将排在第一位的候选词语的第五概率作为参考词语的第五概率。It should be noted that the embodiment of the present application regards all the candidate words in the reference word as a whole, so that the conditional probabilities between the candidate words inside the reference word do not need to be calculated through the Ngram model; only the fifth probability of the reference word needs to be calculated through the Ngram model. In the process of calculating the fifth probability of the reference word, the fifth probability of the candidate word ranked first in the reference word can be calculated and used as the fifth probability of the reference word.
下面通过具体的示例对上述过程进行说明。The above process will be described below through a specific example.
例如，仍以图9为例，参考词语为“诺亚方舟”；通过步骤302可以计算“诺亚方舟”的第四概率，通过步骤203可以计算字符串“hen”指示的三个候选词语的第一概率、字符串“bang”指示的三个候选词语的第一概率；接下来，通过Ngram模型计算参考词语中排在第一位的候选词语“诺”的第五概率，并将“诺”的第五概率作为参考词语“诺亚方舟”的第五概率，通过Ngram模型计算字符串“hen”指示的三个候选词语的第五概率、字符串“bang”指示的三个候选词语的第五概率；最终，基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率，第四概率、第五概率以及Viterbi算法便可以得到概率最大的词语组合，并将概率最大的词语组合作为目标词句。For example, still taking Figure 9 as an example, the reference word is "诺亚方舟"; the fourth probability of "诺亚方舟" can be calculated through step 302, and the first probabilities of the three candidate words indicated by the character string "hen" and the three candidate words indicated by the character string "bang" can be calculated through step 203. Next, the fifth probability of "诺", the candidate word ranked first in the reference word, is calculated through the Ngram model and used as the fifth probability of the reference word "诺亚方舟", and the fifth probabilities of the three candidate words indicated by "hen" and of the three candidate words indicated by "bang" are also calculated through the Ngram model. Finally, based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm, the word combination with the largest probability can be obtained and used as the target words and sentences.
需要说明的是，由于参考词典提供了参考词语，所以在通过Ngram模型计算参考词语后面的候选词语的概率的过程中，若需要用到参考字符串所指示的候选词语，则可以仅考虑参考词语中的候选词语。It should be noted that since the reference dictionary provides the reference word, in the process of calculating, through the Ngram model, the probabilities of the candidate words following the reference word, if the candidate words indicated by the reference character strings are needed, only the candidate words in the reference word need to be considered.
具体地,作为一种可实现的方式,目标字符串为字符串序列中排在P个参考字符串之后的字符串。Specifically, as an achievable manner, the target character string is a character string that is ranked after the P reference character strings in the character string sequence.
目标字符串指示的每个候选词语的第五概率是，在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率，其中，Q为正整数，具体是基于不同的Ngram模型确定的。The fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given that Q candidate words occur, where Q is a positive integer that is determined by the specific Ngram model used.
Q个候选词语包括字符串序列中，排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当Q个字符串包含参考字符串时，Q个候选词语包含参考字符串指示的参考词语中的候选词语。The Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the corresponding candidate words in the reference word indicated by that reference character string.
以图9为例，在计算候选词语“痕”的第五概率的过程中，若Q=1，则“痕”的第五概率表示在候选词语“舟”出现的情况下的条件概率；在计算候选词语“榜”的第五概率的过程中，若Q=2，则“榜”的第五概率表示在候选词语“舟”以及字符串“hen”指示的一个候选词语（例如“恨”）出现的情况下的条件概率。Taking Figure 9 as an example, in the process of calculating the fifth probability of the candidate word "痕", if Q=1, the fifth probability of "痕" represents the conditional probability given that the candidate word "舟" occurs; in the process of calculating the fifth probability of the candidate word "榜", if Q=2, the fifth probability of "榜" represents the conditional probability given that the candidate word "舟" and one candidate word indicated by the character string "hen" (for example "恨") occur.
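The restriction described above, fixing the conditioning words that fall inside the reference word, can be sketched as follows. The helper name and data layout are hypothetical; the sketch only illustrates which Ngram contexts remain to be looked up.

```python
from itertools import product

def allowed_ngram_contexts(pos, q, per_string_candidates, ref_span, ref_chars):
    # pos:  index of the target string (the one whose fifth probability is wanted)
    # q:    number of conditioning positions used by the Ngram model
    # per_string_candidates[i]: candidate words of string i (used outside the reference span)
    # ref_span / ref_chars: positions covered by the reference word and its characters,
    #                       e.g. (0, 4) and ["诺", "亚", "方", "舟"]
    start, end = ref_span
    slots = []
    for i in range(pos - q, pos):
        if start <= i < end:
            slots.append([ref_chars[i - start]])       # fixed by the reference word
        else:
            slots.append(per_string_candidates[i])      # all candidates of that string
    return list(product(*slots))

# With ref_span = (0, 4) and Q = 2, the contexts for the string "hen" (pos = 4)
# collapse to [("方", "舟")] instead of 3 * 3 = 9 candidate pairs.
```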
请参阅图13，本申请实施例还提供了一种词句生成装置，包括：第一获取单元401，用于获取字符串序列，字符串序列包括M个字符串，每个字符串指示一个或多个候选词语，其中，M为正整数；第一编码单元402，用于根据字符串序列，通过编码器，得到M个第一字符串向量，每个第一字符串向量对应M个字符串中的一个字符串；第二获取单元403，用于基于M个第一字符串向量，获取M个字符串指示的每个候选词语的第一概率；生成单元404，用于基于第一概率，生成目标词句，目标词句包括M个目标词语，每个目标词语为每个字符串指示的一个或多个候选词语中的一个。Referring to FIG. 13, an embodiment of the present application further provides a word and sentence generation apparatus, including: a first acquisition unit 401, configured to acquire a character string sequence, where the character string sequence includes M character strings and each character string indicates one or more candidate words, M being a positive integer; a first encoding unit 402, configured to obtain M first character string vectors through an encoder according to the character string sequence, where each first character string vector corresponds to one of the M character strings; a second acquisition unit 403, configured to acquire, based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings; and a generation unit 404, configured to generate target words and sentences based on the first probability, where the target words and sentences include M target words and each target word is one of the one or more candidate words indicated by each character string.
作为一种可实现的方式，第一编码单元402，用于根据字符串序列获取M个第一位置向量和M个第二字符串向量，每个第一位置向量表示一个字符串在字符串序列中的位置，每个第二字符串向量表示一个字符串；根据M个第一位置向量和M个第二字符串向量，通过编码器，得到多个第一字符串向量。As an implementable manner, the first encoding unit 402 is configured to acquire M first position vectors and M second character string vectors according to the character string sequence, where each first position vector represents the position of a character string in the character string sequence and each second character string vector represents a character string, and to obtain the multiple first character string vectors through the encoder according to the M first position vectors and the M second character string vectors.
作为一种可实现的方式,编码器是基于转换任务训练得到的,转换任务是将样本字符串序列转换成样本词句的任务。As an achievable way, the encoder is trained based on the conversion task, which is the task of converting a sequence of sample strings into sample words and sentences.
作为一种可实现的方式，第二获取单元403，用于基于M个第一字符串向量，通过概率模型，获取M个字符串指示的每个候选词语的第一概率，概率模型是基于转换任务训练得到的，转换任务是将样本字符串序列转换成样本词句的任务。As an implementable manner, the second acquisition unit 403 is configured to acquire, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, where the probability model is trained based on a conversion task and the conversion task is the task of converting a sample character string sequence into sample words and sentences.
作为一种可实现的方式,生成单元404,用于根据字符串序列,通过Ngram模型,获取M个字符串指示的每个候选词语的第三概率;基于第一概率,第三概率以及维特比Viterbi算法,生成目标词句。As an achievable manner, the generating unit 404 is configured to obtain the third probability of each candidate word indicated by the M character strings through the Ngram model according to the character string sequence; based on the first probability, the third probability and the Viterbi Viterbi algorithm to generate target words and sentences.
作为一种可实现的方式，生成单元404，用于从参考词典中获取参考词语，参考词语包括P个参考字符串指示的P个候选词语，每个参考字符串指示一个候选词语，P个参考字符串包含于字符串序列中，且在字符串序列中的位置连续，其中，P为大于1的整数；基于P个候选词语各自的第一概率，计算参考词语的第四概率；基于第四概率以及字符串序列中除P个参考字符串外的其他字符串指示的每个候选词语的第一概率，生成目标词句。As an implementable manner, the generation unit 404 is configured to acquire a reference word from a reference dictionary, where the reference word includes P candidate words indicated by P reference character strings, each reference character string indicates one candidate word, and the P reference character strings are included in the character string sequence with consecutive positions therein, P being an integer greater than 1; to calculate the fourth probability of the reference word based on the respective first probabilities of the P candidate words; and to generate the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings.
作为一种可实现的方式，生成单元404，用于通过Ngram模型，获取字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第五概率，以及参考词语的第五概率；基于字符串序列中除P个参考字符串外其他字符串指示的每个候选词语的第一概率，第四概率、第五概率以及Viterbi算法，生成目标词句。As an implementable manner, the generation unit 404 is configured to acquire, through the Ngram model, the fifth probability of each candidate word indicated by the character strings in the character string sequence other than the P reference character strings, as well as the fifth probability of the reference word, and to generate the target words and sentences based on the first probability of each candidate word indicated by the character strings other than the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm.
作为一种可实现的方式，目标字符串为字符串序列中排在P个参考字符串之后的字符串；目标字符串指示的每个候选词语的第五概率是，在Q个候选词语出现的情况下目标字符串指示的候选词语出现的条件概率，Q为正整数；Q个候选词语包括字符串序列中，排在目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当Q个字符串包含参考字符串时，Q个候选词语包含参考字符串指示的参考词语中的候选词语。As an implementable manner, the target character string is a character string ranked after the P reference character strings in the character string sequence; the fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs given that Q candidate words occur, Q being a positive integer; the Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include a reference character string, the Q candidate words include the candidate words in the reference word indicated by that reference character string.
作为一种可实现的方式,该装置还包括提示单元405,用于将目标词句作为首选词句进行提示,首选词句为输入法提示的多个词句中排在第一位的词句。As a practicable manner, the device further includes a prompting unit 405, configured to prompt the target word and sentence as a preferred word and sentence, and the preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
作为一种可实现的方式,字符串包括一个拼音或多个拼音。As an implementable manner, the character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例应用阶段的描述。For the specific implementation, related description and technical effects of the above units, please refer to the description of the application stage of the embodiment of the present application.
请参阅图14，本申请实施例还提供了一种模型训练装置，包括：第三获取单元501，用于获取样本字符串序列，样本字符串序列包括K个样本字符串，每个样本字符串指示一个或多个样本候选词语，其中，K为正整数；第二编码单元502，用于根据样本字符串序列，通过编码器，得到K个第一样本字符串向量，每个样本字符串向量对应一个样本字符串；第四获取单元503，用于基于K个第一样本字符串向量，获取K个样本字符串指示的每个样本候选词语的第二概率；调整单元504，用于基于第二概率，对编码器进行调整。Referring to FIG. 14, an embodiment of the present application further provides a model training apparatus, including: a third acquisition unit 501, configured to acquire a sample character string sequence, where the sample character string sequence includes K sample character strings and each sample character string indicates one or more sample candidate words, K being a positive integer; a second encoding unit 502, configured to obtain K first sample character string vectors through an encoder according to the sample character string sequence, where each sample character string vector corresponds to one sample character string; a fourth acquisition unit 503, configured to acquire, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings; and an adjustment unit 504, configured to adjust the encoder based on the second probability.
作为一种可实现的方式，第二编码单元502，用于根据样本字符串序列获取K个第二位置向量和K个第二样本字符串向量，每个第二位置向量表示一个样本字符串在样本字符串序列中的位置，每个第二样本字符串向量表示一个样本字符串；根据K个第二位置向量和K个第二样本字符串向量，通过编码器，得到K个第一样本字符串向量。As an implementable manner, the second encoding unit 502 is configured to acquire K second position vectors and K second sample character string vectors according to the sample character string sequence, where each second position vector represents the position of a sample character string in the sample character string sequence and each second sample character string vector represents a sample character string, and to obtain the K first sample character string vectors through the encoder according to the K second position vectors and the K second sample character string vectors.
作为一种可实现的方式，每个样本字符串指示的样本候选词语中包含一个目标样本词语；调整单元504，用于调整编码器的参数，以使得目标样本词语的第二概率增加，和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an implementable manner, the sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the encoder so that the second probability of the target sample word increases and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
作为一种可实现的方式，第四获取单元503，用于基于K个第一样本字符串向量，通过概率模型，获取K个样本字符串指示的每个样本候选词语的第二概率；调整单元504，还用于基于第二概率，对概率模型进行调整。As an implementable manner, the fourth acquisition unit 503 is configured to acquire, based on the K first sample character string vectors and through a probability model, the second probability of each sample candidate word indicated by the K sample character strings; the adjustment unit 504 is further configured to adjust the probability model based on the second probability.
作为一种可实现的方式，每个样本字符串指示的样本候选词语中包含一个目标样本词语；调整单元504，用于调整概率模型的参数，以使得目标样本词语的第二概率增加，和/或以使得除目标样本词语外的其他样本候选词语的第二概率降低。As an implementable manner, the sample candidate words indicated by each sample character string contain one target sample word; the adjustment unit 504 is configured to adjust the parameters of the probability model so that the second probability of the target sample word increases and/or so that the second probabilities of the sample candidate words other than the target sample word decrease.
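As a hedged illustration of how such an adjustment could be realized in practice (one common choice, not necessarily the patent's), a cross-entropy style loss over the second probabilities raises the probability of the target sample word and lowers that of the other sample candidate words; the encoder and probability head are assumed to follow the earlier sketches.

```python
import torch.nn as nn

def training_step(encoder, prob_model, optimizer, sample_string_ids, target_word_ids):
    # sample_string_ids: (batch, K) ids of the K sample strings
    # target_word_ids:   (batch, K) id of the target sample word for each sample string
    # prob_model is assumed to output second probabilities (a softmax over the
    # word vocabulary), as in the earlier sketch.
    probs = prob_model(encoder(sample_string_ids))             # (batch, K, vocab)
    loss = nn.functional.nll_loss(
        probs.clamp_min(1e-9).log().reshape(-1, probs.size(-1)),
        target_word_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # adjusts encoder and prob_model
    return loss.item()
```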
作为一种可实现的方式,第三获取单元501,用于基于K个目标样本词语获取样本字符串序列中的K个样本字符串。As an implementable manner, the third acquiring unit 501 is configured to acquire K sample character strings in the sample character string sequence based on the K target sample words.
作为一种可实现的方式,样本字符串包括一个拼音或多个拼音。As an implementable manner, the sample character string includes one pinyin or multiple pinyins.
其中,以上各单元的具体实现、相关说明以及技术效果请参考本申请实施例训练阶段的描述。Wherein, for the specific implementation, related descriptions and technical effects of the above units, please refer to the description of the training stage in the embodiment of the present application.
请参阅图15,图15是本申请实施例提供的计算机设备的一种结构示意图,该计算机设备可以是终端设备,也可以是服务器,具体用于实现图13对应实施例中词句生成装置的功能或图14对应实施例中模型训练装置的功能;计算机设备1800可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1822(例如,一个或一个以上处理器)和存储器1832,一个或一个以上存储应用程序1842或数据1844的存储介质1830(例如一个或一个以上海量存储设备)。其中,存储器1832和存储介质1830可以是短暂存储或持久存储。存储在存储介质1830的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对计算机设备中的一系列指令操作。更进一步地,中央处理器1822可以设置为与存储介质1830通信,在计算机设备1800上执行存储介质1830中的一系列指令操作。Please refer to Figure 15. Figure 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device can be a terminal device or a server, and is specifically used to implement the function of the word-sentence generation device in the embodiment corresponding to Figure 13 Or Fig. 14 corresponds to the function of the model training device in the embodiment; the computer equipment 1800 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 1822 (for example, one or more than one processor) and memory 1832, and one or more storage media 1830 (such as one or more mass storage devices) for storing application programs 1842 or data 1844. Wherein, the memory 1832 and the storage medium 1830 may be temporary storage or persistent storage. The program stored in the storage medium 1830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the computer device. Furthermore, the central processing unit 1822 may be configured to communicate with the storage medium 1830 , and execute a series of instruction operations in the storage medium 1830 on the computer device 1800 .
计算机设备1800还可以包括一个或一个以上电源1826,一个或一个以上有线或无线 网络接口1850,一个或一个以上输入输出接口1858,和/或,一个或一个以上操作系统1841,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。 Computer device 1800 can also include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input and output interfaces 1858, and/or, one or more operating systems 1841, such as Windows Server™, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
本申请实施例中，中央处理器1822，可以用于执行图13对应实施例中词句生成装置执行的词句生成方法。具体的，中央处理器1822，可以用于：In the embodiment of the present application, the central processing unit 1822 may be used to execute the word and sentence generation method performed by the word and sentence generation apparatus in the embodiment corresponding to FIG. 13. Specifically, the central processing unit 1822 may be used to:
获取字符串序列,字符串序列包括M个字符串,每个字符串指示一个或多个候选词语,其中,M为正整数;Obtain a character string sequence, the character string sequence includes M character strings, each character string indicates one or more candidate words, where M is a positive integer;
根据字符串序列,通过编码器,得到M个第一字符串向量,每个第一字符串向量对应M个字符串中的一个字符串;According to the string sequence, M first string vectors are obtained through an encoder, and each first string vector corresponds to one of the M strings;
基于M个第一字符串向量,获取M个字符串指示的每个候选词语的第一概率;Based on the M first character string vectors, the first probability of each candidate word indicated by the M character strings is obtained;
基于第一概率,生成目标词句,目标词句包括M个目标词语,每个目标词语为每个字符串指示的一个或多个候选词语中的一个。Based on the first probability, target words and sentences are generated, and the target words and sentences include M target words, and each target word is one of one or more candidate words indicated by each character string.
本申请实施例中,中央处理器1822,可以用于执行图14对应实施例中模型训练装置执行的模型训练方法。具体的,中央处理器1822,可以用于:In the embodiment of the present application, the central processing unit 1822 may be used to execute the model training method performed by the model training device in the embodiment corresponding to FIG. 14 . Specifically, the central processing unit 1822 can be used for:
获取样本字符串序列,样本字符串序列包括K个样本字符串,每个样本字符串指示一个或多个样本候选词语,其中,K为正整数;Obtain a sequence of sample strings, the sequence of sample strings includes K sample strings, each sample string indicates one or more sample candidate words, where K is a positive integer;
根据样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个样本字符串向量对应一个样本字符串;According to the sample character string sequence, K first sample character string vectors are obtained through an encoder, and each sample character string vector corresponds to a sample character string;
基于K个第一样本字符串向量,获取K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, obtain the second probability of each sample candidate word indicated by the K sample character strings;
基于第二概率,对编码器进行调整。Based on the second probability, the encoder is adjusted.
本申请实施例还提供一种芯片,包括一个或多个处理器。所述处理器中的部分或全部用于读取并执行存储器中存储的计算机程序,以执行前述各实施例的方法。The embodiment of the present application also provides a chip, including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the methods of the foregoing embodiments.
可选地，该芯片还包括存储器，该存储器与该处理器通过电路或电线连接。进一步可选地，该芯片还包括通信接口，处理器与该通信接口连接。通信接口用于接收需要处理的数据和/或信息，处理器从该通信接口获取该数据和/或信息，并对该数据和/或信息进行处理，并通过该通信接口输出处理结果。该通信接口可以是输入输出接口。Optionally, the chip further includes a memory, and the memory is connected to the processor through a circuit or wires. Further optionally, the chip also includes a communication interface, and the processor is connected to the communication interface. The communication interface is used to receive data and/or information to be processed; the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
在一些实现方式中,所述一个或多个处理器中还可以有部分处理器是通过专用硬件的方式来实现以上方法中的部分步骤,例如涉及神经网络模型的处理可以由专用神经网络处理器或图形处理器来实现。In some implementations, some of the one or more processors may implement some of the steps in the above method through dedicated hardware, for example, the processing related to the neural network model may be performed by a dedicated neural network processor or graphics processor to achieve.
本申请实施例提供的方法可以由一个芯片实现,也可以由多个芯片协同实现。The method provided in the embodiment of the present application may be implemented by one chip, or may be implemented by multiple chips in cooperation.
本申请实施例还提供了一种计算机存储介质,该计算机存储介质用于储存为上述计算机设备所用的计算机软件指令,其包括用于执行为计算机设备所设计的程序。The embodiment of the present application also provides a computer storage medium, which is used for storing computer software instructions used by the above-mentioned computer equipment, which includes a program for executing a program designed for the computer equipment.
该计算机设备可以如前述图13对应实施例中词句生成装置或图14对应实施例中模型训练装置。The computer device may be the word-sentence generating device in the embodiment corresponding to FIG. 13 or the model training device in the embodiment corresponding to FIG. 14 .
本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现前述各个实施例所示的方法中的流程。The embodiment of the present application also provides a computer program product, the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the procedures in the methods shown in the foregoing embodiments.
所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统、装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, for the specific working process of the system, apparatus and units described above, reference may be made to the corresponding process in the foregoing method embodiments, and details are not repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device and method can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or May be integrated into another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or part of the contribution to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disc, etc., which can store program codes. .

Claims (23)

  1. 一种词句生成方法,其特征在于,包括:A method for generating words and sentences, characterized in that, comprising:
    获取字符串序列,所述字符串序列包括M个字符串,每个所述字符串指示一个或多个候选词语,其中,M为正整数;Obtain a character string sequence, the character string sequence includes M character strings, each of which indicates one or more candidate words, where M is a positive integer;
    根据所述字符串序列,通过编码器,得到M个第一字符串向量,每个所述第一字符串向量对应所述M个字符串中的一个字符串;According to the string sequence, M first string vectors are obtained through an encoder, and each of the first string vectors corresponds to one of the M strings;
    基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率;Based on the M first character string vectors, obtaining a first probability of each candidate word indicated by the M character strings;
    基于所述第一概率,生成目标词句,所述目标词句包括M个目标词语,每个所述目标词语为所述每个字符串指示的一个或多个候选词语中的一个。Based on the first probability, generate target words and sentences, where the target words and sentences include M target words, each of which is one of the one or more candidate words indicated by each character string.
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述字符串序列,通过编码器,得到M个第一字符串向量包括:The method according to claim 1, wherein said obtaining M first character string vectors through an encoder according to said character string sequence comprises:
    根据所述字符串序列获取M个第一位置向量和M个第二字符串向量，每个所述第一位置向量表示一个所述字符串在所述字符串序列中的位置，每个所述第二字符串向量表示一个所述字符串；Acquiring M first position vectors and M second character string vectors according to the character string sequence, wherein each of the first position vectors represents a position of one of the character strings in the character string sequence, and each of the second character string vectors represents one of the character strings;
    根据所述M个第一位置向量和所述M个第二字符串向量,通过编码器,得到所述多个第一字符串向量。Obtain the multiple first string vectors through an encoder according to the M first position vectors and the M second string vectors.
  3. 根据权利要求1或2所述的方法,其特征在于,所述编码器是基于转换任务训练得到的,所述转换任务是将样本字符串序列转换成样本词句的任务。The method according to claim 1 or 2, wherein the encoder is trained based on a conversion task, and the conversion task is a task of converting sample character string sequences into sample words and sentences.
  4. 根据权利要求1至3中任意一项所述方法,其特征在于,所述基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率包括:The method according to any one of claims 1 to 3, wherein said obtaining the first probability of each candidate word indicated by said M character strings based on said M first character string vectors comprises:
    基于所述M个第一字符串向量，通过概率模型，获取所述M个字符串指示的每个候选词语的第一概率，所述概率模型是基于转换任务训练得到的，所述转换任务是将样本字符串序列转换成样本词句的任务。Obtaining, based on the M first character string vectors and through a probability model, the first probability of each candidate word indicated by the M character strings, wherein the probability model is trained based on a conversion task, and the conversion task is the task of converting a sample character string sequence into sample words and sentences.
  5. 根据权利要求1至4中任意一项所述方法,其特征在于,所述基于所述第一概率,生成目标词句包括:According to the method described in any one of claims 1 to 4, it is characterized in that said generating target words and sentences based on said first probability comprises:
    根据所述字符串序列,通过Ngram模型,获取所述M个字符串指示的每个候选词语的第三概率;According to the character string sequence, through the Ngram model, obtain the third probability of each candidate word indicated by the M character strings;
    基于所述第一概率,所述第三概率以及维特比Viterbi算法,生成目标词句。Based on the first probability, the third probability and the Viterbi algorithm, target words and sentences are generated.
  6. 根据权利要求1至4中任意一项所述方法,其特征在于,所述基于所述第一概率,生成目标词句包括:According to the method described in any one of claims 1 to 4, it is characterized in that said generating target words and sentences based on said first probability comprises:
    从参考词典中获取参考词语，所述参考词语包括P个参考字符串指示的P个候选词语，每个所述参考字符串指示一个所述候选词语，所述P个参考字符串包含于所述字符串序列中，且在所述字符串序列中的位置连续，其中，P为大于1的整数；Obtaining a reference word from a reference dictionary, wherein the reference word includes P candidate words indicated by P reference character strings, each of the reference character strings indicates one of the candidate words, the P reference character strings are included in the character string sequence and have consecutive positions in the character string sequence, and P is an integer greater than 1;
    基于所述P个候选词语各自的第一概率,计算所述参考词语的第四概率;Based on the respective first probabilities of the P candidate words, calculating a fourth probability of the reference word;
    基于所述第四概率以及所述字符串序列中除所述P个参考字符串外的其他字符串指示的每个候选词语的第一概率,生成目标词句。Based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, a target word and sentence is generated.
  7. 根据权利要求6所述的方法，其特征在于，所述基于所述第四概率以及所述字符串序列中除所述P个参考字符串外的其他字符串指示的每个候选词语的第一概率，生成目标词句包括：The method according to claim 6, characterized in that the generating the target words and sentences based on the fourth probability and the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings comprises:
    通过Ngram模型,获取所述字符串序列中除所述P个参考字符串外其他字符串指示的每个候选词语的第五概率,以及所述参考词语的第五概率;Through the Ngram model, obtain the fifth probability of each candidate word indicated by other character strings except the P reference character strings in the character string sequence, and the fifth probability of the reference word;
    基于所述字符串序列中除所述P个参考字符串外其他字符串指示的每个候选词语的第一概率,所述第四概率、所述第五概率以及Viterbi算法,生成目标词句。Based on the first probability of each candidate word indicated by other character strings in the character string sequence except the P reference character strings, the fourth probability, the fifth probability and the Viterbi algorithm, a target word and sentence is generated.
  8. 根据权利要求7所述的方法,其特征在于,目标字符串为所述字符串序列中排在所述P个参考字符串之后的字符串;The method according to claim 7, wherein the target character string is a character string after the P reference character strings in the character string sequence;
    所述目标字符串指示的每个候选词语的第五概率是,在Q个候选词语出现的情况下所述目标字符串指示的候选词语出现的条件概率,Q为正整数;The fifth probability of each candidate word indicated by the target character string is the conditional probability that the candidate word indicated by the target character string occurs when Q candidate words appear, and Q is a positive integer;
    所述Q个候选词语包括所述字符串序列中，排在所述目标字符串前的Q个连续字符串中的每个字符串指示的一个候选词语，且当所述Q个字符串包含所述参考字符串时，所述Q个候选词语包含所述参考字符串指示的所述参考词语中的候选词语。The Q candidate words include one candidate word indicated by each of the Q consecutive character strings preceding the target character string in the character string sequence, and when the Q character strings include the reference character string, the Q candidate words include candidate words in the reference word indicated by the reference character string.
  9. 根据权利要求1至8中任意一项所述方法,其特征在于,在所述基于所述第一概率,生成目标词句之后,所述方法还包括:将所述目标词句作为首选词句进行提示,所述首选词句为输入法提示的多个词句中排在第一位的词句。According to the method according to any one of claims 1 to 8, it is characterized in that, after said generating target words and sentences based on said first probability, said method further comprises: prompting said target words and sentences as preferred words and sentences, The preferred word and sentence is the first word and sentence among the multiple words and sentences prompted by the input method.
  10. 根据权利要求1至9中任意一项所述方法,其特征在于,所述字符串包括一个拼音或多个拼音。The method according to any one of claims 1 to 9, wherein the character string includes one pinyin or multiple pinyins.
  11. 一种模型训练方法,其特征在于,包括:A model training method, characterized in that, comprising:
    获取样本字符串序列,所述样本字符串序列包括K个样本字符串,每个所述样本字符串指示一个或多个样本候选词语,其中,K为正整数;Acquiring a sequence of sample character strings, the sequence of sample character strings includes K sample character strings, each of which indicates one or more sample candidate words, where K is a positive integer;
    根据所述样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个所述样本字符串向量对应一个所述样本字符串;Obtain K first sample string vectors through an encoder according to the sample string sequence, each of the sample string vectors corresponds to one of the sample strings;
    基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, acquiring a second probability of each sample candidate word indicated by the K sample character strings;
    基于所述第二概率,对所述编码器进行调整。Based on the second probability, the encoder is adjusted.
  12. 根据权利要求11所述的方法,其特征在于,所述根据样本字符串序列,通过编码器,得到K个第一样本字符串向量包括:The method according to claim 11, wherein said obtaining K first sample character string vectors through an encoder according to the sample character string sequence comprises:
    根据所述样本字符串序列获取K个第二位置向量和K个第二样本字符串向量,每个所述第二位置向量表示一个所述样本字符串在所述样本字符串序列中的位置,每个所述第二样本字符串向量表示一个所述样本字符串;Acquiring K second position vectors and K second sample character string vectors according to the sample character string sequence, each of the second position vectors represents a position of the sample character string in the sample character string sequence, Each of the second sample string vectors represents one of the sample strings;
    根据所述K个第二位置向量和所述K个第二样本字符串向量,通过编码器,得到所述K个第一样本字符串向量。According to the K second position vectors and the K second sample character string vectors, the K first sample character string vectors are obtained through an encoder.
  13. 根据权利要求11或12所述的方法,其特征在于,每个所述样本字符串指示的样本候选词语中包含一个目标样本词语;The method according to claim 11 or 12, wherein each sample candidate word indicated by the sample string contains a target sample word;
    所述基于所述第二概率,对所述编码器进行调整包括:The adjusting the encoder based on the second probability includes:
    调整所述编码器的参数,以使得所述目标样本词语的第二概率增加,和/或以使得除所述目标样本词语外的其他样本候选词语的第二概率降低。Adjusting the parameters of the encoder so that the second probability of the target sample word increases, and/or so that the second probabilities of other sample candidate words except the target sample word decrease.
  14. 根据权利要求11至13中任意一项所述方法，其特征在于，所述基于所述K个第一样本字符串向量，获取所述K个样本字符串指示的每个样本候选词语的第二概率包括：The method according to any one of claims 11 to 13, characterized in that the obtaining, based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings comprises:
    基于所述K个第一样本字符串向量,通过概率模型,获取所述K个样本字符串指示的每个样本候选词语的第二概率;Based on the K first sample character string vectors, the second probability of each sample candidate word indicated by the K sample character strings is obtained through a probability model;
    在所述基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率之后,所述方法还包括:After obtaining the second probability of each sample candidate word indicated by the K sample strings based on the K first sample string vectors, the method further includes:
    基于所述第二概率,对所述概率模型进行调整。Based on the second probability, the probability model is adjusted.
  15. 根据权利要求14所述的方法,其特征在于,每个所述样本字符串指示的样本候选词语中包含一个目标样本词语;The method according to claim 14, wherein each sample candidate word indicated by the sample string contains a target sample word;
    所述基于所述第二概率,对所述概率模型进行调整包括:The adjusting the probability model based on the second probability includes:
    调整所述概率模型的参数,以使得所述目标样本词语的第二概率增加,和/或以使得除所述目标样本词语外的其他样本候选词语的第二概率降低。Adjusting the parameters of the probability model to increase the second probability of the target sample word, and/or to decrease the second probability of other sample candidate words except the target sample word.
  16. 根据权利要求11至15中任意一项所述方法,其特征在于,所述获取样本字符串序列包括:The method according to any one of claims 11 to 15, wherein said acquiring a sample character string sequence comprises:
    基于K个目标样本词语获取所述样本字符串序列中的K个样本字符串。K sample character strings in the sequence of sample character strings are obtained based on the K target sample words.
  17. 根据权利要求11至16中任意一项所述方法,其特征在于,所述样本字符串包括一个拼音或多个拼音。The method according to any one of claims 11 to 16, wherein the sample character string includes one pinyin or multiple pinyins.
  18. 一种词句生成装置,其特征在于,包括:A word and sentence generating device is characterized in that, comprising:
    第一获取单元,用于获取字符串序列,所述字符串序列包括M个字符串,每个所述字 符串指示一个或多个候选词语,其中,M为正整数;The first obtaining unit is used to obtain a sequence of character strings, the sequence of character strings includes M character strings, and each of the character strings indicates one or more candidate words, wherein M is a positive integer;
    第一编码单元,用于根据所述字符串序列,通过编码器,得到M个第一字符串向量,每个所述第一字符串向量对应所述M个字符串中的一个字符串;The first encoding unit is configured to obtain M first character string vectors through an encoder according to the character string sequence, and each of the first character string vectors corresponds to one of the M character strings;
    第二获取单元,用于基于所述M个第一字符串向量,获取所述M个字符串指示的每个候选词语的第一概率;A second acquiring unit, configured to acquire the first probability of each candidate word indicated by the M character strings based on the M first character string vectors;
    生成单元,用于基于所述第一概率,生成目标词句,所述目标词句包括M个目标词语,每个所述目标词语为所述每个字符串指示的一个或多个候选词语中的一个。A generation unit, configured to generate target words and sentences based on the first probability, the target words and sentences include M target words, and each of the target words is one of the one or more candidate words indicated by each character string .
  19. 一种模型训练装置,其特征在于,包括:A model training device, characterized in that it comprises:
    第三获取单元,用于获取样本字符串序列,所述样本字符串序列包括K个样本字符串,每个所述样本字符串指示一个或多个样本候选词语,其中,K为正整数;A third acquiring unit, configured to acquire a sequence of sample character strings, the sequence of sample character strings includes K sample character strings, each of which indicates one or more sample candidate words, where K is a positive integer;
    第二编码单元,用于根据所述样本字符串序列,通过编码器,得到K个第一样本字符串向量,每个所述样本字符串向量对应一个所述样本字符串;The second encoding unit is configured to obtain K first sample string vectors through an encoder according to the sample string sequence, and each of the sample string vectors corresponds to one of the sample strings;
    第四获取单元,用于基于所述K个第一样本字符串向量,获取所述K个样本字符串指示的每个样本候选词语的第二概率;A fourth acquiring unit, configured to acquire the second probability of each sample candidate word indicated by the K sample character strings based on the K first sample character string vectors;
    调整单元,用于基于所述第二概率,对所述编码器进行调整。An adjusting unit, configured to adjust the encoder based on the second probability.
  20. 一种计算机设备,其特征在于,包括:一个或多个处理器和存储器;其中,所述存储器中存储有计算机可读指令;A computer device, characterized by comprising: one or more processors and a memory; wherein computer-readable instructions are stored in the memory;
    所述一个或多个处理器读取所述计算机可读指令，以使所述计算机设备实现如权利要求1至10中任一项所述的方法。The one or more processors read the computer-readable instructions to cause the computer device to implement the method according to any one of claims 1 to 10.
  21. 一种训练设备,其特征在于,包括:一个或多个处理器和存储器;其中,所述存储器中存储有计算机可读指令;A training device, characterized in that it comprises: one or more processors and a memory; wherein computer-readable instructions are stored in the memory;
    所述一个或多个处理器读取所述计算机可读指令，以使所述训练设备实现如权利要求11至17中任一项所述的方法。The one or more processors read the computer-readable instructions to cause the training device to implement the method according to any one of claims 11 to 17.
  22. 一种计算机可读存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算机上运行时,使得所述计算机执行如权利要求1至17中任一项所述的方法。A computer-readable storage medium, characterized in that it includes computer-readable instructions, and when the computer-readable instructions are run on a computer, the computer executes the method according to any one of claims 1 to 17 .
  23. 一种计算机程序产品,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算机上运行时,使得所述计算机执行如权利要求1至17中任一项所述的方法。A computer program product, characterized by comprising computer-readable instructions, which, when the computer-readable instructions are run on a computer, cause the computer to execute the method according to any one of claims 1 to 17.
PCT/CN2022/104334 2021-07-08 2022-07-07 Word or sentence generation method, model training method and related device WO2023280265A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110775982.1A CN113655893A (en) 2021-07-08 2021-07-08 Word and sentence generation method, model training method and related equipment
CN202110775982.1 2021-07-08

Publications (1)

Publication Number Publication Date
WO2023280265A1 (en)

Family

ID=78489258

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104334 WO2023280265A1 (en) 2021-07-08 2022-07-07 Word or sentence generation method, model training method and related device

Country Status (2)

Country Link
CN (1) CN113655893A (en)
WO (1) WO2023280265A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113655893A (en) * 2021-07-08 2021-11-16 华为技术有限公司 Word and sentence generation method, model training method and related equipment
CN116306612A (en) * 2021-12-21 2023-06-23 华为技术有限公司 Word and sentence generation method and related equipment
CN115114915B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Phrase identification method, device, equipment and medium
CN117408650B (en) * 2023-12-15 2024-03-08 辽宁省网联数字科技产业有限公司 Digital bidding document making and evaluating system based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678560A (en) * 2017-08-31 2018-02-09 科大讯飞股份有限公司 The candidate result generation method and device of input method, storage medium, electronic equipment
CN109739370A (en) * 2019-01-10 2019-05-10 北京帝派智能科技有限公司 A kind of language model training method, method for inputting pinyin and device
CN110286778A (en) * 2019-06-27 2019-09-27 北京金山安全软件有限公司 Chinese deep learning input method and device and electronic equipment
CN110874145A (en) * 2018-08-30 2020-03-10 北京搜狗科技发展有限公司 Input method and device and electronic equipment
US20200335096A1 (en) * 2018-04-19 2020-10-22 Boe Technology Group Co., Ltd. Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog
CN111967248A (en) * 2020-07-09 2020-11-20 深圳价值在线信息科技股份有限公司 Pinyin identification method and device, terminal equipment and computer readable storage medium
CN113655893A (en) * 2021-07-08 2021-11-16 华为技术有限公司 Word and sentence generation method, model training method and related equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071342A (en) * 2007-06-01 2007-11-14 腾讯科技(深圳)有限公司 Method for providing candidate whole sentence in input method and word input system
CN110569505B (en) * 2019-09-04 2023-07-28 平顶山学院 Text input method and device
CN110673748B (en) * 2019-09-27 2023-04-28 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method
CN112506359B (en) * 2020-12-21 2023-07-21 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method and electronic equipment

Also Published As

Publication number Publication date
CN113655893A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
WO2023280265A1 (en) Word or sentence generation method, model training method and related device
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11928439B2 (en) Translation method, target information determining method, related apparatus, and storage medium
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
JP6916264B2 (en) Real-time speech recognition methods based on disconnection attention, devices, equipment and computer readable storage media
CN106910497B (en) Chinese word pronunciation prediction method and device
US10719668B2 (en) System for machine translation
WO2022121166A1 (en) Method, apparatus and device for predicting heteronym pronunciation, and storage medium
CN113205817B (en) Speech semantic recognition method, system, device and medium
WO2019154210A1 (en) Machine translation method and device, and computer-readable storage medium
Tran et al. A hierarchical neural model for learning sequences of dialogue acts
WO2022121251A1 (en) Method and apparatus for training text processing model, computer device and storage medium
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
KR101627428B1 (en) Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method
US10152298B1 (en) Confidence estimation based on frequency
CN111581970B (en) Text recognition method, device and storage medium for network context
CN113053367B (en) Speech recognition method, speech recognition model training method and device
JP7337979B2 (en) Model training method and apparatus, text prediction method and apparatus, electronic device, computer readable storage medium, and computer program
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN112818118A (en) Reverse translation-based Chinese humor classification model
Hsueh et al. A Task-oriented Chatbot Based on LSTM and Reinforcement Learning
CN113823265A (en) Voice recognition method and device and computer equipment
CN113362809B (en) Voice recognition method and device and electronic equipment
CN112466282B (en) Speech recognition system and method oriented to aerospace professional field
CN115129819A (en) Text abstract model production method and device, equipment and medium thereof

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22837010

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE