CN114417892B - Generation model of small sample multi-turn conversation for E-commerce live broadcast scene - Google Patents


Info

Publication number
CN114417892B
CN114417892B
Authority
CN
China
Prior art keywords
model
words
word
context
reply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210091152.1A
Other languages
Chinese (zh)
Other versions
CN114417892A (en)
Inventor
宫明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202210091152.1A
Publication of CN114417892A
Application granted
Publication of CN114417892B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a generation model for small-sample multi-turn dialogue in e-commerce live broadcast scenes. A Chinese vocabulary containing both characters and words is constructed with a unigram language model; the input text is segmented with jieba according to this vocabulary, and the input is represented by the characters and words obtained after segmentation. The sum of the character/word, role, turn, and position embeddings is fed to the model as the input representation. The model consists of 12 Transformer blocks; in each block the decoder and the encoder are fused together, so that context understanding and reply generation share parameters. Two self-attention masks are used in each block to control which context words the current word may attend to: words in context positions see all context words, while words in reply positions see only the preceding words. The last layer outputs the hidden state corresponding to each word. The invention uses dialogues from real e-commerce live broadcast scenes and, by means of prompting, realizes a dialogue system on a dataset with only a small number of samples.

Description

Generation model of small sample multi-turn conversation for E-commerce live broadcast scene
Technical Field
The invention belongs to the technical field of conversation systems, and particularly relates to a small-sample multi-turn conversation generation model for an E-commerce live broadcast scene.
Background
Current dialogue systems, whether chit-chat (e.g., BlenderBot) or task-oriented (e.g., MinTL), require a large dialogue dataset for fine-tuning a language generation model. Fine-tuning these generative models on large datasets is expensive: collecting a large domain-specific dataset takes substantial manpower and material resources, and the fine-tuning itself demands large computational resources and a great deal of time. To avoid the overhead of large training sets and fine-tuning, a learning method is adopted that requires no gradient-based fine-tuning and instead places a small number of samples in the context of the generative model, that is, few-shot prompting.
Existing pre-trained open-domain generation models are trained on datasets such as Persona-Chat, DailyDialog, and Wizard of the Internet (WiT), which contain no data from e-commerce live broadcast scenes, so these pre-trained models cannot adequately handle the dialogue tasks of e-commerce live broadcasts.
Therefore, how to provide a generation model for small-sample multi-turn dialogue in e-commerce live broadcast scenes has become a problem urgently awaiting a solution by those skilled in the art.
Disclosure of Invention
In view of this, the present invention trains a dialogue generation model on dialogues from real e-commerce live broadcast scenes, using collected and processed live e-commerce dialogue data, and realizes a dialogue system on a dataset with only a small number of samples by means of prompting.
To achieve the above purpose, the invention adopts the following technical scheme:
a generation model of a small sample multi-turn conversation for a live E-commerce scene is disclosed, wherein a Chinese word list containing characters and words is constructed for an input text by using a unary language model, the input text is subjected to word segmentation by using a jieba, and the input is represented by using the characters and words obtained by word segmentation; the embedded sum of characters or words, roles, turns and positions is input to the model as an embedded representation; the model comprises 12 transform blocks in total, wherein a decoder and an encoder are fused together in each block, and parameter sharing can be realized by realizing context understanding and generating a reply; two self-attention masks are used in each block to control the access of the current word to the context words; the words in the context position can see all the context words, and the words in the reply position can only see the previous words; and outputting the corresponding hidden state of each word at the last layer.
Further, the model comprises 12 layers, each containing two parts: context understanding and reply generation. The context understanding part adopts an encoder structure, so the current word can see the content both before and after it; reply generation uses one-way decoding, so each word can only see the content before it.
Further, the training objective is to minimize the negative log-likelihood loss:

$\mathcal{L}(\theta) = -\mathbb{E}_{(c,r)\sim D}\left[\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})\right]$

where $\theta$ denotes the training parameters of the dialogue generation model and $D$ denotes the training data; the dialogue context $c$ and the target reply $r$ are fed into the network in pairs; $T$ denotes the length of the generated target reply $r$, and $r_{<t}$ denotes the reply words generated before the $t$-th word; $p_\theta(r_t \mid c, r_{<t})$ denotes the probability distribution of the $t$-th word given the context $c$ and the $t-1$ reply words before it. From the first to the $T$-th word of the generated reply, the $T$ probability distributions are multiplied together and the logarithm of the whole is taken, giving

$\log \prod_{t=1}^{T} p_\theta(r_t \mid c, r_{<t}) = \sum_{t=1}^{T} \log p_\theta(r_t \mid c, r_{<t})$

Averaging over the $T$ positions and negating yields the formula:

$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})$
furthermore, the input part imitates the pretreatment process of BERT, the input text is participated by jieba, a Chinese word list containing characters and words is constructed by referring to a unary language model during word segmentation, and the input is characterized by using the participated characters and words; the input different from BERT is characterized by the sum of three parts of word embedding, role embedding and position embedding, and the input part of the model fuses the round number embedding.
Further, for the generation model, in the inference stage a reply is produced by decoding; beam search is used for decoding with k = 4, where k denotes the beam-size hyper-parameter of the beam search algorithm.
After the model is trained on e-commerce data, inference can be performed directly with prompts. After initializing the model, the sample data are spliced in the form context + samples, then spliced with the user's input data and fed to the trained model for generation. For e-commerce dialogues, generation can proceed directly in this way, with no need to fine-tune on a large amount of data.
The specific process, using the model trained on e-commerce data, is as follows (a sketch of these steps is given after the list):
1. Initialize the trained model.
2. Splice the sample data into a prompt. For the 1-sample case, the splice is:
prompt1 = "context" + "user: Has the shop owner ever kept a pet?" + "shop owner: I once kept a beagle." + "user: What a coincidence, I also want to keep a bulldog." + "shop owner: Wow, what a coincidence."
If there are multiple samples: prompt = prompt1 + prompt2 + ... + prompt_n.
3. Take the user input usr_input.
4. input = prompt + usr_input
5. Tokenize and numericalize input, then feed it into the model for generation.
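This process can be sketched in Python. The sketch is illustrative only: tokenizer and dialog_model are hypothetical handles to the trained e-commerce model, and the generate() call follows the common Hugging Face convention (num_beams=4 matching the beam size used at inference), which is an assumption rather than the patent's own API.

    def build_prompt(samples):
        # Each sample is one (user utterance, shop-owner reply) exchange;
        # multiple samples are concatenated as prompt1 + prompt2 + ... + prompt_n.
        # The leading "context" string follows the template in step 2 above.
        parts = ["context"]
        for user_turn, owner_turn in samples:
            parts.append("user: " + user_turn)
            parts.append("shop owner: " + owner_turn)
        return " ".join(parts)

    def generate_reply(tokenizer, dialog_model, samples, usr_input):
        # input = prompt + usr_input (steps 3-4), then tokenize and generate.
        prompt = build_prompt(samples)
        model_input = prompt + " user: " + usr_input + " shop owner:"
        ids = tokenizer(model_input, return_tensors="pt").input_ids
        out = dialog_model.generate(ids, num_beams=4, max_new_tokens=64)
        return tokenizer.decode(out[0], skip_special_tokens=True)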
The invention has the beneficial effects that:
the method is characterized in that a generation model based on the E-commerce live broadcast field is trained by using an unefielded LM model based on an E-commerce live broadcast data set, a prompt method is used in a fusion mode during model training, a conversation task can be completed by using a small amount of samples, and the problems that a large data set is used in traditional fine tuning and the cost of fine tuning the model on hardware resources and time are solved.
Drawings
In order to illustrate the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of the network architecture of the present invention.
FIG. 2 is a first drawing illustrating the attention between the layers of the generative model of the invention.
FIG. 3 is a second drawing of the attention between layers of the generative model of the invention.
FIG. 4 is a representation of the input of the present invention as the sum of word embedding, role embedding, turn embedding, and position embedding.
FIG. 5 is a diagram of training data transformation according to the present invention.
Fig. 6 is a learning diagram of the sample-based prompt method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a generation model for small-sample multi-turn dialogue in e-commerce live broadcast scenes. A unigram language model is used to construct a Chinese vocabulary containing both characters and words; jieba segments the input text with reference to this vocabulary, and the input is represented by the characters and words obtained after segmentation. The sum of the character/word, role, turn, and position embeddings is fed to the model as the input representation. The model consists of 12 Transformer blocks; in each block the decoder and the encoder are fused together, so that context understanding and reply generation share parameters. Two self-attention masks are used in each block to control which context words the current word may attend to: words in context positions see all context words, while words in reply positions see only the preceding words. The last layer outputs the hidden state corresponding to each word.
Inspired by UnifiedLM, the present invention employs a flexible model that combines the unidirectional and bidirectional attention mechanisms, i.e., the encoder and decoder are integrated into the same module. Unlike UnifiedLM, which uses WordPiece to segment the input, the vocabulary here is constructed with a method based on a unigram language model, and jieba segments words with reference to this vocabulary. Chinese, unlike English, has no spaces as separators, and Chinese words usually contain several characters; if the WordPiece segmentation scheme were adopted, the semantic information of some words would be lost. To solve this problem, the invention uses a unigram language model to construct a Chinese subword dictionary containing both characters and words, and input sentences are segmented using jieba.
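For concreteness, vocabulary construction and segmentation along these lines can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes the sentencepiece library for the unigram language model and the jieba library for word segmentation, and the corpus file name and vocabulary size are placeholders.

    import jieba
    import sentencepiece as spm

    # Train a unigram-language-model subword vocabulary over the dialogue
    # corpus; "corpus.txt" and vocab_size=30000 are illustrative placeholders.
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="zh_unigram",
        model_type="unigram",
        vocab_size=30000,
    )
    sp = spm.SentencePieceProcessor(model_file="zh_unigram.model")

    def segment(text):
        # jieba proposes word boundaries; words found in the unigram
        # vocabulary are kept whole, others fall back to finer subword
        # units (ultimately single characters), so the token stream mixes
        # characters and words as described above.
        tokens = []
        for word in jieba.lcut(text):
            if sp.piece_to_id(word) != sp.unk_id():
                tokens.append(word)
            else:
                tokens.extend(sp.encode(word, out_type=str))
        return tokens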
In addition, UnifiedLM is initialized with BERT-Large and trained on English corpora in the manner of a masked language model. The present invention instead uses a large number of multi-turn dialogues collected from various Chinese e-commerce live broadcast scenes as the training corpus and trains the model directly for generation. For a scene containing n turns of dialogue, the model is trained to generate the n-th turn taking the preceding n-1 turns as the context input.
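As an illustration of this data construction (a sketch only; the dialogue is assumed to be represented as a simple list of utterance strings):

    def dialog_to_training_pair(turns):
        # turns: the n utterances of one multi-turn dialogue, in order.
        # The first n-1 turns form the context c; the n-th turn is the
        # target reply r to be generated.
        context = turns[:-1]
        reply = turns[-1]
        return context, reply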
Generative training of UnifiedLM on a multi-turn dialogue corpus has two advantages. First, sentences of similar length can be grouped together, reducing the amount of padding and hence invalid computation (a bucketing sketch is given below). Second, context understanding and reply generation can share parameters through the flexible self-attention mask mechanism.
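The first advantage, grouping by length, is a standard bucketing trick; a minimal sketch (the batch size is an illustrative placeholder):

    def bucket_batches(pairs, batch_size=32):
        # Sort (context, reply) pairs by total text length (a proxy for
        # token count) so that each batch holds samples of similar length,
        # minimizing padding positions and hence invalid computation.
        ordered = sorted(pairs,
                         key=lambda p: sum(len(u) for u in p[0]) + len(p[1]))
        return [ordered[i:i + batch_size]
                for i in range(0, len(ordered), batch_size)]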
Referring to figs. 2-3, attention between the layers of the generation model is handled as follows. The model comprises 12 layers, each containing two parts: context understanding and reply generation. The context understanding part adopts an encoder structure, so the current word can see the content both before and after it; reply generation uses one-way decoding, so each word can only see the content before it.
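The two masks can be made concrete with a small numpy sketch. This is an assumption-level illustration of the UniLM-style masking rule described above, with positions ordered as context tokens followed by reply tokens:

    import numpy as np

    def build_self_attention_mask(c_len, r_len):
        # Rows are query positions, columns are key positions; True means
        # "may attend". Context tokens (first c_len positions) attend
        # bidirectionally to the whole context; reply tokens attend to the
        # whole context plus only the reply tokens up to themselves.
        n = c_len + r_len
        mask = np.zeros((n, n), dtype=bool)
        mask[:, :c_len] = True                 # every token sees the context
        for t in range(r_len):                 # reply position t additionally
            mask[c_len + t, c_len:c_len + t + 1] = True  # sees replies <= t
        return mask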
The training objective is to minimize the negative log-likelihood loss:

$\mathcal{L}(\theta) = -\mathbb{E}_{(c,r)\sim D}\left[\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})\right]$

where $\theta$ denotes the training parameters of the dialogue generation model and $D$ denotes the training data; the dialogue context $c$ and the target reply $r$ are fed into the network in pairs; $T$ denotes the length of the generated target reply $r$, and $r_{<t}$ denotes the reply words generated before the $t$-th word; $p_\theta(r_t \mid c, r_{<t})$ denotes the probability distribution of the $t$-th word given the context $c$ and the $t-1$ reply words before it. From the first to the $T$-th word of the generated reply, the $T$ probability distributions are multiplied together and the logarithm of the whole is taken, giving

$\log \prod_{t=1}^{T} p_\theta(r_t \mid c, r_{<t}) = \sum_{t=1}^{T} \log p_\theta(r_t \mid c, r_{<t})$

The average over the $T$ positions is then taken; since training minimizes the loss, the quantity is negated, i.e., the negative log-likelihood is minimized. The formula is obtained:

$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})$
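In a framework such as PyTorch this objective reduces to token-level cross-entropy over the reply positions. A minimal sketch, assuming the model has already produced one row of logits per reply position:

    import torch.nn.functional as F

    def reply_nll_loss(reply_logits, reply_ids):
        # reply_logits: tensor (T, vocab_size), the distributions
        #               p_theta(r_t | c, r_<t) for t = 1..T.
        # reply_ids:    tensor (T,), the gold token ids of the target reply.
        # cross_entropy averages -log p(r_t) over the T positions, which is
        # exactly the negated, length-averaged log-likelihood above.
        return F.cross_entropy(reply_logits, reply_ids)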
The input part of the invention follows the preprocessing procedure of BERT: a unigram language model is used to construct a Chinese vocabulary containing both characters and words, jieba segments the input text with reference to this vocabulary, and the characters and words obtained after segmentation represent the input. The model input is represented by the sum of three parts, word embedding, role embedding, and position embedding, with a turn-number embedding fused in, as shown in fig. 4.
Word embedding: the context contains the previous turns of dialogue; a special end-of-utterance [EOU] marker is spliced onto the end of each utterance, and a begin-of-utterance [BOU] marker is spliced onto the beginning of the reply, the hidden state of this marker being used to predict the next word during generation. The end of the reply is also spliced with the [EOU] marker, which at generation time signals the end of generation.
Role embedding: used to distinguish the roles in a dialogue; E_A represents one role and E_B represents the other. Introducing role embeddings provides feature information that allows the model to better distinguish the context.
Turn embedding: represents the turn number in the interactive dialogue; the currently generated reply is denoted E[0], the previous sentence E[-1], the sentence before that E[-2], and so on. Relative rather than absolute turn numbers are used, so that E[0] is always assigned to the reply being generated, ensuring that reply generation is not affected by the other turns in the dialogue.
Position embedding: added within the corresponding utterance; as in the context of fig. 4, which contains two utterances, positions in each utterance are numbered starting from 0.
The embedding of the input is the sum of the word, role, turn, and position embeddings; these four embeddings are added and input to the Transformer blocks.
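A PyTorch sketch of this four-part input layer (all dimension sizes are illustrative placeholders, and the offsetting of relative turn indices such as -2, -1, 0 into a non-negative lookup range is an assumption):

    import torch.nn as nn

    class DialogInputEmbedding(nn.Module):
        # Sums word, role, turn, and position embeddings into one input
        # representation, mirroring the four-part scheme described above.
        def __init__(self, vocab_size=30000, d_model=768,
                     n_roles=2, max_turns=32, max_len=256):
            super().__init__()
            self.word = nn.Embedding(vocab_size, d_model)
            self.role = nn.Embedding(n_roles, d_model)
            self.turn = nn.Embedding(max_turns, d_model)
            self.pos = nn.Embedding(max_len, d_model)

        def forward(self, word_ids, role_ids, turn_ids, pos_ids):
            return (self.word(word_ids) + self.role(role_ids)
                    + self.turn(turn_ids) + self.pos(pos_ids))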
The idea of prompting is adopted to transform each training datum as shown in fig. 5, i.e., the dialogue data are embedded into a fixed template and then used as the training data.
Fig. 6 shows dialogue data between viewers and the dialogue system about pet merchandise in an e-commerce live broadcast scene. The model is learned under three sample settings: zero samples means no sample is given and the pre-trained model directly performs dialogue generation; one sample means dialogue generation given a single sample; a small number of samples means dialogue generation given several samples.
One or more samples are randomly selected from the validation set and converted into the format on the right side of fig. 5, then spliced in front of the test-set sentences before being input to the model to generate output. On the test set, perplexity is used as the evaluation criterion. The resulting perplexity is smaller than the PPL obtained by fine-tuning this pre-trained generation model with the validation set, demonstrating the feasibility of the few-shot prompting method.
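Perplexity here is the exponential of the average negative log-likelihood per token, so it follows directly from the loss defined earlier; a one-line sketch:

    import math

    def perplexity(mean_nll):
        # mean_nll: length-averaged negative log-likelihood on the test set.
        return math.exp(mean_nll)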
In the dialogue inference process, owing to the limitation that the maximum input length of the model is 256 words, the sample size is finally set to 3; that is, 3 turns of historical dialogue are spliced in front of the sentence input to the model, in the format shown on the right side of fig. 5.
Based on an e-commerce live broadcast dataset, a generation model for the e-commerce live broadcast field is trained using the UnifiedLM model; a prompting method is fused into model training so that dialogue tasks can be completed with a small number of samples, solving both the reliance of traditional fine-tuning on large datasets and the hardware-resource and time costs of fine-tuning the model. The invention uses dialogues from real e-commerce live broadcast scenes, trains a dialogue generation model on collected and processed live e-commerce dialogue data, and realizes a dialogue system on a dataset with only a small number of samples by means of prompting.
Live e-commerce selling has grown increasingly popular in recent years, and each live broadcast draws a large audience. If every viewer question had to be answered by the streamer, the streamer could not devote attention to managing the atmosphere of the broadcast and explaining the merchandise. Aiming at this problem, a large amount of dialogue and question-answer data from real e-commerce live broadcasts was collected, and a generation model for e-commerce live broadcast scenes was trained. Using this model, a multi-turn dialogue system for live e-commerce was developed. During a broadcast, the system automatically replies to chit-chat questions from viewers that are unrelated to the merchandise itself; if a question concerns the merchandise, the system prompts the streamer to answer it. This generation model trained on e-commerce live broadcast data solves the problem that existing generation models are not competent for e-commerce live broadcast chit-chat.
The current mainstream approach is to fine-tune a pre-trained model, which requires two basic conditions: a good dataset suited to the target scene, and good computational resources, since resource requirements grow with the parameter counts of current pre-trained models. Collecting and organizing datasets costs a great deal of manpower and material resources, and good computational resources also require substantial economic investment. To solve these problems, a few-shot prompting generation method is proposed that can handle question answering in the e-commerce live broadcast field.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (3)

1. A generation model for small-sample multi-turn dialogue in e-commerce live broadcast scenes, characterized in that a unigram language model is used to construct a Chinese vocabulary containing both characters and words; jieba segments the input text with reference to this vocabulary, and the input is represented by the characters and words obtained after segmentation; the sum of the character/word, role, turn, and position embeddings is fed to the model as the input representation; the model consists of 12 Transformer blocks, and in each block the decoder and the encoder are fused together, so that context understanding and reply generation share parameters; two self-attention masks are used in each block to control which context words the current word may attend to: words in context positions see all context words, while words in reply positions see only the preceding words; the last layer outputs the hidden state corresponding to each word; the training objective is to minimize the negative log-likelihood loss:

$\mathcal{L}(\theta) = -\mathbb{E}_{(c,r)\sim D}\left[\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})\right]$

where $\theta$ denotes the training parameters of the dialogue generation model and $D$ denotes the training data; the dialogue context $c$ and the target reply $r$ are fed into the network in pairs; $T$ denotes the length of the generated target reply $r$, and $r_{<t}$ denotes the reply words generated before the $t$-th word; $p_\theta(r_t \mid c, r_{<t})$ denotes the probability distribution of the $t$-th word given the context $c$ and the $t-1$ reply words before it; from the first to the $T$-th word of the generated reply, the $T$ probability distributions are multiplied together and the logarithm of the whole is taken, giving

$\log \prod_{t=1}^{T} p_\theta(r_t \mid c, r_{<t}) = \sum_{t=1}^{T} \log p_\theta(r_t \mid c, r_{<t})$

and averaging over the $T$ positions and negating yields the formula:

$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(r_t \mid c, r_{<t})$

for the generation model, in the inference stage a reply is generated by decoding; beam search is used for decoding with k = 4, where k denotes the beam-size hyper-parameter of the beam search algorithm;

after the model is trained on e-commerce data, prompts can be used to perform inference directly: after the model is initialized, the sample data are spliced in the form context + samples, then spliced with the user input data and fed to the trained model for generation.
2. The generation model of small-sample multi-turn dialogue for e-commerce live broadcast scenes according to claim 1, characterized in that the model comprises 12 layers, each containing a context understanding part and a reply generation part; the context understanding part adopts an encoder structure, so the current word can see the content before and after it; reply generation uses one-way decoding, so each word can only see the content before it.
3. The generation model of small-sample multi-turn dialogue for e-commerce live broadcast scenes according to claim 1, characterized in that the input part follows the preprocessing procedure of BERT: the input text is segmented with the jieba library with reference to a vocabulary constructed using a unigram language model; differing from BERT, the input is represented by the sum of three parts, word embedding, role embedding, and position embedding, and the input part of the model fuses a turn-number embedding.
CN202210091152.1A 2022-01-27 2022-01-27 Generation model of small sample multi-turn conversation for E-commerce live broadcast scene Active CN114417892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210091152.1A CN114417892B (en) 2022-01-27 2022-01-27 Generation model of small sample multi-turn conversation for E-commerce live broadcast scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210091152.1A CN114417892B (en) 2022-01-27 2022-01-27 Generation model of small sample multi-turn conversation for E-commerce live broadcast scene

Publications (2)

Publication Number Publication Date
CN114417892A CN114417892A (en) 2022-04-29
CN114417892B true CN114417892B (en) 2022-08-02

Family

ID=81277580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210091152.1A Active CN114417892B (en) 2022-01-27 2022-01-27 Generation model of small sample multi-turn conversation for E-commerce live broadcast scene

Country Status (1)

Country Link
CN (1) CN114417892B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690553B (en) * 2023-01-03 2023-04-11 华南理工大学 Emotion analysis method and system based on multi-modal dialog content combined modeling

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666385A (en) * 2019-03-07 2020-09-15 南京邮电大学 Customer service question-answering system based on deep learning and implementation method
CN112084791A (en) * 2020-08-31 2020-12-15 北京洛必德科技有限公司 Dialog process intention extraction and utterance prompting method and system and electronic equipment thereof
CN112214593A (en) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Question and answer processing method and device, electronic equipment and storage medium
CN112818107A (en) * 2021-02-24 2021-05-18 中国人民大学 Conversation robot for daily life and chat method thereof
CN113220856A (en) * 2021-05-28 2021-08-06 天津大学 Multi-round dialogue system based on Chinese pre-training model
US11120226B1 (en) * 2018-09-04 2021-09-14 ClearCare, Inc. Conversation facilitation system for mitigating loneliness

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552543B2 (en) * 2017-05-10 2020-02-04 International Business Machines Corporation Conversational authoring of event processing applications
US10679000B2 (en) * 2018-01-09 2020-06-09 International Business Machines Corporation Interpreting conversational authoring of information models

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11120226B1 (en) * 2018-09-04 2021-09-14 ClearCare, Inc. Conversation facilitation system for mitigating loneliness
CN111666385A (en) * 2019-03-07 2020-09-15 南京邮电大学 Customer service question-answering system based on deep learning and implementation method
CN112084791A (en) * 2020-08-31 2020-12-15 北京洛必德科技有限公司 Dialog process intention extraction and utterance prompting method and system and electronic equipment thereof
CN112214593A (en) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Question and answer processing method and device, electronic equipment and storage medium
CN112818107A (en) * 2021-02-24 2021-05-18 中国人民大学 Conversation robot for daily life and chat method thereof
CN113220856A (en) * 2021-05-28 2021-08-06 天津大学 Multi-round dialogue system based on Chinese pre-training model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Implementation of a Chinese text corpus preprocessing module based on jieba Chinese word segmentation; Shi Fenggui; Computer Knowledge and Technology; 2020-05-15 (No. 14); full text *
Research on FAQ knowledge base construction technology based on dialogue text; Li Bo; China Masters' Theses Full-text Database, Information Science and Technology; 2021-05-30; full text *
A personalized dialogue content generation method based on deep learning; Wang Hao et al.; Journal of Graphics; 2020-04-30; full text *
Open-domain multi-turn dialogue based on generative adversarial networks; Shi Dai; China Masters' Theses Full-text Database, Information Science and Technology; 2021-05-30; full text *
Research on Chinese automatic question answering applications based on Baidu web pages; Shi Fenggui; Modern Computer; 2020-03-15 (No. 08); full text *
New word discovery in Chinese live-streaming bullet comments based on boundary enhancement; Wang Xuerui et al.; Transducer and Microsystem Technologies; 2018-07-12 (No. 07); full text *

Also Published As

Publication number Publication date
CN114417892A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
Shen et al. Blank language models
Nallapati et al. Abstractive text summarization using sequence-to-sequence rnns and beyond
Huang et al. Attention assisted discovery of sub-utterance structure in speech emotion recognition.
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN110110337B (en) Translation model training method, medium, device and computing equipment
CN114023316B (en) TCN-transducer-CTC-based end-to-end Chinese speech recognition method
CN110516244B (en) Automatic sentence filling method based on BERT
Zhao et al. Hierarchical attention transfer networks for depression assessment from speech
CN112989796B (en) Text naming entity information identification method based on syntactic guidance
Nguyen et al. From film to video: Multi-turn question answering with multi-modal context
Chen et al. Delving deeper into the decoder for video captioning
CN112364148B (en) Deep learning method-based generative chat robot
CN116484879A (en) Prompt message generation method and device, electronic equipment and storage medium
Inaguma et al. Orthros: Non-autoregressive end-to-end speech translation with dual-decoder
Liu Neural question generation based on Seq2Seq
CN111782788A (en) Automatic emotion reply generation method for open domain dialogue system
CN114417892B (en) Generation model of small sample multi-turn conversation for E-commerce live broadcast scene
CN113779310A (en) Video understanding text generation method based on hierarchical representation network
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN115935975A (en) Controllable-emotion news comment generation method
Raj et al. Deep learning based video captioning in bengali
Riou et al. Online adaptation of an attention-based neural network for natural language generation
Chowanda et al. Generative Indonesian conversation model using recurrent neural network with attention mechanism
Yamazaki et al. Audio visual scene-aware dialog generation with transformer-based video representations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 911, 9th Floor, Block B, Xingdi Center, Building 2, No.10, Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000

Patentee after: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Country or region after: China

Address before: 100000 room 311a, floor 3, building 4, courtyard 4, Yongchang Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Patentee before: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address