Detailed Description
For the purposes of making the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of this disclosure without inventive effort fall within the scope of this disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the method, the keywords contained in the text generation request input by the user are expanded, and the corresponding text is generated by a seq2seq model with a Transformer structure, thereby improving the speed and accuracy of text generation.
FIG. 1 illustrates a schematic diagram of an exemplary operating environment 100 in which embodiments of the present disclosure can be implemented. As shown in FIG. 1, the operating environment 100 may include a terminal device 102, a network 104, and a server 106. The network 104 is the medium used to provide communication links between the terminal device 102 and the server 106. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others. The terminal device 102 may be any of a variety of electronic devices with a display screen, including, but not limited to, a desktop computer, a portable computer, a smart phone, a tablet computer, and the like. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 106 may be a server cluster or a cloud server formed by a plurality of servers.
The document generation method provided by the embodiments of the present disclosure is generally executed by the server 106, and accordingly, the target document generation apparatus is generally disposed in the server 106. However, it is easily understood by those skilled in the art that the document generating method provided in the embodiment of the present disclosure may be performed by the terminal device 102, and accordingly, the document generating apparatus may also be provided in the terminal device 102, which is not particularly limited in the present exemplary embodiment. For example, in an exemplary embodiment, the user may upload the video and/or the picture to the server 106 through the terminal device 102, and the server 106 generates a corresponding document through the document generating method provided by the embodiment of the present disclosure, and transmits the generated document to the terminal device 102.
It should be understood that the application scenario shown in FIG. 1 is only one example in which embodiments of the present disclosure may be implemented. The application scope of the embodiments of the present disclosure is not limited in any respect by this application scenario.
FIG. 2 shows a flowchart of a document generation method 200 according to an embodiment of the present disclosure. The method 200 may be performed by the server 106 in FIG. 1.
At block 210, a document generation request sent by a user terminal is received, where the document generation request includes a video and/or a picture input by a user;
in some embodiments, the document generation request includes a video and/or a picture corresponding to the target item, for example, a video of a dress and/or one or more pictures of a dress.
In some embodiments, the target item may refer to an object that needs to be displayed or introduced; for example, the target item may be a commodity offered for sale, a newly produced product, or a tradable virtual object, which is not particularly limited in the embodiments of the present disclosure.
At block 220, information extraction is performed on the video and/or the picture input by the user, so as to obtain a corresponding embedded vector;
in some embodiments, information extraction is performed by a pre-trained neural network model whose input is a picture. Therefore, frames need to be extracted from the video input by the user, and information extraction is then performed on the extracted frames and/or on the pictures input by the user.
In some embodiments, frame extraction may use a number of different strategies, such as extracting only the first frame, or sampling frames (e.g., at a fixed frame interval or time interval), among others.
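The frame-sampling strategies above can be sketched as follows. This is a minimal illustration; the function name and parameters are assumptions for exposition, not the patent's API, and a real implementation would decode the frames with a video library.

```python
def sample_frame_indices(total_frames, strategy="first", frame_interval=30,
                         fps=25.0, time_interval=2.0):
    """Return the indices of the frames to extract from a video.

    Illustrates the three sampling strategies described above:
    first frame only, fixed frame interval, and fixed time interval.
    """
    if strategy == "first":           # keep only the first frame
        return [0]
    if strategy == "frame_interval":  # every `frame_interval`-th frame
        return list(range(0, total_frames, frame_interval))
    if strategy == "time_interval":   # one frame every `time_interval` seconds
        step = max(1, int(round(fps * time_interval)))
        return list(range(0, total_frames, step))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with a 100-frame video at 25 fps, a two-second time interval selects frames 0 and 50.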
In some embodiments, the pre-trained neural network model is a CNN convolutional neural network model. A picture is input into the pre-trained CNN, and the obtained output is a 512-dimensional embedded vector, which is a representation of the picture information and can reflect information such as the subject, color, and shape. If there are multiple pictures, the multiple 512-dimensional embedded vectors form a vector sequence.
The output here is taken from a fully connected layer before the output layer of the pre-trained CNN. In this embodiment, the pre-trained CNN is not required to output a classification result for the picture; that is, the output layer is not used, and only the output of the fully connected layer needs to be obtained.
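The idea of taking the penultimate fully connected layer's output, rather than the classification result, can be sketched as below. The weights and the 2048-dimensional backbone feature are toy stand-ins (in practice they come from the trained CNN); all shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in weights; a real system loads these from the pre-trained CNN.
W_fc = rng.standard_normal((2048, 512)) * 0.01   # penultimate FC layer
W_out = rng.standard_normal((512, 1000)) * 0.01  # classification head (unused)

def embed_picture(picture):
    """Return the 512-dim embedding: the FC-layer output taken *before*
    the classification output layer, as described above."""
    # Placeholder for the convolutional backbone: pool the picture into a
    # 2048-dim feature (a real CNN computes this with conv layers).
    feat = np.full(2048, picture.mean())
    return np.tanh(feat @ W_fc)                  # 512-dim embedded vector

def embed_pictures(pictures):
    """Multiple pictures -> an (n, 512) vector sequence."""
    return np.stack([embed_picture(p) for p in pictures])
```

Note that `W_out` is defined only to show that the classification head exists but is never applied.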
At block 230, the embedded vector is input into a pre-trained document generation model, generating a document.
In some embodiments, the 512-dimensional embedded vector is input into a pre-trained document generation model, whose output is a document. For example, corresponding to a video or picture of a dress, the output of the document generation model is "XX brand dress is very slim and elegant."
In some embodiments, the document generation model is a seq2seq model comprising a self-attention-based encoder, a self-attention-based decoder, a linear layer, and a softmax layer connected in sequence. The encoder and decoder of the seq2seq model may have a CNN, RNN, or Transformer structure.
In the embodiment of the disclosure, the encoder and the decoder of the seq2seq model have a Transformer structure. The encoder is composed of a plurality of encoding modules, for example 6, each of which is identical in structure and comprises a self-attention layer and a feed-forward layer (a feed-forward neural network). The lowest encoding module receives the vector sequence, passes it to its self-attention layer for processing, then to its feed-forward layer, and passes the output to the next encoding module. The output of the last encoding module serves as the input to the decoder. The self-attention layer maps a Query (query vector) and a set of Key (key vector)-Value (value vector) pairs to an output, where the output is a weighted sum of all the Values, and the weight of each Value is calculated from the Query and the corresponding Key.
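The Query/Key/Value computation described above can be sketched as a single-head scaled dot-product self-attention in a few lines. This is a minimal illustration of the mechanism, not the patent's implementation; the projection matrices are assumed inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a vector sequence X of shape (n, d).

    Each output row is a weighted sum of the Values, with the weight of
    each Value computed from the Query and the corresponding Key.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n) attention weights
    return weights @ V                          # weighted sum of Values
```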
Similar to the encoder, which consists of a plurality of encoding modules, the decoder likewise consists of a corresponding number of decoding modules, for example 6. Each decoding module is identical in structure and comprises a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer (a feed-forward neural network).
In the decoder section, the output of the encoder is used as input data to generate the output sequence. The output of the top encoding module is transformed into a set of attention vectors containing the Key (key vector) and Value (value vector) matrices. These vectors are used by the encoder-decoder attention layer of each decoding module, which helps the decoding module attend to appropriate positions of the input sequence. Each decoding step outputs an element of the output sequence until the output is complete.
The output of the decoder at position i is the probability distribution of the output word at that position, and its input is the output of the encoder together with the decoder output at position i-1; that is, the word output at position i must take into account the words previously output at positions 1 to i-1. On this basis, a cache mechanism is provided in each decoding module: the Key and Value matrices of the words at all previously output positions are cached, and when the output at position i is calculated, the cached Key and Value matrices up to position i-1 are directly retrieved and concatenated, so that they do not need to be recomputed. This shortens computation time and reduces the latency of document generation.
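The caching mechanism above can be sketched as a small key-value cache: at each step, only the new position's Key and Value are computed and concatenated onto the cached matrices. The class name and interface are assumptions for illustration.

```python
import numpy as np

class KVCache:
    """Cache of the Key and Value matrices of previously output positions.

    At decoding step i, only the new position's Key/Value vectors are
    computed; the cached matrices for positions 1..i-1 are reused by
    concatenation instead of being recomputed.
    """
    def __init__(self):
        self.K = None   # cached keys,   shape (i-1, d)
        self.V = None   # cached values, shape (i-1, d)

    def append(self, k_new, v_new):
        """Concatenate the new position's key/value and return both matrices."""
        k_new, v_new = k_new[None, :], v_new[None, :]
        self.K = k_new if self.K is None else np.concatenate([self.K, k_new])
        self.V = v_new if self.V is None else np.concatenate([self.V, v_new])
        return self.K, self.V
```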
In some embodiments, the linear layer is a simple fully connected neural network that projects the vector output by the decoder into a much larger vector, called the logits vector. Assume that the document generation model has learned a total of 10,000 words from the training dataset; the logits vector then also has length 10,000, with each element representing the score of a unique word. The softmax layer converts the scores into probabilities, selects the index with the highest probability, and then looks up the corresponding word by that index as the output.
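The linear-projection and softmax-selection step can be sketched as below, with a tiny vocabulary in place of the 10,000-word one; the function name and the greedy argmax choice are illustrative assumptions.

```python
import numpy as np

def pick_word(decoder_vec, W_linear, vocab):
    """Project the decoder output to a logits vector (one score per
    vocabulary word), convert scores to probabilities with softmax,
    and return the highest-probability word and the full distribution."""
    logits = decoder_vec @ W_linear          # logits vector, len(vocab) scores
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                      # softmax: scores -> probabilities
    idx = int(np.argmax(probs))              # index with the highest probability
    return vocab[idx], probs
```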
In some embodiments, the seq2seq model is trained by:
acquiring a plurality of historical documents, and taking the historical documents and the pictures corresponding to them as training samples;
taking the pictures as the input of a preset CNN convolutional neural network model, and taking the output of the preset CNN model as the input of a preset seq2seq model;
comparing the output of the preset seq2seq model with the corresponding historical documents, and training the preset seq2seq model accordingly.
In some embodiments, the training parameters in the hidden vectors of the encoder and decoder are adjusted using a back-propagation algorithm until the document generation model meets a preset requirement, i.e., the decoder can output an output sequence that better matches the historical documents.
In some embodiments, in order to increase diversity and prevent the document generation model from outputting repeated words, a penalty on the probability of repeated words is added during training, so that the probability of a repeated word in the softmax layer is reduced (for example, from 0.9 to 0.6), thereby improving the accuracy of document generation.
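One simple way to realize the repeated-word penalty is to scale down the softmax probability of already-emitted words and renormalize. The multiplicative form and the 0.5 factor below are illustrative assumptions, not the patent's specified formula.

```python
import numpy as np

def penalize_repeats(probs, emitted_ids, penalty=0.5):
    """Reduce the softmax probability of already-output words.

    `probs` is the softmax distribution over the vocabulary and
    `emitted_ids` the indices of words output so far; penalized
    probabilities are renormalized to sum to 1.
    """
    probs = probs.copy()
    for i in set(emitted_ids):
        probs[i] *= penalty                  # down-weight repeated words
    return probs / probs.sum()               # renormalize
```

For instance, a word that held probability 0.9 drops below 0.9 once it has already been emitted, making a repeat less likely.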
In some embodiments, in the training process, the parameters of the preset CNN convolutional neural network model may be further fine-tuned.
In some embodiments, the preset CNN convolutional neural network model and the preset seq2seq model may be jointly trained, so as to improve the training speed and shorten the training time.
According to the embodiment of the disclosure, the following technical effects are achieved:
the present creative performs a document generation based on video and/or pictures entered by a user. The method and the device can generate the text rapidly and efficiently, improve the diversity and accuracy of the text generation, improve the text manufacturing efficiency and improve the overall experience of a user for obtaining the text for product description.
FIG. 3 shows a flowchart of a document generation method 300 according to an embodiment of the present disclosure. The method 300 may be performed by the server 106 in FIG. 1.
At block 310, a document generation request sent by a user terminal is received, where the document generation request includes a video and/or a picture input by a user and a keyword;
in some embodiments, the document generation request includes a video and/or a picture corresponding to the target item, for example, a video of a dress and/or one or more pictures of a dress.
Wherein the keywords are used for representing at least one of color, category, object recognition, saturation and emotion.
In some embodiments, the keywords may be used to represent an identification or name corresponding to the target item; they may also represent commodity attributes such as specification, price, weight, color, material, use, and style; and they may also indicate the intended buyer of the product, e.g., men, women, students, trend-followers, and the like.
In some embodiments, the target item may refer to an object that needs to be displayed or introduced; for example, the target item may be a commodity offered for sale, a newly produced product, or a tradable virtual object, which is not particularly limited in the embodiments of the present disclosure.
At block 320, information extraction is performed on the video and/or the picture input by the user, so as to obtain a corresponding embedded vector; generating a word vector of the keyword; fusing the embedded vector with the word vector;
the information extraction of the video and/or the picture input by the user is similar to the above embodiment, and will not be described herein.
In some embodiments, generating the word vector of the keyword comprises: mapping the keyword, at word granularity, to vectors in a vector space, where each word corresponds to a 512-dimensional vector.
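The word-granularity mapping can be sketched as an embedding-table lookup. The table, the random initialization of unseen words, and the function name are assumptions for illustration; a trained model would use learned embeddings.

```python
import numpy as np

def keyword_to_vectors(keyword_words, embedding_table, dim=512):
    """Map a keyword, already split at word granularity, to a sequence of
    512-dimensional vectors via an embedding-table lookup."""
    rng = np.random.default_rng(0)
    out = []
    for w in keyword_words:
        if w not in embedding_table:          # unseen word: assumed random init
            embedding_table[w] = rng.standard_normal(dim)
        out.append(embedding_table[w])
    return np.stack(out)                      # shape (w, 512)
```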
In some embodiments, the keywords may be expanded. Since multiple keywords may generate the same or similar documents, different keywords belonging to the selling-point label in the historical documents are extracted by Named Entity Recognition (NER) during training, so as to train the document generation model. During prediction, the keywords are expanded according to their co-occurrence in the historical documents, so as to obtain a plurality of keywords under the selling-point label of the target item in the historical documents, which increases the diversity of the generated documents.
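The co-occurrence-based expansion can be sketched with simple counting: words that co-occur with the input keywords in historical documents' selling-point labels are added. This count-based sketch is an assumption; the real system extracts the selling-point labels with NER.

```python
from collections import Counter

def expand_keywords(seed_keywords, history_keyword_sets, top_n=3):
    """Expand the input keywords with the words that most frequently
    co-occur with them in historical documents' keyword sets."""
    counts = Counter()
    seeds = set(seed_keywords)
    for kw_set in history_keyword_sets:
        if seeds & set(kw_set):               # document shares a seed keyword
            counts.update(set(kw_set) - seeds)
    return list(seeds) + [w for w, _ in counts.most_common(top_n)]
```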
In some embodiments, fusing the embedded vector with the word vector includes:
combining the 512-dimensional embedded vector corresponding to the picture with the w (the number of words) 512-dimensional word vectors of the keyword to form a vector sequence; if there are multiple pictures, the multiple 512-dimensional embedded vectors and the w 512-dimensional word vectors of the keyword together form the vector sequence.
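The fusion step can be sketched as concatenation along the sequence axis: the picture embedding(s) and the keyword word vectors become one vector sequence fed to the seq2seq model. Concatenation is an assumption about the "fusion" described above; other fusion schemes (e.g., summation) are possible.

```python
import numpy as np

def fuse(picture_embeddings, keyword_vectors):
    """Fuse an (n, 512) picture-embedding sequence with a (w, 512)
    keyword word-vector sequence into one (n + w, 512) vector sequence
    by concatenating along the sequence axis."""
    return np.concatenate([picture_embeddings, keyword_vectors], axis=0)
```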
At block 330, the fused vector sequence is input into a pre-trained document generation model, generating a document.
In some embodiments, the fused 512-dimensional vector sequence is input into a pre-trained document generation model, which is output as a document.
In some embodiments, the document generation model is a seq2seq model comprising a self-attention-based encoder, a self-attention-based decoder, a linear layer, and a softmax layer connected in sequence. The encoder and decoder of the seq2seq model may have a CNN, RNN, or Transformer structure. The specific steps are similar to those of the above embodiment and will not be repeated here.
In some embodiments, the seq2seq model is trained by:
acquiring a plurality of historical documents, and taking the historical documents together with the pictures and keywords corresponding to them as training samples;
taking the pictures as the input of a preset CNN convolutional neural network model, fusing the output of the preset CNN model with the word vectors of the keywords, and taking the fused vector sequence as the input of a preset seq2seq model;
comparing the output of the preset seq2seq model with the corresponding historical documents, and training the preset seq2seq model accordingly.
In some embodiments, the training parameters in the hidden vectors of the encoder and decoder are adjusted using a back-propagation algorithm until the document generation model meets a preset requirement, i.e., the decoder can output an output sequence that better matches the historical documents.
In some embodiments, in order to increase diversity and prevent the document generation model from outputting repeated words, a penalty on the probability of repeated words is added during training, so that the probability of a repeated word in the softmax layer is reduced (for example, from 0.9 to 0.6), thereby improving the accuracy of document generation.
In some embodiments, in the training process, the parameters of the preset CNN convolutional neural network model may be further fine-tuned.
In some embodiments, the preset CNN convolutional neural network model and the preset seq2seq model may be jointly trained, so as to improve the training speed and shorten the training time.
According to the embodiment of the disclosure, the following technical effects are achieved:
the creative performs document generation based on video and/or pictures and keywords input by a user. The method and the device can generate the text rapidly and efficiently, improve the diversity and accuracy of the text generation, improve the text manufacturing efficiency and improve the overall experience of a user for obtaining the text for product description.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
FIG. 4 shows a block diagram of a document generation apparatus 400 according to an embodiment of the disclosure. The apparatus 400 may be included in the server 106 of FIG. 1 or implemented as the server 106. As shown in FIG. 4, the apparatus 400 includes: a receiving module 410, configured to receive a document generation request sent by a user terminal, where the document generation request includes a video and/or a picture input by the user; an extraction module 420, configured to extract information from the video and/or the picture input by the user to obtain a corresponding embedded vector; and a document generation module 430, configured to input the embedded vector into a pre-trained document generation model to generate a document.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
FIG. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure. The device 500 may be used to implement the server 106 in FIG. 1. As shown, the device 500 includes a Central Processing Unit (CPU) 501 that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The CPU 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 501 performs the various methods and processes described above, such as methods 200, 300. For example, in some embodiments, the methods 200, 300 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by CPU501, one or more of the steps of the methods 200, 300 described above may be performed. Alternatively, in other embodiments, CPU501 may be configured to perform methods 200, 300 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.