WO2024046189A1 - Text Generation Method and Apparatus - Google Patents

Text Generation Method and Apparatus

Info

Publication number: WO2024046189A1
Authority: WO - WIPO (PCT)
Prior art keywords: sample, text, target, data, attribute information
Application number: PCT/CN2023/114514
Inventors: 赵中州, 宋雪萌, 聂礼强, 井立强, 刘萌, 关惟俐, 周伟, 陈海青
Original Assignee: 阿里巴巴(中国)有限公司
Application filed by 阿里巴巴(中国)有限公司
Publication of WO2024046189A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0623Item investigation
    • G06Q30/0625Directed, with specific intent or strategy
    • G06Q30/0627Directed, with specific intent or strategy using item specifications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting

Definitions

  • the embodiments of this specification relate to the field of computer technology, and in particular, to a text generation method.
  • One or more embodiments of this specification simultaneously relate to a text generation device, a computing device, and a computer-readable storage medium.
  • embodiments of this specification provide a text generation method.
  • One or more embodiments of this specification simultaneously relate to a text generation device, a computing device, a computer-readable storage medium, and a computer program to solve technical deficiencies existing in the prior art.
  • a text generation method, including: obtaining graphic and text data of a target object, where the graphic and text data includes image data and text data; identifying visual attribute information of the target object based on the image data, where the visual attribute information represents the explicit characteristics of the target object; determining an object attribute set of the target object based on the text data and the visual attribute information; and generating the target description text of the target object based on the object attribute set.
  • a text generation device including:
  • the acquisition module is configured to obtain graphic and text data of the target object, where the graphic and text data includes image data and text data;
  • the identification module is configured to identify the visual attribute information of the target object based on the image data, where the visual attribute information represents the explicit characteristics of the target object;
  • a determining module configured to determine an object attribute set of the target object based on the text data and the visual attribute information
  • the generation module is configured to generate the target description text of the target object based on the object attribute set.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions; when the computer-executable instructions are executed by the processor, the steps of the above text generation method are implemented.
  • a computer-readable storage medium which stores computer-executable instructions. When the instructions are executed by a processor, the steps of the above text generation method are implemented.
  • a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above text generation method.
  • the text generation method obtains graphic and text data of a target object, where the graphic and text data includes image data and text data; based on the image data, the visual attribute information of the target object is identified, where the visual attribute information represents the explicit features of the target object; the object attribute set of the target object is determined based on the text data and visual attribute information; and the target description text of the target object is generated based on the object attribute set.
  • By identifying the visual attribute information of the target object, the explicit characteristics of the target object are taken into account, making the object attributes of the target object more comprehensive; determining the object attribute set based on the text data and visual attribute information integrates the text data and visual attribute information of the target object, making the generated target description text more coherent and further improving the accuracy of the target description text.
  • Figure 1 is a framework diagram of a text generation system provided by an embodiment of this specification
  • Figure 2 is a framework diagram of another text generation system provided by an embodiment of this specification.
  • Figure 3 is a flow chart of a text generation method provided by an embodiment of this specification.
  • Figure 4 is a training flow chart of a text processing model in a text generation method provided by an embodiment of this specification
  • Figure 5 is a training flow chart of an image classification model in a text generation method provided by an embodiment of this specification
  • Figure 6 is a process flow chart of a text generation method provided by an embodiment of this specification.
  • Figure 7 is a schematic diagram of a target product details page in a text generation method provided by an embodiment of this specification.
  • Figure 8 is a schematic diagram of a display interface of a client in a text generation method provided by an embodiment of this specification
  • Figure 9 is a schematic structural diagram of a text generation device provided by an embodiment of this specification.
  • Figure 10 is a structural block diagram of a computing device provided by an embodiment of this specification.
  • Although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, the first may also be called the second, and similarly, the second may also be called the first.
  • the word "if” as used herein may be interpreted as "when” or “when” or “in response to determining.”
  • Modality: the form in which data exists, such as natural language, pictures, and so on.
  • Product summary: based on product information, such as the product description and appearance, a short text summary containing the salient information of the product is generated.
  • Natural language generation: giving computers the same expression and writing capabilities as humans; that is, automatically generating high-quality natural language text through a planning process, based on some key information and its internal machine representation.
  • BART (Bidirectional and Auto-Regressive Transformers): a model that combines contextual information and autoregressive characteristics; it takes natural language as input and generates natural language as output.
  • ASR (Automatic Speech Recognition): a technology that converts speech into text.
  • Part-of-speech tagging: a technology that tags the part of speech of each word in a sentence.
  • a text generation method is provided.
  • This specification also relates to a text generation device, a computing device, and a computer-readable storage medium, which will be described in detail one by one in the following embodiments.
  • this solution provides a way to generate description text based on multi-modal data: given the multi-modal graphic and text data of a target object, end-to-end automated generation produces description text that accurately summarizes the target object and highlights its characteristics and advantages.
  • the text generation method obtains the graphic and text data of the target object, where the graphic and text data includes image data and text data; based on the image data, the visual attribute information of the target object is identified, where the visual attribute information represents the explicit characteristics of the target object; based on the text data and visual attribute information, the object attribute set of the target object is determined; and based on the object attribute set, the target description text of the target object is generated.
  • By identifying the visual attribute information of the target object, the explicit characteristics of the target object are taken into account, making the object attributes of the target object more comprehensive; determining the object attribute set based on the text data and visual attribute information integrates the text data and visual attribute information of the target object, making the generated target description text more coherent and further improving the accuracy of the target description text.
  • Figure 1 shows a framework diagram of a text generation system provided by an embodiment of this specification, where the text generation system includes a server and a client:
  • Client: sends the graphic and text data of the target object to the server, where the graphic and text data includes image data and text data.
  • Server: obtains the graphic and text data of the target object; identifies the visual attribute information of the target object based on the image data, where the visual attribute information represents the explicit characteristics of the target object; determines the object attribute set of the target object based on the text data and visual attribute information; generates the target description text of the target object based on the object attribute set; and sends the target description text to the client, so that the client displays the target description text.
  • Client: receives and displays the target description text sent by the server, so that the user can introduce the target object based on the target description text.
  • the text generation method provided in the embodiments of this specification is generally executed by the server.
  • the client can also have similar functions to the server to execute the embodiments of this specification.
  • the text generation method provided in the embodiments of this specification may also be jointly executed by the client and the server.
  • the graphic and text data includes image data and text data; the visual attribute information of the target object is identified based on the image data, where the visual attribute information represents the explicit characteristics of the target object; the object attribute set of the target object is determined based on the text data and visual attribute information; and the target description text of the target object is generated based on the object attribute set.
  • By identifying the visual attribute information of the target object, the explicit characteristics of the target object are taken into account, making the object attributes of the target object more comprehensive; determining the object attribute set based on the text data and visual attribute information integrates the text data and visual attribute information of the target object, making the generated target description text more coherent and further improving the accuracy of the target description text.
  • Figure 2 shows a framework diagram of another text generation system provided by an embodiment of this specification.
  • the system may include a server 100 and multiple clients 200. Communication connections can be established between multiple clients 200 through the server 100.
  • the server 100 is used to provide text generation services between multiple clients 200.
  • each of the multiple clients 200 can serve as a sending end or a receiving end, and real-time communication is realized through the server 100.
  • the user can interact with the server 100 through the client 200 to receive data sent by other clients 200, or send data to other clients 200, etc.
  • the user can publish a data stream to the server 100 through the client 200, and the server 100 pushes the data stream to the client that subscribes to the data stream.
  • the data stream may be graphic data, for example.
  • users can collect the graphic and text data of target products in real time through the client and send the graphic and text data to the server; the server can generate the corresponding product description text based on the graphic and text data sent by the client and push the product description text to all live broadcast rooms that include the product, so that the anchor can introduce the target product based on the product description text.
  • participating users can collect image and text data in real time through the client and send the data to the server.
  • the server can process the image and text data sent by the client, generate summary text, and push the summary text to the clients of other participating users.
  • a connection is established between the client 200 and the server 100 through a network.
  • the network provides the medium for communication links between clients and servers.
  • Networks can include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the data transmitted by the client 200 may need to be encoded, transcoded, compressed, etc. before being released to the server 100.
  • the client 200 can be a browser, an APP (Application), a web application such as an H5 (HyperText Markup Language 5) application, a light application (also known as a mini program, a lightweight application), or a cloud application, etc. The client 200 can be developed based on the software development kit (SDK) of the corresponding service provided by the server, such as a real-time communication (RTC) SDK.
  • the electronic device may have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer, a personal computer, etc.
  • Various other types of applications can usually be installed on electronic devices, such as human-computer conversation applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc.
  • the server 100 may include servers that provide various services, such as servers that provide communication services for multiple clients, servers that provide background training support for models used on clients, and servers that process data for models used on clients, etc.
  • the server 100 can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
  • the server can also be a distributed system server or a server combined with a blockchain.
  • Servers can also be cloud servers that provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data, and artificial intelligence platforms, or intelligent cloud computing servers or intelligent cloud hosts with artificial intelligence technology.
  • Figure 3 shows a flow chart of a text generation method provided by an embodiment of this specification, which specifically includes the following steps:
  • Step 302 Obtain graphic and text data of the target object, where the graphic and text data includes image data and text data.
  • description forms for target objects are becoming more and more abundant.
  • the description of a product includes a title, a detailed text description, a product display image, etc.
  • multi-modal data of the target object can be obtained.
  • the multi-modal data can include image data and text data, and the target description text of the target object is further generated based on the multi-modal image and text data.
  • the target object refers to the object for which the target description text needs to be generated, and can also be understood as the object waiting for the target description text to be generated, including but not limited to commodities, people, scenery, places of interest, etc.
  • the graphic data of the target object refers to image data and text data including information related to the target object.
  • the image data can be pictures, photos, design drawings, etc. of the target object, and the text data can be the name, structural attributes, detailed information, process information, etc. of the target object.
  • the image and text data of the target object can be obtained after receiving a text generation instruction.
  • In one possible way, the text generation instruction carries the graphic and text data covering the target object information input by the user; in another possible way, the text generation instruction includes the unique identifier of the target object, and the target object is determined according to the unique identifier so that its graphic and text data can be further obtained.
  • Taking the target object as a target product as an example, the details page of the target product can completely cover the information of the target product. Therefore, after the text generation instruction is received, the graphic and text data of the target object can be obtained from the details page of the target product based on the unique identifier of the target object carried in the instruction.
  • In practical applications, since the graphic and text data of the target object usually changes, the graphic and text data of the target object can be monitored; when the graphic and text data changes, the graphic and text data of the target object is obtained in real time and the target description text of the target object is generated, so that when the user needs the target description text, it can be queried immediately. That is to say, the above-mentioned step of obtaining the graphic and text data of the target object may include the following steps:
  • the update of graphic and text data includes addition, deletion, replacement, modification, and the like; when any of these occurs in the graphic and text data of the target object, it can be considered that the graphic and text data of the target object has been updated.
  • In practical applications, an offline timing method may also be used to generate the target description text of the target object. The offline timing method updates the target description text of the target object at a specified time: the current graphic and text data of the target object is compared with the graphic and text data at the last update; if the graphic and text data has changed, a scheduled task is triggered to obtain the graphic and text data of the target object and generate the target description text based on it (as sketched below); if the graphic and text data has not changed, the description text of the target object is not updated.
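  • The following is a minimal Python sketch of the offline-timing update check described above; fetch_item_data and generate_description are hypothetical placeholders for the real data source and generation service, and the hash-based comparison is one assumed way to detect changes:

        import hashlib
        import json

        _last_hashes = {}  # item id -> fingerprint of the data at the last update

        def data_fingerprint(graphic_text_data: dict) -> str:
            """Stable hash of an item's graphic and text data."""
            payload = json.dumps(graphic_text_data, sort_keys=True, ensure_ascii=False)
            return hashlib.sha256(payload.encode("utf-8")).hexdigest()

        def scheduled_update(item_id, fetch_item_data, generate_description):
            data = fetch_item_data(item_id)           # current graphic and text data
            fp = data_fingerprint(data)
            if _last_hashes.get(item_id) == fp:
                return None                           # unchanged: keep the old description
            _last_hashes[item_id] = fp
            return generate_description(data)         # changed: regenerate the description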
  • Step 304 Based on the image data, identify the visual attribute information of the target object, where the visual attribute information represents the explicit characteristics of the target object.
  • the visual attribute information of the target object can be further identified based on the image data included in the graphic and text data.
  • By identifying the visual attribute information, the image data is in effect converted into text data, which unifies the multi-modal data of the target object and reduces the modal heterogeneity between the modalities.
  • visual attribute information represents the explicit characteristics of the target object.
  • the explicit characteristics refer to the observable characteristics of the target object, which can be noun characteristics such as the color and shape of the target object, or adjective characteristics such as pretty, beautiful, or elegant; the specific selection is made according to the actual situation, and the embodiments of this specification do not limit this in any way.
  • the image data may include text of the target object; in this case, OCR (optical character recognition) technology can be used to extract the text from the image data as visual attribute information. Image color recognition tools can also be used to obtain visual attribute information from the image data, as sketched below.
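  • As a hedged illustration only, the sketch below combines an OCR pass with a crude dominant-color estimate, assuming the pytesseract and Pillow packages; the language pack and the color heuristic are assumptions, not the disclosure's actual tooling:

        from PIL import Image
        import pytesseract

        def extract_visual_cues(image_path: str) -> dict:
            image = Image.open(image_path).convert("RGB")
            # OCR any text printed on the product image (chi_sim pack needed for Chinese)
            ocr_text = pytesseract.image_to_string(image, lang="chi_sim+eng")
            # Crude dominant-color estimate: most frequent pixel after downscaling
            small = image.resize((64, 64))
            colors = small.getcolors(64 * 64)          # list of (count, (r, g, b))
            dominant_rgb = max(colors)[1]
            return {"ocr_text": ocr_text.strip(), "dominant_rgb": dominant_rgb}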
  • In order to identify the visual attribute information more accurately, a pre-trained image classification model can be used to identify the visual attribute information of the target object. That is, the above-mentioned step of identifying the visual attribute information of the target object based on the image data may include the following steps:
  • the pre-trained image classification model is a model generated by training a preset classification model.
  • the preset classification model refers to a model that can perform classification, such as the Swin Transformer model, a residual neural network (ResNet, Residual Network), or a Vision Transformer (ViT); the specific choice is made according to the actual situation, and the embodiments of this specification do not limit this in any way.
  • Taking ViT as an example, the image data is input into the image classification model. The image data is first divided into patches; for example, an image may be divided into 9 patches, and the size of each patch can be specified, such as 16×16.
  • Each patch is then input into the embedding layer; after passing through this layer, a series of vectors (tokens) is obtained, one for each of the 9 patches. A classification vector (class token) is prepended before all patch vectors, and its dimension is consistent with the other 9 vectors. In addition, position information is added to each vector.
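  • A self-contained PyTorch sketch of this ViT-style front end follows; the 48×48 input size is chosen only so that 16×16 patches yield the 9 patches of the example, and the embedding width is an arbitrary assumption:

        import torch
        import torch.nn as nn

        class PatchEmbedding(nn.Module):
            def __init__(self, img_size=48, patch_size=16, in_chans=3, dim=192):
                super().__init__()
                self.num_patches = (img_size // patch_size) ** 2   # 9 patches for 48x48
                # A strided convolution embeds every patch in one shot
                self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
                self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # classification vector
                self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

            def forward(self, x):                                  # x: (B, 3, 48, 48)
                tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 9, dim), one token per patch
                cls = self.cls_token.expand(x.shape[0], -1, -1)    # same dimension as patch tokens
                tokens = torch.cat([cls, tokens], dim=1)           # prepend the class token
                return tokens + self.pos_embed                     # add position information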
  • the image data is input into the pre-trained image classification model, and the visual attribute information of the target object is obtained through classification and recognition by the image classification model, which improves the efficiency and accuracy of obtaining the visual attribute information of the target object and further makes the subsequently generated target description text more accurate.
  • the visual attribute information and text data of the target object can be compared, and the text data of the target object can be modified according to the comparison result.
  • the text data of the target object is "Red clothes make women look younger”
  • the visual attribute information of the target object is obtained as “Rose red looks whiter”
  • the text data and the visual attribute information are compared, and the "red clothes look younger” in the target object's text data are obtained.
  • “Red” is replaced with “Rose Red”
  • the modified text data obtained is "Rose red clothes make women look younger”.
  • Step 306 Determine the object attribute set of the target object based on the text data and visual attribute information.
  • the object attribute set of the target object can be further determined based on the text data and visual attribute information.
  • the object attributes of the target object are enriched, making the generated target description text more coherent and accurate.
  • the object attribute set refers to a set composed of object attribute information of multiple target objects.
  • the object attribute information includes text data and visual attribute information of the target object.
  • the object attribute information can be understood as text information that completely describes the attributes of the target object.
  • text data and visual attribute information can be merged and spliced to determine the object attribute set of the target object.
  • the text data of the target object is "orange cat sofa pillow" and the visual attribute information is "orange high-end”.
  • the objects included in the object attribute set of the target object can be determined.
  • the content is "Orange cat sofa cushion with orange high-end feel”.
  • the determined object attribute set is " Orange cat sofa cushions have a high-end feel.”
  • the above steps of determining the object attribute set of the target object based on text data and visual attribute information may include the following steps:
  • the title of the product usually includes the brand name of the product, etc.
  • the product introduction usually includes the origin, function, etc. of the product
  • the product parameters of the product usually include the size, material, item number, etc. of the product.
  • the specific selection is based on the actual situation; the embodiments of this specification do not impose any restrictions on this.
  • For example, taking the target product as a pillow, the title of the target product is "Big Bear Cushion Plush Giant Backrest Bedside Cushion Birthday Gift", the introduction of the target product is "Panda-shaped pillow, cute and childlike, feels soft, a good companion for swiping your phone and reading", and the product parameters of the target product are "item number: 00001, material: other, size: 70cm*90cm".
  • the text data includes at least one of the title, introduction, and product parameters of the target product, which enriches the object attributes of the target product and makes the generated product description text more coherent and accurate.
  • Step 308 Based on the object attribute set, generate the target description text of the target object.
  • At this point, the graphic and text data of the target object has been obtained, the visual attribute information of the target object has been identified based on the image data, and the object attribute set of the target object has been determined based on the text data and visual attribute information; the target description text of the target object can now be generated based on the object attribute set.
  • target description text refers to text that can describe the target object concisely and accurately.
  • description text can also be understood as summary text, script, synopsis, abstract, or summary script.
  • the target description text of the target product is the product description text.
  • the above steps of generating the target description text of the target object based on the object attribute set may include the following steps:
  • based on the product attribute set, the target description text of the target product is generated.
  • the graphic and text data includes image data and text data; the visual attribute information of the target object is identified based on the image data, where the visual attribute information represents the explicit characteristics of the target object; the object attribute set of the target object is determined based on the text data and visual attribute information; and the target description text of the target object is generated based on the object attribute set.
  • By identifying the visual attribute information of the target object, the explicit characteristics of the target object are taken into account, making the object attributes of the target object more comprehensive; determining the object attribute set based on the text data and visual attribute information integrates the text data and visual attribute information of the target object, making the generated target description text more coherent and further improving the accuracy of the target description text.
  • the text content in the object attribute set can be word segmented, and a preset description text generation template can be used to process each word obtained by word segmentation to generate a target description text of the target object.
  • the method of word segmentation processing may be to use a word segmentation tool to perform word segmentation processing, or to use a preset word list to match to obtain the word segmentation results. The specific selection is based on the actual situation. The embodiments of this specification do not limit this in any way.
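  • As one possible concrete form of the template-based approach, the sketch below segments the attribute text with the jieba word segmenter and fills a hypothetical lead-in template; both the tool choice and the template wording are assumptions:

        import jieba

        def template_description(attribute_set: str) -> str:
            # Segment the attribute text and drop pure punctuation/whitespace tokens
            words = [w for w in jieba.lcut(attribute_set) if w.strip(", ")]
            unique_words = list(dict.fromkeys(words))   # order-preserving de-duplication
            return "This product features: " + ", ".join(unique_words) + "."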
  • a pre-trained text processing model can be used to generate the target description text. That is, the above-mentioned steps of generating the target description text of the target object based on the object attribute set may include the following steps:
  • the object attribute set is input into the pre-trained text processing model, and the text processing model generates target description text of the target object.
  • the pre-trained text processing model is a model generated by training a preset processing model.
  • the preset processing model refers to a model that can implement text processing, such as the Bidirectional and Auto-Regressive Transformers model (BART), which combines contextual information and autoregressive characteristics, the Text-to-Text Transfer Transformer (T5), or the Generative Pre-Training model (GPT); the choice is made according to the actual situation, and the embodiments of this specification do not impose any restrictions on this.
  • BART is an encoder-decoder (Encoder-Decoder) structure.
  • the input to the Encoder is a sequence with noise added, and the input to the Decoder is a sequence with a start symbol (right-shifted) added.
  • the target of the Decoder side is the original sequence.
  • the object attribute set is input into the pre-trained text processing model, and the target description text of the target object is generated through the text processing model, which improves the efficiency of obtaining the target description text and the accuracy of the generated target description text.
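  • For illustration, a BART-style model can be driven as below via the Hugging Face transformers package; the facebook/bart-base checkpoint and the decoding settings are placeholders, not the model actually trained in this disclosure:

        from transformers import BartForConditionalGeneration, BartTokenizer

        tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

        attribute_set = "large-capacity coffee cup, with spoon, 500ml, white, warm brown"
        inputs = tokenizer(attribute_set, return_tensors="pt")
        output_ids = model.generate(**inputs, max_length=64, num_beams=4)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))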
  • the target description text can be displayed directly on the client.
  • the target description text can also be stored in a preset database.
  • the target description text can also be stored in a preset database, and when the object currently displayed by the client is the target object, the target description text is called from the preset database. That is, after the target description text of the target object is generated based on the object attribute set, the method may further include the following steps:
  • when the object currently displayed by the client is the target object, the preset database is searched to determine whether pre-generated target description text exists. If it exists, the target description text is directly called from the preset database and displayed on the client; if no target description text exists in the preset database, the text generation method provided by the embodiments of this specification can be used to generate the target description text in real time, and the generated target description text is displayed on the client.
  • the client displays the target description text
  • the user can introduce the target object according to the target description text.
  • the text-audio conversion tool can also be used to convert the target description text into audio, and generate the audio data corresponding to the target description text. After the audio data is generated, the audio data is actively played to introduce the target object.
  • Calling the target description text from the preset database saves the user time in obtaining the target description text and improves the user experience; when the client displays the target description text, the user does not need to study the target object carefully and can introduce the target object directly based on the target description text; generating and playing the audio data corresponding to the target description text requires no introduction by the user at all, saving considerable labor costs. A text-to-speech sketch follows.
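  • A hedged sketch of the text-to-audio step, using pyttsx3 as one possible offline text-to-speech tool; any TTS service could be substituted:

        import pyttsx3

        def play_description(description_text: str) -> None:
            engine = pyttsx3.init()
            engine.say(description_text)   # queue the description text for playback
            engine.runAndWait()            # block until the audio finishes playing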
  • the training method of the text processing model may include the following steps:
  • a first sample set is obtained, where the first sample set includes multiple sample objects and each sample object carries sample text data and sample description text;
  • each sample description text is identified to determine the sample visual attribute information of each sample object; data augmentation is performed on each sample text data to determine the augmented text data of each sample object; and based on the sample visual attribute information, sample text data, and augmented text data of the multiple sample objects, a preset processing model is trained to obtain the text processing model.
  • sample objects are used to train text processing models, and sample objects include but are not limited to commodities, people, scenery, places of interest, and so on.
  • the sample text data carried by the sample object is text data describing the sample object, such as the name of the sample object, unique attributes, detailed information, process information, etc.
  • the sample description text is the description text corresponding to the sample object.
  • the sample description text can also be understood as sample summary text, sample script, sample summary, sample content summary, and sample summary script.
  • the method of obtaining the first sample set can be to manually input a large amount of sample text data and sample description text to form the first sample set, or to read a large amount of sample text data and sample description text from other data acquisition devices or databases to constitute the first sample set, selected according to the actual situation.
  • the embodiments of this specification do not limit this in any way.
  • the way to identify each sample description text and determine the sample visual attribute information of each sample object can be to perform word segmentation processing on each sample description text, and match each word segmentation result with a preset visual attribute vocabulary to obtain Sample visual attribute information of each sample object; you can also directly tag the sample description text with part-of-speech tags, retain the obtained nouns and adjectives, and determine the sample visual attribute information.
  • the sample text data of the sample object is "This dress is so beautiful.”
  • the augmented text data is obtained as "this "The dress is so beautiful”, “This dress is so beautiful”, “This dress is awesome”, etc., where the augmented text data can be one or multiple, and the selection is made based on the actual situation.
  • the embodiments of this specification are This is without any limitation.
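  • Purely as an illustration of data augmentation, the sketch below performs synonym substitution over a small hand-written table; a real system might use a larger lexicon or back-translation instead:

        SYNONYMS = {"beautiful": ["pretty", "gorgeous"], "dress": ["gown"]}

        def augment(sentence: str) -> list:
            variants = []
            for word, substitutes in SYNONYMS.items():
                if word in sentence:
                    variants += [sentence.replace(word, s) for s in substitutes]
            return variants

        print(augment("This dress is so beautiful"))
        # -> ['This dress is so pretty', 'This dress is so gorgeous', 'This gown is so beautiful']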
  • In summary: a first sample set is obtained, where each sample object carries sample text data and sample description text; each sample description text is identified, and the sample visual attribute information of each sample object is determined; data augmentation is performed on each sample text data, and the augmented text data of each sample object is determined; and the preset processing model is trained to obtain the text processing model. This takes into account the explicit characteristics of the sample objects, making the object attributes of the sample objects more comprehensive, and expands the sample text data of the sample objects, making the sample text data more diverse, which further gives the trained model stronger generalization ability and improves the accuracy of the trained model.
  • In practical applications, the sample text data and sample description text can be obtained from the live broadcast rooms and product details pages of sample products to further construct the first sample set. That is, the above-mentioned step of obtaining the first sample set may include the following steps:
  • a first sample set is constructed based on the sample text data and sample description text of multiple sample commodities.
  • the sample text data of each sample product can be extracted from the details page of the sample product.
  • the method of extracting the sample text data includes but is not limited to OCR technology.
  • the live broadcast data of the sample products can also be collected from the live broadcast rooms of the sample products. These live broadcast data include video data and voice data.
  • ASR technology is used to identify and convert the live broadcast data to generate sample description texts for each sample product.
  • the first sample set can be constructed, where the sample description text can be understood as a sample label carried by the sample object, and the sample label represents the result that is actually desired to be output by the preset processing model.
  • the live broadcast data includes video data and voice data. The live broadcast data is recognized and converted to generate the sample description text of each sample product, and the sample text data of each sample product is extracted from the detail pages of the multiple sample products. Constructing the first sample set based on the sample text data and sample description text of the multiple sample products enriches the first sample set, so that the sample text data in the sample set are contextually and semantically coherent, further improving the accuracy of the trained model.
  • The preset processing model can be trained using initial training samples and augmented training samples. That is, the above-mentioned step of training the preset processing model based on the sample visual attribute information, sample text data, and augmented text data of multiple sample objects to obtain the text processing model may include the following steps:
  • the sample text data and sample visual attribute information of each sample object are merged to determine the initial training sample of each sample object, and the augmented text data and sample visual attribute information of each sample object are merged to determine the augmented training sample of each sample object. The merging method can be direct text splicing, or splicing after the text data has been deduplicated.
  • Then, the preset processing model can be trained based on the initial training samples and augmented training samples. That is, the above-mentioned step of using the initial training samples, augmented training samples, and sample description texts of multiple sample objects to train the preset processing model and obtain the text processing model may include the following steps:
  • a first loss value, a second loss value, and a third loss value are calculated; based on the first loss value, the second loss value, and the third loss value, the model parameters of the preset processing model are adjusted, and the process returns to the step of extracting the first initial training sample and the first augmented training sample of the first sample object;
  • the first sample description text refers to the result that is actually intended to be output by the preset processing model, that is, the first sample description text is the real result.
  • when the first initial training sample is input into the preset processing model, the first prediction description text is generated; when the first augmented training sample is input into the preset processing model, the second prediction description text is generated. Both are prediction results of the preset processing model. If the difference between the prediction results and the real result is small enough, that is, the first loss value and the second loss value are small enough, the prediction results are close enough to the real result. To improve the anti-noise ability of the preset processing model, the third loss value can additionally be calculated based on the first predicted description text and the second predicted description text.
  • the model parameters of the preset processing model can be adjusted based on the first loss value, the second loss value, and the third loss value, and the process returns to the step of extracting the first initial training sample and the first augmented training sample of the first sample object; when the first training stop condition is reached, the text processing model that has completed training is obtained.
  • In practical applications, the cross-entropy loss function can be used to calculate the first loss value and the second loss value, and the relative entropy loss function (KLD, Kullback-Leibler Divergence) can be used to calculate the third loss value.
  • the first training stop condition includes but is not limited to reaching a first preset threshold or a first preset number of iterations; it is specifically selected according to the actual situation, and the embodiments of this specification do not limit this in any way. A sketch of the three-part loss follows.
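  • A PyTorch sketch of this three-part loss: cross entropy for both predictions against the sample label and a KL term between the two predictions; the symmetric form and the weighting alpha are assumptions, not values fixed by the disclosure:

        import torch.nn.functional as F

        def combined_loss(logits_init, logits_aug, labels, alpha=1.0):
            # logits_*: (B, T, V) decoder outputs; labels: (B, T) target token ids
            v = logits_init.size(-1)
            loss1 = F.cross_entropy(logits_init.reshape(-1, v), labels.reshape(-1))  # first loss
            loss2 = F.cross_entropy(logits_aug.reshape(-1, v), labels.reshape(-1))   # second loss
            # Third loss: keep the two predictions consistent with each other (anti-noise)
            p = F.log_softmax(logits_init, dim=-1)
            q = F.log_softmax(logits_aug, dim=-1)
            loss3 = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                           + F.kl_div(q, p, log_target=True, reduction="batchmean"))
            return loss1 + loss2 + alpha * loss3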
  • In practical applications, the preset processing model includes an encoder; before the steps of inputting the first initial training sample into the preset processing model to generate the first prediction description text and inputting the first augmented training sample into the preset processing model to generate the second prediction description text, the following steps may also be included: input the first initial training sample into the encoder to generate a first feature vector; input the first sample description text into the encoder to generate a second feature vector; calculate a coding loss value based on the two feature vectors and adjust the encoder accordingly; and when the second training stop condition is reached, determine the encoder that has completed training.
  • the coding loss value can be calculated using the following formula (1):
  • the second training stop condition includes but is not limited to the second preset threshold and the second preset number of iterations, which are selected according to the actual situation.
  • the embodiments of this specification do not limit this in any way.
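  • The disclosure's formula (1) is not reproduced in this text; purely as a loudly hedged stand-in, the sketch below aligns the two encoder outputs with a cosine-similarity loss, which is only one plausible form such a coding loss could take:

        import torch
        import torch.nn.functional as F

        def coding_loss(first_feat, second_feat):
            # first_feat: encoder output for the merged training sample
            # second_feat: encoder output for the sample description text
            target = torch.ones(first_feat.size(0), device=first_feat.device)  # pull pairs together
            return F.cosine_embedding_loss(first_feat, second_feat, target)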
  • Figure 4 shows a training flow chart of a text processing model in a text generation method provided by an embodiment of this specification, which specifically includes:
  • obtain a first sample set, where each sample object carries sample text data and sample description text; identify each sample description text and determine the sample visual attribute information of each sample object; perform data augmentation on each sample text data and determine the augmented text data of each sample object; merge the sample text data and sample visual attribute information of each sample object, and pass the merged result through the encoder and decoder of the preset processing model to generate the first predicted description text; merge the augmented text data and sample visual attribute information of each sample object, and pass the merged result through the encoder and decoder of the preset processing model to generate the second predicted description text;
  • the first loss value is calculated based on the first predicted description text and the sample description text;
  • the second loss value is calculated according to the second prediction description text and the sample description text;
  • the third loss value is calculated according to the first prediction description text and the second prediction description text; based on the first loss value, the second loss value and the third loss value, Adjust the model parameters of the preset processing model, and obtain the text processing model that has completed training when the first training stop condition is reached
  • the preset processing model includes an encoder and a decoder.
  • the combined sample text data and sample visual attribute information of each sample object are input into the encoder to generate a first feature vector; the sample description text of each sample object is input into the encoder to generate a second feature vector.
  • the training method of the image classification model may include the following steps:
  • each sample object carries sample image data and sample description text
  • the specific method of obtaining the second sample set, identifying the description text of each sample, and determining the sample visual attribute information of each sample object can refer to the above text processing model training method, and will not be described in detail in the embodiments of this specification. Determining the sample visual attribute information of each sample object takes into account the explicit characteristics of the sample object, making the object attributes of the sample object more comprehensive and improving the accuracy of the trained model.
  • the step of using sample image data and sample visual attribute information of multiple sample objects to train a preset classification model and obtain an image classification model may include the following steps:
  • according to the classification loss value, adjust the model parameters of the preset classification model, and return to the step of extracting the second sample image data and the second sample visual attribute information of the second sample object;
  • when the third training stop condition is reached, the image classification model that has completed training is obtained.
  • the predicted visual attribute information of the second sample object and the second sample visual attribute information can be used to calculate the classification loss value. The second sample visual attribute information represents the actual desired output of the preset classification model, that is, the real result, while the output predicted visual attribute information is the prediction result of the preset classification model. When the difference between the prediction result and the real result is small enough, that is, when the classification loss value is small enough, the prediction result is close enough to the real result; at this point, the training of the preset classification model is completed, and the trained image classification model is obtained.
  • the difference between the predicted results of the preset classification model and the real results can be intuitively shown by calculating the classification loss value.
  • Based on the difference, the parameters of the preset classification model can be adjusted, which can effectively improve the speed and effect of preset classification model training.
  • the third training stop condition includes but is not limited to the third preset threshold and the third preset number of iterations, which are selected according to the actual situation.
  • the embodiments of this specification do not limit this in any way.
  • the third preset threshold is the critical value of the classification loss value.
  • the classification loss value is greater than the third preset threshold, it means that there is still a certain deviation between the prediction results of the preset classification model and the real results, and it still needs to be adjusted.
  • the number of iterations can also be combined to determine whether the current preset classification model has completed training. Specifically, if the classification loss value is less than or equal to the third preset threshold, the difference between the second sample visual attribute information and the predicted visual attribute information of the second sample object is small, training can be stopped without considering the number of iterations, and the trained image classification model is obtained. If the classification loss value is greater than the third preset threshold, it is judged whether the number of iterations has reached the third preset number of iterations; if not, the preset classification model continues to be trained until the third preset number of iterations is reached, at which point iteration stops and the trained image classification model is obtained.
  • the values of the third preset threshold and the third preset number of iterations are specifically selected according to the actual situation, and the embodiments of this specification do not limit this in any way.
  • When the number of iterations reaches the third preset number of iterations, the preset classification model has been trained enough times, the prediction results of the preset classification model are considered close enough to the real results, and training can be stopped. A schematic training loop follows.
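  • A schematic PyTorch training loop realizing these stop conditions; the threshold and iteration cap stand in for the third preset threshold and third preset number of iterations, and a real loop would cycle over the data for multiple epochs:

        def train_classifier(model, loader, optimizer, loss_fn,
                             threshold=0.05, max_iters=10_000):
            iteration = 0
            for images, attr_labels in loader:
                logits = model(images)
                loss = loss_fn(logits, attr_labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                iteration += 1
                # Stop once the loss is small enough or the iteration cap is reached
                if loss.item() <= threshold or iteration >= max_iters:
                    break
            return model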
  • the classification loss value can be calculated using loss functions such as the cross-entropy loss function, L1 norm loss function, maximum loss function, mean square error loss function, or logarithmic loss function; the specific choice is based on the actual situation, and the embodiments of this specification do not limit this in any way.
  • the specific training situation of the preset classification model can be determined based on the classification loss value; if training is not yet satisfactory, the model parameters of the preset classification model can be adjusted in reverse based on the classification loss value to improve the model.
  • Figure 5 shows a training flow chart of an image classification model in a text generation method provided by an embodiment of this specification, which specifically includes:
  • obtain a second sample set, where each sample object carries sample image data and sample description text; identify each sample description text and determine the sample visual attribute information of each sample object; input the sample image data of each sample object into the preset classification model to obtain predicted visual attribute information, and calculate the classification loss value against the sample visual attribute information; when the third training stop condition is reached, the image classification model that has completed training is obtained.
  • Figure 6 shows a process flow chart of a text generation method provided by an embodiment of this specification, which specifically includes the following steps:
  • Step 602 Obtain the detail page data of the target product, where the detail page data includes image data and text data, and the text data includes at least one of the title, introduction, and product parameters of the target product.
  • Figure 7 shows a schematic diagram of a target product details page in a text generation method provided by an embodiment of this specification.
  • the target product details page includes image data of coffee cups, such as the two coffee cups in the picture, and also includes the title of the target product: "Large capacity coffee cup with spoon"; the introduction of the target product: "High glaze firing, safe and secure, warm tone, bringing a different experience to life"; and the product parameters of the target product: "rich styles, 500ml".
  • Step 604 Input the image data into the pre-trained image classification model, and obtain the visual attribute information of the target product through classification recognition by the image classification model, where the visual attribute information represents the explicit characteristics of the target product.
  • Specifically, the image data is input into the pre-trained image classification model, and the visual attribute information of the target product is obtained as "white, warm brown, striped, non-striped, soft color, simple and elegant".
  • Step 606 Combine text data and visual attribute information to determine the product attribute set of the target product.
  • text data and visual attribute information are combined to determine the product attribute set of the target product as "large-capacity coffee cup with spoon, high-glaze firing, safe and secure, warm tone, bringing different experiences to life, rich styles, 500ml, white, warm brown, striped, non-striped, soft colors, simple and elegant.”
  • Step 608 Input the product attribute set into the pre-trained text processing model, and use the text processing model to generate the target description text of the target product.
  • FIG. 8 shows a schematic diagram of a display interface of a client in a text generation method provided by an embodiment of this specification.
  • the target description text included in the client display interface is "This is a large-capacity coffee cup with a spoon, with a capacity of 500ml. This coffee cup comes in various styles, including white, warm brown, striped, and non-striped. The colors are soft, simple, and elegant. The coffee cup is fired with high glaze, which is safe and secure, and brings you a different life experience."
  • Step 610 Display the target description text on the client, so that the virtual anchor can introduce the target product based on the target description text.
  • FIG. 9 shows a schematic structural diagram of a text generation device provided by an embodiment of this specification. As shown in Figure 9, the device includes:
  • the acquisition module 902 is configured to acquire image and text data of the target object, where the image and text data includes image data and text data;
  • the identification module 904 is configured to identify the visual attribute information of the target object based on the image data, where the visual attribute information represents the explicit characteristics of the target object;
  • the determination module 906 is configured to determine the object attribute set of the target object based on the text data and visual attribute information
  • the generation module 908 is configured to generate target description text of the target object based on the object attribute set.
  • the acquisition module 902 is further configured to monitor the graphic and text data of the target object; when the graphic and text data is updated, acquire the graphic and text data of the target object.
  • the device further includes: a calling module configured to, when the object currently displayed by the client is the target object, call the target description text from a preset database, where the preset database is used to store the generated target description text, and display the target description text on the client; or perform audio conversion on the target description text, and generate and play the audio data corresponding to the target description text.
  • the target object includes the target product; the determination module 906 is further configured to determine the product attribute set of the target product based on text data and visual attribute information, where the text data includes the title, introduction, and product parameters of the target product. at least one;
  • the generation module 908 is further configured to generate a target description text of the target product based on the product attribute set.
  • the generation module 908 is further configured to input the object attribute set into a pre-trained text processing model, and generate a target description text of the target object through the text processing model;
  • the device also includes: a text processing model training module configured to obtain a first sample set, where the first sample set includes multiple sample objects and each sample object carries sample text data and sample description text; identify each sample description text and determine the sample visual attribute information of each sample object; perform data augmentation on each sample text data and determine the augmented text data of each sample object; and train the preset processing model based on the sample visual attribute information, sample text data, and augmented text data of the multiple sample objects to obtain the text processing model.
  • Optionally, the sample objects include sample commodities; the text processing model training module is further configured to: extract the live broadcast data of each sample commodity from the live broadcast rooms of multiple sample commodities, where the live broadcast data includes video data and voice data; recognize and convert the live broadcast data to generate the sample description text of each sample product; extract the sample text data of each sample product from the detail pages of the multiple sample products; and construct the first sample set based on the sample text data and sample description text of the multiple sample products.
  • Optionally, the text processing model training module is further configured to: merge the sample text data and sample visual attribute information of each sample object to determine the initial training sample of each sample object; merge the augmented text data and sample visual attribute information of each sample object to determine the augmented training sample of each sample object; and use the initial training samples, augmented training samples, and sample description texts of the multiple sample objects to train the preset processing model and obtain the text processing model.
  • Optionally, the text processing model training module is further configured to: extract the first initial training sample and the first augmented training sample of the first sample object, where the first sample object is any sample object in the first sample set; input the first initial training sample into the preset processing model to generate the first prediction description text, and input the first augmented training sample into the preset processing model to generate the second prediction description text; calculate the first loss value based on the first prediction description text and the first sample description text; calculate the second loss value based on the second prediction description text and the first sample description text; calculate the third loss value based on the first prediction description text and the second prediction description text; based on the first loss value, the second loss value, and the third loss value, adjust the model parameters of the preset processing model, and return to the step of extracting the first initial training sample and the first augmented training sample of the first sample object; and when the first training stop condition is reached, obtain the text processing model that has completed training.
  • the preset processing model includes an encoder; the device further includes: an encoder training module configured to input the first initial training sample into the encoder to generate a first feature vector; input the first sample description text into the encoder to generate a second feature vector; calculate an encoding loss value from the first feature vector and the second feature vector; adjust the parameters of the encoder based on the encoding loss value, and return to the step of inputting the first initial training sample into the encoder to generate the first feature vector; and, when the second training stop condition is reached, determine the trained encoder.
  • the recognition module 904 is further configured to input the image data into a pre-trained picture classification model and obtain the visual attribute information of the target object through the classification and recognition of the picture classification model;
  • the device also includes: a picture classification model training module configured to obtain a second sample set, where the second sample set includes a plurality of sample objects, each carrying sample image data and a sample description text; identify each sample description text to determine the sample visual attribute information of each sample object; and train a preset classification model with the sample image data and sample visual attribute information of the multiple sample objects to obtain the picture classification model.
  • the picture classification model training module is further configured to extract the second sample image data and second sample visual attribute information of a second sample object, where the second sample object is any sample object in the second sample set; input the second sample image data into the preset classification model to obtain the predicted visual attribute information of the second sample object; calculate the classification loss value of the preset classification model from the second sample visual attribute information and the predicted visual attribute information of the second sample object; adjust the model parameters of the preset classification model according to the classification loss value, and return to the step of extracting the second sample image data and second sample visual attribute information of the second sample object; and, when the third training stop condition is reached, obtain the trained picture classification model.
  • applying the solution of the embodiments of this specification: graphic and text data of the target object is obtained, where the graphic and text data includes image data and text data; the visual attribute information of the target object is identified based on the image data, where the visual attribute information represents the explicit characteristics of the target object; the object attribute set of the target object is determined based on the text data and the visual attribute information; and the target description text of the target object is generated based on the object attribute set. Determining the visual attribute information takes the explicit characteristics of the target object into account, making its object attributes more comprehensive; and determining the object attribute set from both the text data and the visual attribute information integrates the two, making the generated target description text more coherent and further improving its accuracy.
  • Figure 10 shows a structural block diagram of a computing device provided by an embodiment of this specification.
  • Components of the computing device 1000 include, but are not limited to, memory 1010 and processor 1020.
  • the processor 1020 is connected to the memory 1010 through a bus 1030, and the database 1050 is used to save data.
  • Computing device 1000 also includes an access device 1040 that enables computing device 1000 to communicate via one or more networks 1060 .
  • Examples of these networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet.
  • Access device 1040 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a World Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
  • the above-mentioned components of the computing device 1000 and other components not shown in FIG. 10 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 10 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art can add or replace other components as needed.
  • Computing device 1000 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile telephone (e.g., smartphone), a wearable computing device (e.g., smart watch, smart glasses, etc.) or other type of mobile device, or a stationary computing device such as a desktop computer or PC.
  • Computing device 1000 may also be a mobile or stationary server.
  • the processor 1020 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above text generation method.
  • the above is a schematic solution of a computing device in this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned text generation method belong to the same concept. For details that are not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the above text generation method.
  • An embodiment of the present specification also provides a computer-readable storage medium that stores computer-executable instructions which, when executed by a processor, implement the steps of the above text generation method.
  • An embodiment of the present specification also provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above text generation method.
  • the computer instructions include computer program code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals, software distribution media, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a text generation method and apparatus. The text generation method includes: obtaining graphic-text data of a target object, where the graphic-text data includes image data and text data; identifying visual attribute information of the target object based on the image data, where the visual attribute information represents explicit characteristics of the target object; determining an object attribute set of the target object based on the text data and the visual attribute information; and generating a target description text of the target object based on the object attribute set. By obtaining multi-modal graphic-text data of the target object and determining its visual attribute information, the explicit characteristics of the target object are taken into account, making its object attributes more comprehensive; and determining the object attribute set from both the text data and the visual attribute information integrates the two, making the generated target description text more coherent and further improving its accuracy.

Description

文本生成方法以及装置
本申请要求于2022年08月30日提交中国专利局、申请号为202211048016.0、申请名称为“文本生成方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本说明书实施例涉及计算机技术领域,特别涉及一种文本生成方法。本说明书一个或者多个实施例同时涉及一种文本生成装置,一种计算设备,一种计算机可读存储介质。
背景技术
随着计算机技术的发展,文本摘要的生成逐渐成为自然语言处理领域的热点话题。以电商场景为例,在电商场景中,每种商品的描述通常由丰富多样的数据构成,为了能够更好地描述商品的特点,吸引用户进行购买,需要生成商品对应的文本摘要,供用户快速准确地了解商品的信息。
目前,通常由主播对商品信息进行充分理解,并将商品的显著特点进行概述。然而,由于在电商领域中商品是海量的,由人工概述获得商品的文本摘要,需要花费大量人力,付出高昂的成本,并且,人工势必会引入大量不确定性因素,导致生成的文本摘要准确性差。因此,亟需一种准确的文本生成方案。
发明内容
有鉴于此,本说明书实施例提供了一种文本生成方法。本说明书一个或者多个实施例同时涉及一种文本生成装置,一种计算设备,一种计算机可读存储介质以及一种计算机程序,以解决现有技术中存在的技术缺陷。
根据本说明书实施例的第一方面,提供了一种文本生成方法,包括:
获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;
基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;
根据文本数据和视觉属性信息,确定目标对象的对象属性集;
基于对象属性集,生成目标对象的目标描述文本。
根据本说明书实施例的第二方面,提供了一种文本生成装置,包括:
获取模块,被配置为获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;
识别模块,被配置为基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;
确定模块,被配置为根据文本数据和视觉属性信息,确定目标对象的对象属性集;
生成模块,被配置为基于对象属性集,生成目标对象的目标描述文本。
根据本说明书实施例的第三方面,提供了一种计算设备,包括:
存储器和处理器;
所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,该计算机可执行指令被处理器执行时实现上述文本生成方法的步骤。
根据本说明书实施例的第四方面,提供了一种计算机可读存储介质,其存储有计算机可执行指令,该指令被处理器执行时实现上述文本生成方法的步骤。
根据本说明书实施例的第五方面,提供了一种计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行上述文本生成方法的步骤。
本说明书一个实施例提供的文本生成方法,获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本。通过获取目标对象的多模态图文数据,确定目标对象的视觉属性信息,考虑了目标对象的显性特征,使得目标对象的对象属性更加全面,并且,根据文本数据和视觉属性信息,确定目标对象的对象属性集,综合了目标对象的文本数据以及视觉属性信息,使得生成的目标描述文本更加连贯,进一步提高了目标描述文本的准确性。
附图说明
图1是本说明书一个实施例提供的一种文本生成系统的框架图;
图2是本说明书一个实施例提供的另一种文本生成系统的框架图;
图3是本说明书一个实施例提供的一种文本生成方法的流程图;
图4是本说明书一个实施例提供的一种文本生成方法中文本处理模型的训练流程图;
图5是本说明书一个实施例提供的一种文本生成方法中图像分类模型的训练流程图;
图6是本说明书一个实施例提供的一种文本生成方法的处理过程流程图;
图7是本说明书一个实施例提供的一种文本生成方法中目标商品详情页的示意图;
图8是本说明书一个实施例提供的一种文本生成方法中客户端的显示界面示意图;
图9是本说明书一个实施例提供的一种文本生成装置的结构示意图;
图10是本说明书一个实施例提供的一种计算设备的结构框图。
具体实施方式
在下面的描述中阐述了很多具体细节以便于充分理解本说明书。但是本说明书能够以很多不同于在此描述的其它方式来实施,本领域技术人员可以在不违背本说明书内涵的情况下做类似推广,因此本说明书不受下面公开的具体实施的限制。
在本说明书一个或多个实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本说明书一个或多个实施例。在本说明书一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本说明书一个或多个实施例中使用的术语“和/或”是指并 包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本说明书一个或多个实施例中可能采用术语第一、第二等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本说明书一个或多个实施例范围的情况下,第一也可以被称为第二,类似地,第二也可以被称为第一。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
首先,对本说明书一个或多个实施例涉及的名词术语进行解释。
模态:指数据存在的形式,如自然语言、图片等形式。
商品摘要:基于商品的信息,如商品的描述、外观等,来生成一段简短的、具有商品显著信息的文本摘要。
自然语言生成:使计算机具有人一样的表达和写作的功能。即能够根据一些关键信息及其在机器内部的表达形式,经过一个规划过程,来自动生成一段高质量的自然语言文本。
BART(Bidirectional and Auto-Regressive Transformers):一种兼具上下文语境信息和自回归特性的模型,该模型输入自然语言,生成自然语言。
自动语音识别(ASR,Automatic Speech Recognition):将人类表述的语言转化为对应的文字的技术。
词性标注:一种可以在句子中把每个词的词性标注出来的技术。
互信息:两个随机变量之间的依赖度。
在本说明书中,提供了一种文本生成方法,本说明书同时涉及一种文本生成装置,一种计算设备,以及一种计算机可读存储介质,在下面的实施例中逐一进行详细说明。
随着计算机技术的发展,文本摘要的生成逐渐成为自然语言处理领域的热点话题。以电商场景为例,在电商场景中,每种商品的描述通常由丰富多样的数据构成,例如商品的标题、详细的文本描述和图像等。为了能够更好地描述商品的特点,吸引用户进行购买,需要生成商品对应的文本摘要,供用户快速准确地了解商品的信息。
目前,通常由主播对商品信息进行充分理解,并将商品的显著特点进行概述。然而,由于在电商领域中商品是海量的,由人工编排获得商品的文本摘要,需要花费大量人力,付出高昂的成本,并且,人工势必会引入大量不确定性因素,大部分文本摘要仅为简单的拼接,导致生成的文本摘要准确性差,修改成本高。因此,亟需一种准确的文本生成方案。
为了提高文本生成的效率以及准确性，本方案提供了一种基于多模态数据生成描述文本的方案：给定目标对象的多模态图文数据，端到端地自动化生成能够准确概括目标对象特点、突出目标对象优势的描述文本。
具体实施时,本说明书实施例提供的文本生成方法,获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本。通过获取目标对象的多模态图文数据,确定目标对象的视觉属性信息,考虑了目标对象的显性特征,使得目标对象的对象属性更加全面,并且,根据文本数据和视觉属性信息,确定目标对象的对象属性集, 综合了目标对象的文本数据以及视觉属性信息,使得生成的目标描述文本更加连贯,进一步提高了目标描述文本的准确性。
参见图1,图1示出了本说明书一个实施例提供的一种文本生成系统的框架图,其中,文本生成系统包括服务端和客户端:
客户端:向服务端发送目标对象的图文数据,其中,图文数据包括图像数据和文本数据;
服务端:获取目标对象的图文数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本,并将目标描述文本发送至客户端,以使客户端显示目标描述文本。
客户端:接收并显示服务端发送的目标描述文本,以使用户根据目标描述文本对目标对象进行介绍。
值得说明的是,本说明书实施例中提供的文本生成方法一般由服务端执行,但是,在本说明书的其它实施例中,客户端也可以与服务端具有相似的功能,从而执行本说明书实施例所提供的文本生成方法。在其他实施例中,本说明书实施例所提供的文本生成方法还可以是由客户端与服务端共同执行。
应用本说明书实施例的方案,获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本。通过获取目标对象的多模态图文数据,确定目标对象的视觉属性信息,考虑了目标对象的显性特征,使得目标对象的对象属性更加全面,并且,根据文本数据和视觉属性信息,确定目标对象的对象属性集,综合了目标对象的文本数据以及视觉属性信息,使得生成的目标描述文本更加连贯,进一步提高了目标描述文本的准确性。
本说明书一个或多个实施例提供的方案,可以应用于文本生成场景,如电商直播场景、会议场景、教育场景等等,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
参见图2,图2示出了本说明书一个实施例提供的另一种文本生成系统的框架图,该系统可以包括服务端100以及多个客户端200。多个客户端200之间通过服务端100可以建立通信连接,在文本生成场景中,服务端100即用来在多个客户端200之间提供文本生成服务,多个客户端200可以分别作为发送端或接收端,通过服务端100实现实时通信。
用户通过客户端200可与服务端100进行交互以接收其它客户端200发送的数据,或将数据发送至其它客户端200等。在文本生成场景中,可以是用户通过客户端200向服务端100发布数据流,服务端100将该数据流推送至订阅该数据流的客户端中。数据流例如可以是图文数据。如在电商直播场景中,用户通过客户端可以实时采集目标商品的图文数据,并将图文数据发送至服务端,服务端可以根据客户端发送的图文数据,生成对应的商品描述文本,将该商品描述文本推送至包括该商品的所有直播间,以使主播根据商品描述文本对目标商品进行介绍。又如在会议场景中,参会用户通过客户端可以实时采集图文数 据并发送至服务端,服务端可以对客户端发送的图文数据进行处理,生成摘要文本,并将摘要文本推送至其它参会用户的客户端等。
其中,客户端200与服务端100之间通过网络建立连接。网络为客户端与服务端之间提供了通信链路的介质。网络可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。客户端200所传输的数据可能需要经过编码、转码、压缩等处理之后才发布至服务端100。
客户端200可以为浏览器、APP(Application,应用程序)、或网页应用如H5(HyperText Markup Language5,超文本标记语言第5版)应用、或轻应用(也被称为小程序,一种轻量级应用程序)或云应用等,客户端200可以基于服务端提供的相应服务的软件开发工具包(SDK,Software Development Kit),如基于实时通信(RTC,Real Time Communication)SDK开发获得等。客户端200可以部署在电子设备中,需要依赖设备运行或者设备中的某些App而运行等。电子设备例如可以具有显示屏并支持信息浏览等,如可以是个人移动终端如手机、平板电脑、个人计算机等。在电子设备中通常还可以配置各种其它类应用,例如人机对话类应用、模型训练类应用、文本处理类应用、网页浏览器应用、购物类应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。
服务端100可以包括提供各种服务的服务器,例如为多个客户端提供通信服务的服务器,又如为客户端上使用的模型提供支持的用于后台训练的服务器,又如对客户端发送的数据进行处理的服务器等。
需要说明的是,服务端100可以实现成多个服务器组成的分布式服务器集群,也可以实现成单个服务器。服务器也可以为分布式系统的服务器,或者是结合了区块链的服务器。服务器也可以是云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(CDN,Content Delivery Network)、以及大数据和人工智能平台等基础云计算服务的云服务器,或者是带人工智能技术的智能云计算服务器或智能云主机。
参见图3,图3示出了本说明书一个实施例提供的一种文本生成方法的流程图,具体包括以下步骤:
步骤302:获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据。
本说明书一个或多个实施例中,随着计算机技术的发展,针对目标对象的描述形式也越来越丰富,如商品的描述包括标题、详细的文本描述以及商品展示图像等。为了准确生成目标对象的描述文本,可以获取目标对象的多模态数据,多模态数据可以包括图像数据以及文本数据,进一步根据多模态的图文数据生成目标对象的目标描述文本。
具体地,目标对象是指需要生成目标描述文本的对象,也可以理解为等待生成目标描述文本的对象,包括但不限于商品、人物、风景、名胜古迹等等。目标对象的图文数据是指包括目标对象相关信息的图像数据以及文本数据。图像数据可以是目标对象的配图、照片、设计图等等,文本数据可以是目标对象的名称、结构化属性、细节信息、工艺信息等等。
实际应用中,获取目标对象的图文数据的方式有多种,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
本说明书一种可选的实现方式中,可以在接收到文本生成指令的情况下,获取目标对象的图文数据。一种可能的方式中,文本生成指令中携带了用户输入的涵盖目标对象信息的图文数据;另一种可能的方式中,文本生成指令中包括目标对象的唯一标识,根据该唯一标识,可以确定目标对象,进一步获取目标对象的图文数据。
示例性地,以目标对象为目标商品为例,由于商品的详情页中存在大量的商品细节信息,并且整个详情页之间存在上下文语义连贯性,能够完成涵盖目标商品的信息,因此,接收文本生成指令后,根据文本生成指令中的目标对象的唯一标识,可以从目标商品的详情页中获取目标对象的图文数据。
本说明书另一种可选的实现方式中,由于目标对象的图文数据通常是变化的,因此可以监测目标对象的图文数据,在图文数据产生变化的情况下,实时获取目标对象的图文数据,生成目标对象的目标描述文本,以便于用户在需要目标描述文本时,可以立刻查询到目标描述文本。也即,上述获取目标对象的图文数据的步骤,可以包括以下步骤:
对目标对象的图文数据进行监测;
在图文数据更新的情况下,获取目标对象的图文数据。
本说明书实施例中,图文数据的更新包括增加、删除、替换、更改等,本说明书实施例中,目标对象的图文数据只要有变动,即可认为目标对象的图文数据产生更新。
进一步地,由于目标描述文本的生成过程会花费一定时间,本说明书实施例中还可以采用离线定时的方式生成目标对象的目标描述文本。离线定时的方式是指定时更新目标对象的目标描述文本。
需要说明的是,在定时更新目标描述文本之前,可以检测目标对象的图文数据是否发生变更,也即定时任务启动时,将当前目标对象的图文数据与上次更新时目标对象的图文数据进行比较。若图文数据发生变更,则触发定时任务,获取目标对象的图文数据,基于该图文数据生成目标描述文本;若图文数据未发生变更,则不对目标对象的描述文本进行更新。
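As a minimal sketch of this change-detection step, assuming the graphic-text data is a JSON-serializable dict and that a hypothetical generate_fn wraps the downstream generation pipeline:

```python
import hashlib
import json

# Last-seen fingerprint per object id. Kept in memory for brevity;
# a real deployment would persist this alongside the preset database.
_last_fingerprint: dict[str, str] = {}

def fingerprint(image_text_data: dict) -> str:
    """Hash the graphic-text data so any add/delete/replace/change is caught."""
    payload = json.dumps(image_text_data, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def refresh_if_changed(object_id: str, image_text_data: dict, generate_fn):
    """Body of the timed task: regenerate only when the data has changed."""
    fp = fingerprint(image_text_data)
    if _last_fingerprint.get(object_id) == fp:
        return None                      # unchanged: keep the cached description
    _last_fingerprint[object_id] = fp
    return generate_fn(image_text_data)  # changed: trigger regeneration
```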
应用本说明书实施例的方案,通过对目标对象的图文数据进行监测,在图文数据更新的情况下,获取目标对象的图文数据,实现了主动生成目标对象的目标描述文本,节省了用户获得目标描述文本的时间,提高了用户体验度。
步骤304:基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征。
本说明书一个或多个实施例中,在获取目标对象的图文数据之后,进一步可以基于图文数据中包括的图像数据,识别目标对象的视觉属性信息,通过生成视觉属性信息,相当于将图像数据转化为文本数据,统一了目标对象的多模态数据,减小了多种模态之间的模态异构性。
具体地,视觉属性信息表征目标对象的显性特征,显性特征是指目标对象显现的特征,可以是目标对象的颜色、形状等名词特征,还可以是美观、漂亮、大方等形容词特征,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
实际应用中,基于图像数据,识别目标对象的视觉属性信息的方式有多种,具体根据 实际情况进行选择,本说明书实施例对此不作任何限定。
本说明书一种可选的实现方式中,由于图像数据中可能包括目标对象的文本数据,因此,可以利用光学字符识别(OCR,Optical Character Recognition)获得图像数据中的文本数据。还可以利用图像颜色识别工具获得图像数据中的视觉属性信息。
本说明书另一种可选的实现方式中,可以利用预先训练的图片分类模型识别目标对象的视觉属性信息,也即,上述基于图像数据,识别目标对象的视觉属性信息的步骤,可以包括以下步骤:
将图像数据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标对象的视觉属性信息。
具体地,预先训练的图片分类模型是对预设分类模型进行训练生成的模型,预设分类模型是指能够实现分类的模型,如Swin Transformer模型、残差神经网络(ResNet,Residual Network)、图像分类变换模型(Vit,Vision Transformer),具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
以图像分类变换模型为例,把图像数据输入图像分类变换模型,和传统的卷积神经网络输入图片不同的是,这里将图像数据分为一个个分块(patch),如将图像分成9个patch。每个patch的大小是可以指定的,比如16×16等等。然后把每个patch输入到嵌入层(embedding),通过该层以后,可以得到一系列向量(token),9个patch都会得到它们对应的向量,然后在所有的向量之前加入一个用于分类的向量,这个分类的向量的维度和其他9个向量一致。此外,还需要加入位置信息。然后把所有的向量输入Transformer编码器(Encoder)中,然后把Transformer Encoder重复堆叠L次,再将用于分类的token的输出输入多层感知器(MLP,Multilayer Perceptron)Head,然后得到最终分类的结果。
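A minimal PyTorch sketch of the patch-embedding pipeline just described: patchify the image, prepend a classification token, add positional information, stack the Transformer encoder L times, and classify from the class token's output. All sizes here are illustrative assumptions, not values fixed by this specification:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + [CLS] token +
    positional embedding + stacked Transformer encoder + MLP head."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_attrs=100):
        super().__init__()
        n_patches = (img_size // patch) ** 2           # e.g. 14*14 = 196 patches
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # stacked L times
        self.head = nn.Linear(dim, num_attrs)          # head on the [CLS] output

    def forward(self, x):                              # x: (B, 3, H, W)
        p = self.embed(x).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = torch.cat([cls, p], dim=1) + self.pos      # prepend CLS, add positions
        h = self.encoder(h)
        return self.head(h[:, 0])                      # logits over visual attributes

logits = TinyViT()(torch.randn(2, 3, 224, 224))        # shape (2, 100)
```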
应用本说明书实施例的方案,将图像数据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标对象的视觉属性信息,提高了获得目标对象的视觉属性信息的效率以及准确性,进一步使得后续生成的目标描述文本更加准确。
值得说明的是,在获取目标对象的视觉属性信息之后,可以对比目标对象的视觉属性信息和文本数据,根据对比结果对目标对象的文本数据进行修改。
示例性地,目标对象的文本数据为“红色衣服女式显年轻”,获得目标对象的视觉属性信息为“玫红色显白”,对比文本数据和视觉属性信息,将目标对象文本数据中的“红色”替换为“玫红色”,获得修改后的文本数据为“玫红色衣服女式显年轻”。
步骤306:根据文本数据和视觉属性信息,确定目标对象的对象属性集。
本说明书一个或多个实施例中,在获取目标对象的图文数据,基于图像数据,识别目标对象的视觉属性信息之后,进一步可以根据文本数据和视觉属性信息,确定目标对象的对象属性集,通过综合文本数据和视觉属性信息,丰富了目标对象的对象属性,使得生成的目标描述文本更加连贯、准确。
具体地,对象属性集是指由多个目标对象的对象属性信息构成的集合,对象属性信息中包括目标对象的文本数据以及视觉属性信息,对象属性信息可以理解为完整描述目标对象属性的文本信息。
实际应用中,可以对文本数据和视觉属性信息进行合并拼接,确定目标对象的对象属性集。例如目标对象的文本数据为“橘色猫咪沙发靠枕”,视觉属性信息为“橘色高级感”,将目标对象的文本数据和视觉属性信息进行拼接,即可确定目标对象的对象属性集中包括的内容为“橘色猫咪沙发靠枕橘色高级感”。
进一步地,为了减少数据处理量,提高文本生成效率,在对文本数据和视觉属性信息进行拼接时,还可以取文本数据和视觉属性信息的并集,引用上述示例,确定的对象属性集为“橘色猫咪沙发靠枕高级感”。
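A small sketch of this union-style merge, reproducing the cushion example above; matching attributes by exact substring is a simplifying assumption:

```python
def build_attribute_set(text_data: str, visual_attrs: list[str]) -> str:
    """Concatenate the text data with the visual attributes, dropping
    attribute words already present in the text (the 'union' variant)."""
    kept = [a for a in visual_attrs if a not in text_data]
    return text_data + "".join(kept)

# '橘色' already appears in the text, so only '高级感' is appended.
print(build_attribute_set("橘色猫咪沙发靠枕", ["橘色", "高级感"]))
# -> 橘色猫咪沙发靠枕高级感
```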
本说明书一种可选的实现方式中,以目标对象为目标商品为例,上述根据文本数据和视觉属性信息,确定目标对象的对象属性集的步骤,可以包括以下步骤:
根据文本数据和视觉属性信息,确定目标商品的商品属性集,其中,文本数据包括目标商品的标题、简介、产品参数中的至少一种。
具体地,商品的标题通常包括商品的品牌名等,商品简介通常包括商品的产地、功能等,商品的产品参数通常包括商品的尺寸、材质、货号等,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
示例性地,以目标商品为抱枕为例,目标商品的标题为“大熊抱枕毛绒巨型靠背床头靠垫生日礼物”,目标商品的简介为“熊猫造型的抱枕可爱童真,手感柔软,还是刷手机和阅读的好伴侣”,目标商品的产品参数为“货号:00001,材质:其他,大小:70cm*90cm”。
应用本说明书实施例的方案,根据文本数据和视觉属性信息,确定目标商品的商品属性集,其中,文本数据包括目标商品的标题、简介、产品参数中的至少一种,丰富了目标商品的对象属性,使得生成的商品描述文本更加连贯、准确。
步骤308:基于对象属性集,生成目标对象的目标描述文本。
本说明书一个或多个实施例中,获取目标对象的图文数据,基于图像数据,识别目标对象的视觉属性信息,根据文本数据和视觉属性信息,确定目标对象的对象属性集之后,进一步可以基于对象属性集,生成目标对象的目标描述文本。
具体地,目标描述文本是指可以简明、确切地描述目标对象的文本。在本说明书实施例中,描述文本还可以理解为摘要文本、剧本、概要、内容提要以及摘要剧本。
需要说明的是,以目标对象为目标商品为例,目标商品的目标描述文本即为商品描述文本,上述基于对象属性集,生成目标对象的目标描述文本的步骤,可以包括以下步骤:
基于商品属性集,生成目标商品的目标描述文本。
应用本说明书实施例的方案,获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本。通过获取目标对象的多模态图文数据,确定目标对象的视觉属性信息,考虑了目标对象的显性特征,使得目标对象的对象属性更加全面,并且,根据文本数据和视觉属性信息,确定目标对象的对象属性集,综合了目标对象的文本数据以及视觉属性信息,使得生成的目标描述文本更加连贯,进一步提高了目标描述文本的准确性。
实际应用中,基于对象属性集,生成目标对象的目标描述文本的方式有多种,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
本说明书一种可选的实现方式中,可以将对象属性集中的文本内容进行分词处理,利用预先设置的描述文本生成模板对分词获得的各词语进行处理,生成目标对象的目标描述文本。其中,分词处理的方式可以是利用分词工具进行分词处理,还可以是利用预设词语表匹配获得分词结果,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
示例性地,以对象属性集中的文本内容为“橘色猫咪沙发靠枕高级感”为例,对该文本内容进行分词,获得分词结果为“橘色、猫咪、沙发靠枕、高级感”,获取预先设置的描述文本生成模板为“XX是XX形状的,给人XX的感觉”,将分词结果填充至描述文本生成目标中,获得目标描述文本为“沙发靠枕是橘色猫咪形状的,给人高级的感觉”。
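The preset-word-table variant of this segmentation-plus-template approach can be sketched as follows; the vocabulary, the template, and the greedy longest-match strategy are illustrative assumptions:

```python
def segment(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match segmentation against a preset word table."""
    words, i = [], 0
    vocab = sorted(vocab, key=len, reverse=True)   # try longer words first
    while i < len(text):
        for w in vocab:
            if text.startswith(w, i):
                words.append(w)
                i += len(w)
                break
        else:
            i += 1   # skip characters not covered by the vocabulary
    return words

TEMPLATE = "{item}是{color}{shape}形状的，给人{feel}的感觉"
tokens = segment("橘色猫咪沙发靠枕高级感", ["橘色", "猫咪", "沙发靠枕", "高级感"])
color, shape, item, feel = tokens
print(TEMPLATE.format(item=item, color=color, shape=shape, feel=feel))
# -> 沙发靠枕是橘色猫咪形状的，给人高级感的感觉
```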
本说明书另一种可选的实现方式中,可以利用预先训练的文本处理模型生成目标描述文本,也即,上述基于对象属性集,生成目标对象的目标描述文本的步骤,可以包括以下步骤:
将对象属性集输入预先训练的文本处理模型中,经文本处理模型生成目标对象的目标描述文本。
具体地,预先训练的文本处理模型是对预设处理模型进行训练生成的模型,预设处理模型是指能够实现文本处理的模型,如兼具上下文语境信息和自回归特性的Transformer模型(BART,Bidirectional and Auto-Regressive Transformers)、文本到文本传输转换模型(T5,Text-to-Text Transfer Transformer)、预训练模型(GPT,Generative Pre-Training)等,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
以BART模型为例,BART是一个编码器-解码器(Encoder-Decoder)的结构,其Encoder端的输入是加了噪音的序列,Decoder端的输入是添加了起始符(right-shifted)的序列,Decoder端的目标是原序列。
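A hedged sketch of driving such an encoder-decoder model for description generation through the Hugging Face transformers API; the checkpoint name is only an example of a publicly available Chinese BART, not the model actually trained in this specification:

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "fnlp/bart-base-chinese"   # example checkpoint, an assumption
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

attribute_set = "咖啡杯大容量带勺子，高釉烧制，安全放心，暖调，样式丰富，500ml，白色"
inputs = tokenizer(attribute_set, return_tensors="pt", truncation=True, max_length=512)
ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4,
                     no_repeat_ngram_size=3)   # beam search keeps the text fluent
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```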
应用本说明书实施例的方案,将对象属性集输入预先训练的文本处理模型中,经文本处理模型生成目标对象的目标描述文本,提高了获得目标描述文本的效率以及生成的目标描述文本的准确性。
值得说明的是,生成目标对象的目标描述文本之后,可以直接在客户端显示目标描述文本。还可以将目标描述文本存储至预设数据库中,在当前客户端关联到目标对象时,再从预设数据库中调用目标描述文本,也即,上述基于对象属性集,生成目标对象的目标描述文本的步骤之后,还可以包括以下步骤:
在客户端当前展示的对象为目标对象的情况下,从预设数据库中调用目标描述文本,其中,预设数据库用于存储生成的目标描述文本;
在客户端显示目标描述文本;或者,对目标描述文本进行音频转换,生成并播放目标描述文本对应的音频数据。
具体地,若客户端当前展示的对象为目标对象,则表示需要获取目标对象的目标描述文本。此时,可以在预设数据库中查找目标描述文本,判断预设数据库中是否存在预先生成的目标描述文本。若存在,则直接从预设数据库中调用该目标描述文本,在客户端显示 目标描述文本。若预设数据库中没有目标描述文本,则可以利用本说明书实施例提供的文本生成方法,实时生成目标描述文本,并在客户端显示生成的目标描述文本。
进一步地,由于客户端显示目标描述文本,用户可以根据目标描述文本对目标对象进行介绍。为了减轻用户工作量,还可以利用文本-音频转换工具对目标描述文本进行音频转换,生成目标描述文本对应的音频数据,在生成音频数据后,主动播放该音频数据,实现对目标对象的介绍。
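A minimal sketch of this look-up-then-generate-then-optionally-speak flow; the preset database is simplified to a dict, and generate_fn / tts_fn are hypothetical stand-ins for the generation pipeline and the unspecified text-to-audio tool:

```python
def show_or_speak(object_id, cache: dict, generate_fn, tts_fn=None):
    """Serve the description from the preset database if present,
    otherwise generate it in real time; optionally convert to audio."""
    text = cache.get(object_id)
    if text is None:
        text = generate_fn(object_id)   # real-time generation fallback
        cache[object_id] = text         # store for later look-ups
    if tts_fn is not None:
        return tts_fn(text)             # play the generated audio
    return text                         # display on the client
```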
应用本说明书实施例的方案,在客户端当前展示的对象为目标对象的情况下,从预设数据库中调用目标描述文本,节省了用户获得目标描述文本的时间,提高了用户体验度;在客户端显示目标描述文本,无需用户仔细了解目标对象,可以直接根据目标描述文本对目标对象进行介绍;生成并播放目标描述文本对应的音频数据,无需用户进行介绍,节省了大量人力成本。
下面对图1所示实施例中文本处理模型的训练方式进行详细说明。
本说明书一个或多个实施例中,文本处理模型的训练方式,可以包括以下步骤:
获取第一样本集,其中,第一样本集中包括多个样本对象,每个样本对象携带样本文本数据和样本描述文本;
识别每个样本描述文本,确定各样本对象的样本视觉属性信息;
对每个样本文本数据进行数据增广,确定各样本对象的增广文本数据;
基于多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得文本处理模型。
具体地,样本对象用于训练文本处理模型,样本对象包括但不限于商品、人物、风景、名胜古迹等等。样本对象携带的样本文本数据为描述样本对象的文本数据,如样本对象的名称、特有属性、细节信息、工艺信息等等。样本描述文本为样本对象对应的描述文本,样本描述文本也可以理解为样本摘要文本、样本剧本、样本概要、样本内容提要以及样本摘要剧本。一般情况下,获取第一样本集的方式可以是人工输入的大量样本文本数据以及样本描述文本组成第一样本集;也可以是从其他数据获取设备或者数据库中读取大量样本文本数据以及样本描述文本组成第一样本集,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
实际应用中,识别每个样本描述文本,确定各样本对象的样本视觉属性信息的方式可以是对每个样本描述文本进行分词处理,将各分词结果与预先设置的视觉属性词表进行匹配,获得各样本对象的样本视觉属性信息;还可以直接对样本描述文本进行词性标注,保留得到的名词和形容词,确定样本视觉属性信息。
本说明书实施例中,考虑到同一个语义对应多个词语,如表达好看的词语有美丽、漂亮、颜值高等,因此,可以对样本对象的样本文本数据进行数据增广,扩充样本对象的样本文本数据,使得样本文本数据更加多样化,对样本数据中增加一定的噪音,进一步使得训练的模型具有更强的泛化能力。
示例性地,样本对象的样本文本数据为“这件衣服真好看”,将样本文本数据中的“好看”替换为好看的近义词,实现对样本文本数据的数据增广,获得增广文本数据为“这件 衣服真美丽”、“这件衣服真漂亮”、“这件衣服真棒”等,其中,增广文本数据可以是一个,也可以是多个,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
应用本说明书实施例的方案,获取第一样本集,其中,第一样本集中包括多个样本对象,每个样本对象携带样本文本数据和样本描述文本,识别每个样本描述文本,确定各样本对象的样本视觉属性信息,对每个样本文本数据进行数据增广,确定各样本对象的增广文本数据,基于多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得文本处理模型,考虑了样本对象的显性特征,使得样本对象的对象属性更加全面,扩充样本对象的样本文本数据,使得样本文本数据更加多样化,进一步使得训练的模型具有更强的泛化能力,提高了训练后的模型的准确性。
示例性地,以样本对象为样本商品为例,可以从样本商品的直播间以及商品详情页中获得样本文本数据以及样本描述文本,进一步构建第一样本集,也即,上述获取第一样本集的步骤,可以包括以下步骤:
从多个样本商品的直播间中提取各样本商品的直播数据,其中,直播数据包括视频数据和语音数据;
对直播数据进行识别转换,生成各样本商品的样本描述文本;
从多个样本商品的详情页中提取各样本商品的样本文本数据;
根据多个样本商品的样本文本数据和样本描述文本,构建第一样本集。
具体地,由于商品详情页存在大量的商品细节信息,且整个详情页之间存在上下文语义连贯性,能完整涵盖商品的图文数据。因此,可以从样本商品的详情页中提取各样本商品的样本文本数据,提取样本文本数据的方式包括但不限于OCR技术。并且,还可以从样本商品的直播间中收集样本商品的直播数据,这些直播数据包括视频数据以及语音数据,利用ASR技术对直播数据进行识别转换,生成各样本商品的样本描述文本。在获得样本文本数据以及样本描述文本之后,可以构建第一样本集,其中,样本描述文本可以理解为样本对象携带的样本标签,该样本标签表征真实想要预设处理模型输出的结果。
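A sketch of assembling the first sample set as described; asr_transcribe and ocr_extract are hypothetical placeholders for the ASR and OCR services, which are not named here:

```python
def build_first_sample_set(products, asr_transcribe, ocr_extract):
    """Assemble (sample text data, sample description text) pairs.

    `asr_transcribe` and `ocr_extract` are placeholders for whatever
    speech-recognition / OCR services the pipeline actually calls.
    """
    samples = []
    for p in products:
        # Live-stream voice data -> sample description text (the label).
        description = asr_transcribe(p["live_stream_audio"])
        # Detail-page text and images -> sample text data (the input).
        text_data = p["detail_page_text"] + ocr_extract(p["detail_page_images"])
        samples.append({"text_data": text_data, "description": description})
    return samples
```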
应用本说明书实施例的方案,从多个样本商品的直播间中提取各样本商品的直播数据,其中,直播数据包括视频数据和语音数据,对直播数据进行识别转换,生成各样本商品的样本描述文本,从多个样本商品的详情页中提取各样本商品的样本文本数据,根据多个样本商品的样本文本数据和样本描述文本,构建第一样本集,丰富了第一样本集,使得样本集中的样本文本数据上下文语义连贯,进一步提高了训练后的模型的准确性。
进一步地,在获得多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据之后,可以基于样本视觉属性信息,分别对样本文本数据以及增广文本数据进行处理,确定各样本对象的初始训练样本和增广训练样本,也即,上述,基于多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得文本处理模型的步骤,可以包括以下步骤:
合并每个样本对象的样本文本数据和样本视觉属性信息,确定各样本对象的初始训练样本;
合并每个样本对象的增广文本数据和样本视觉属性信息,确定各样本对象的增广训练 样本;
利用多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得文本处理模型。
具体地,合并每个样本对象的样本文本数据和样本视觉属性信息,确定各样本对象的初始训练样本、合并每个样本对象的增广文本数据和样本视觉属性信息,确定各样本对象的增广训练样本的方式可以是文本拼接,还可以对去重后的文本数据进行拼接。
应用本说明书实施例的方案,合并每个样本对象的样本文本数据和样本视觉属性信息,确定各样本对象的初始训练样本,合并每个样本对象的增广文本数据和样本视觉属性信息,确定各样本对象的增广训练样本,利用多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得文本处理模型。通过综合文本数据和样本视觉属性信息,丰富了样本对象的对象属性,提升了训练后的模型的泛化性。
进一步地,在获得各样本对象的初始训练样本、增广训练样本之后,可以基于初始训练样本、增广训练样本对预设处理模型进行训练,也即,上述利用多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得文本处理模型的步骤,可以包括以下步骤:
提取第一样本对象的第一初始训练样本和第一增广训练样本,其中,第一样本对象为第一样本集中的任一样本对象;
将第一初始训练样本输入预设处理模型中,生成第一预测描述文本,并将第一增广训练样本输入预设处理模型中,生成第二预测描述文本;
根据第一预测描述文本和第一样本描述文本计算第一损失值;
根据第二预测描述文本和第一样本描述文本计算第二损失值;
根据第一预测描述文本和第二预测描述文本计算第三损失值;
基于第一损失值、第二损失值以及第三损失值,调整预设处理模型的模型参数,并返回执行提取第一样本对象的第一初始训练样本和第一增广训练样本的步骤;
在达到第一训练停止条件的情况下,获得完成训练的文本处理模型。
具体地,第一样本描述文本是指真实想要预设处理模型输出的结果,即第一样本描述文本为真实结果。而将第一初始训练样本输入预设处理模型中,生成的第一预测描述文本以及将第一增广训练样本输入预设处理模型中,生成的第二预测描述文本为预设处理模型生成的预测结果,在预测结果和真实结果之间的差异足够小时,即第一损失值以及第二损失值足够小时,说明预测结果足够接近真实结果。
特别地,由于第一增广训练样本为增加了噪音的第一初始训练样本,为了使得预设处理模型对第一初始训练样本和第一增广训练样本的预测结果接近,提高预设处理模型的抗噪音能力,因此可以根据第一预测描述文本和第二预测描述文本计算第三损失值。最终,在获得第一损失值、第二损失值以及第三损失值之后,可以基于第一损失值、第二损失值以及第三损失值,调整预设处理模型的模型参数,并返回执行提取第一样本对象的第一初始训练样本和第一增广训练样本的步骤,在达到第一训练停止条件的情况下,获得完成训练的文本处理模型。
需要说明的是,可以利用交叉熵损失函数计算第一损失值和第二损失值,利用相对熵损失函数(KLD,Kullback-Leibler Divergence)计算第三损失值,第一训练停止条件包括但不限于第一预设阈值、第一预设迭代次数,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
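A PyTorch sketch of one parameter update combining the three losses as described, assuming a Hugging Face-style seq2seq model whose forward returns logits; equal loss weights are an assumption, since no weighting is specified:

```python
import torch
import torch.nn.functional as F

def training_step(model, initial_batch, augmented_batch, labels):
    """CE(initial, label) + CE(augmented, label) + KL between the two
    predicted distributions, as in the three losses above."""
    logits_a = model(**initial_batch).logits     # (B, T, V)
    logits_b = model(**augmented_batch).logits   # (B, T, V)

    V = logits_a.size(-1)
    loss1 = F.cross_entropy(logits_a.reshape(-1, V), labels.reshape(-1))
    loss2 = F.cross_entropy(logits_b.reshape(-1, V), labels.reshape(-1))
    # Relative entropy (KL) pulls the two predicted distributions together,
    # making the model robust to the augmentation noise.
    loss3 = F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a, dim=-1), reduction="batchmean")

    return loss1 + loss2 + loss3   # equal weights assumed
```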
应用本说明书实施例的方案,通过利用交叉熵损失函数,提高了计算第一损失值和第二损失值的效率和准确率,利用相对熵损失函数,提高了计算第三损失值的效率和准确率,进一步使完成训练的文本处理模型更加准确。
本说明书一种可选的实现方式中,为了学习到更好的文本特征,还可以利用各样本对象初始训练样本和样本描述文本,对预设处理模型中的编码器使用互信息最大化损失函数进行约束,也即,预设处理模型包括编码器;上述将第一初始训练样本输入预设处理模型中,生成第一预测描述文本,并将第一增广训练样本输入预设处理模型中,生成第二预测描述文本的步骤之前,还可以包括以下步骤:
将第一初始训练样本输入编码器,生成第一特征向量;
将第一样本描述文本输入编码器,生成第二特征向量;
根据第一特征向量和第二特征向量,计算编码损失值;
基于编码损失值,调整编码器的参数,并返回执行将第一初始训练样本输入编码器,生成第一特征向量的步骤;
在达到第二训练停止条件的情况下,确定完成训练的编码器。
具体地，可以利用以下公式（1）计算编码损失值：

$$\mathcal{L}_{enc}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(z_i^{\top}z_{y_i}\right)}{\sum_{j=1}^{B}\exp\left(z_i^{\top}z_{y_j}\right)}\qquad(1)$$

其中，B是训练过程中一个batch的大小（每次更新参数的时候需要计算B个数据的损失），$z_i=\mathrm{avg}(Z_i)$，avg代表平均池化操作（average pooling），$Z_i$代表第i个初始训练样本输入到编码器之后得到的特征向量；$z_{y_i}=\mathrm{avg}(Z_{y_i})$，$Z_{y_i}$代表第i个样本描述文本输入到编码器之后得到的特征向量。
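A PyTorch rendering of this encoding loss under the variable definitions above, assuming the standard InfoNCE mutual-information estimator with in-batch negatives and a plain dot-product score:

```python
import torch
import torch.nn.functional as F

def encoding_loss(Z_text, Z_desc):
    """InfoNCE over average-pooled encoder features.
    Z_text: (B, T, d) features of the B initial training samples;
    Z_desc: (B, T', d) features of the B paired sample description texts."""
    z_i = Z_text.mean(dim=1)        # z_i = avg(Z_i), shape (B, d)
    z_y = Z_desc.mean(dim=1)        # z_y = avg(Z_y), shape (B, d)
    sim = z_i @ z_y.T               # pairwise scores; diagonal = positive pairs
    targets = torch.arange(sim.size(0))
    # cross_entropy(sim, targets) == -1/B * sum_i log softmax(sim[i])[i]
    return F.cross_entropy(sim, targets)
```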
需要说明的是,第二训练停止条件包括但不限于第二预设阈值、第二预设迭代次数,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
应用本说明书实施例的方案,将第一初始训练样本输入编码器,生成第一特征向量,将第一样本描述文本输入编码器,生成第二特征向量,根据第一特征向量和第二特征向量,计算编码损失值,基于编码损失值,调整编码器的参数,并返回执行将第一初始训练样本输入编码器,生成第一特征向量的步骤,在达到第二训练停止条件的情况下,确定完成训练的编码器,对预设处理模型中的编码器使用了互信息最大化损失函数进行约束,使得预设处理模型可以学习到更好的文本特征,使完成训练的文本处理模型更加准确。
参见图4,图4示出了本说明书一个实施例提供的一种文本生成方法中文本处理模型 的训练流程图,具体包括:
获取多个样本对象,每个样本对象携带样本文本数据和样本描述文本;识别每个样本描述文本,确定各样本对象的样本视觉属性信息;对每个样本文本数据进行数据增广,确定各样本对象的增广文本数据;合并每个样本对象的样本文本数据和样本视觉属性信息,将合并的结果经预设处理模型的编码器和解码器,生成第一预测描述文本;合并每个样本对象的增广文本数据和样本视觉属性信息,将合并的结果经预设处理模型的编码器和解码器,生成第二预测描述文本;根据第一预测描述文本和样本描述文本计算第一损失值;根据第二预测描述文本和样本描述文本计算第二损失值;根据第一预测描述文本和第二预测描述文本计算第三损失值;基于第一损失值、第二损失值以及第三损失值,调整预设处理模型的模型参数,在达到第一训练停止条件的情况下,获得完成训练的文本处理模型。
其中,预设处理模型包括编码器和解码器,将每个样本对象合并后的样本文本数据和样本视觉属性信息输入编码器,生成第一特征向量;将每个样本对象的样本描述文本输入编码器,生成第二特征向量;根据第一特征向量和第二特征向量,计算编码损失值;基于编码损失值,调整编码器的参数,在达到第二训练停止条件的情况下,确定完成训练的编码器。
下面对图1所示实施例中图片分类模型的训练方式进行详细说明。
本说明书一个或多个实施例中,图片分类模型的训练方式,可以包括以下步骤:
获取第二样本集,其中,第二样本集中包括多个样本对象,每个样本对象携带样本图像数据和样本描述文本;
识别每个样本描述文本,确定各样本对象的样本视觉属性信息;
利用多个样本对象的样本图像数据和样本视觉属性信息,训练预设分类模型,获得图片分类模型。
具体地,获取第二样本集、识别每个样本描述文本,确定各样本对象的样本视觉属性信息的具体方式可以参考上述文本处理模型训练方式,本说明书实施例便不再进行赘述。确定各样本对象的样本视觉属性信息考虑了样本对象的显性特征,使得样本对象的对象属性更加全面,提高了训练后的模型的准确性。
进一步地,利用多个样本对象的样本图像数据和样本视觉属性信息,训练预设分类模型,获得图片分类模型的步骤,可以包括以下步骤:
提取第二样本对象的第二样本图像数据和第二样本视觉属性信息,其中,第二样本对象为第二样本集中的任一样本对象;
将第二样本图像数据输入预设分类模型中,获得第二样本对象的预测视觉属性信息;
根据第二样本视觉属性信息和第二样本对象的预测视觉属性信息,计算预设分类模型的分类损失值;
根据分类损失值,调整预设分类模型的模型参数,并返回执行提取第二样本对象的第二样本图像数据和第二样本视觉属性信息的步骤;
在达到第三训练停止条件的情况下,获得完成训练的图片分类模型。
需要说明的是,可以基于第二样本对象的预测视觉属性信息和第二样本视觉属性信息 计算分类损失值,第二样本视觉属性信息表征真实想要预设分类模型输出的结果,而将第二样本图像数据输入预设分类模型,输出的预测视觉属性信息就是预设分类模型的预测结果,在预测结果和真实结果之间的差异足够小时,即分类损失值足够小,说明预测结果足够接近真实结果,此时预设分类模型训练完成,获得完成训练的图片分类模型。
在本说明书实施例中,通过计算分类损失值可以直观地示出预设分类模型的预测结果与真实结果之间的差异,后续可以基于该差异对预设分类模型进行针对性训练,调整预设分类模型的参数,可以有效提高预设分类模型训练的速率及预设分类模型训练的效果。
需要说明的是,第三训练停止条件包括但不限于第三预设阈值、第三预设迭代次数,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
一种可能的实现方式中,可以仅基于分类损失值和第三预设阈值之间的关系,确定是否停止训练。具体地,若分类损失值大于第三预设阈值,则说明第二样本视觉属性信息和第二样本对象的预测视觉属性信息的差异较大,预设分类模型的分类识别能力较差,此时可以调整预设分类模型的模型参数,并返回执行提取第二样本对象的第二样本图像数据和第二样本视觉属性信息的步骤,继续对预设分类模型进行训练,直至分类损失值小于或等于第三预设阈值,说明第二样本视觉属性信息和第二样本对象的预测视觉属性信息的差异较小,停止训练,获得训练后的图片分类模型。
其中,第三预设阈值为分类损失值的临界值,在分类损失值大于第三预设阈值的情况下,说明预设分类模型的预测结果与真实结果之间仍存在一定偏差,仍需调整预设分类模型的模型参数,并对该预设分类模型进行训练;在分类损失值小于或等于第三预设阈值的情况下,说明预设分类模型的预测结果与真实结果的接近程度已经足够,可以停止训练。
另一种可能的实现方式中,除了比较分类损失值和第三预设阈值的关系之外,还可以结合迭代次数,确定当前的预设分类模型是否训练完成。具体的,若分类损失值小于或等于第三预设阈值,则说明第二样本视觉属性信息和第二样本对象的预测视觉属性信息的差异较小,停止训练,获得训练后的图片分类模型,即分类损失值小于或等于第三预设阈值时,无需结合迭代次数即可停止训练以获得训练后的图片分类模型;若分类损失值大于第三预设阈值,判断此刻的迭代次数是否达到第三预设迭代次数,若此刻的迭代次数并未达到第三迭代次数,则调整预设分类模型的模型参数,并返回执行提取第二样本对象的第二样本图像数据和第二样本视觉属性信息的步骤,继续对预设分类模型进行训练,直至达到第三预设迭代次数的情况下,停止迭代,得到训练后的图片分类模型。
其中,第三预设阈值、第三预设迭代次数的数值具体根据实际情况进行选择,本说明书实施例对此不作任何限定。在迭代次数达到第三预设迭代次数时,说明预设分类模型的训练次数已经足够,此时预设分类模型的预测结果与真实结果的接近程度已经足够,可以停止训练。
实际应用中,计算分类损失值的函数有很多,如交叉熵损失函数、L1范数损失函数、最大损失函数、均方误差损失函数、对数损失函数等,具体根据实际情况进行选择,本说明书实施例对此不作任何限定。
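A sketch of the training loop with the two stop conditions discussed above (loss threshold and iteration budget), using cross-entropy as the classification loss; the threshold, iteration cap, and learning rate are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_classifier(model, loader, threshold=0.05, max_iters=10_000, lr=1e-4):
    """Stop when the classification loss falls below the preset threshold,
    or when the preset iteration count is exhausted."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    it = 0
    while it < max_iters:
        for images, attrs in loader:
            logits = model(images)
            loss = F.cross_entropy(logits, attrs)   # classification loss value
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
            if loss.item() <= threshold:   # predictions close enough to labels
                return model
            if it >= max_iters:            # iteration budget exhausted
                break
    return model
```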
应用本说明书实施例的方案,可以根据分类损失值判断预设分类模型的具体训练情况,并在训练未合格的情况下根据分类损失值反向调整预设分类模型的模型参数,以提高该模 型的分类识别能力,训练速率高,且训练效果好。
参见图5,图5示出了本说明书一个实施例提供的一种文本生成方法中图像分类模型的训练流程图,具体包括:
获取多个样本对象,每个样本对象携带样本图像数据和样本描述文本;识别每个样本描述文本,确定各样本对象的样本视觉属性信息;将每个样本对象的样本图像数据输入预设分类模型中,获得预测视觉属性信息;根据样本视觉属性信息和预测视觉属性信息,计算预设分类模型的分类损失值;根据分类损失值,对预设分类模型进行调参,在达到第三训练停止条件的情况下,获得完成训练的图片分类模型。
下述结合附图6,以本说明书提供的文本生成方法在电商直播场景的应用为例,对所述文本生成方法进行进一步说明。其中,图6示出了本说明书一个实施例提供的一种文本生成方法的处理过程流程图,具体包括以下步骤:
步骤602:获取目标商品的详情页数据,其中,详情页数据包括图像数据和文本数据,文本数据包括目标商品的标题、简介、产品参数中的至少一种。
参见图7,图7示出了本说明书一个实施例提供的一种文本生成方法中目标商品详情页的示意图。
目标商品详情页中包括咖啡杯的图像数据,如图中的两个咖啡杯,还包括目标商品的标题:咖啡杯大容量带勺子;目标商品的简介:高釉烧制,安全放心,暖调,为生活带来不一样的体验;目标商品的产品参数:样式丰富,500ml。
步骤604:将图像数据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标商品的视觉属性信息,其中,视觉属性信息表征目标商品的显性特征。
具体地,将图像数据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标商品的视觉属性信息为“白色、暖调棕色、有条纹、无条纹、色彩柔和、简约大方”。
步骤606:合并文本数据和视觉属性信息,确定目标商品的商品属性集。
具体地,合并文本数据和视觉属性信息,确定目标商品的商品属性集为“咖啡杯大容量带勺子、高釉烧制,安全放心,暖调,为生活带来不一样的体验、样式丰富,500ml,白色、暖调棕色、有条纹、无条纹、色彩柔和、简约大方”。
步骤608:将商品属性集输入预先训练的文本处理模型中,经文本处理模型生成目标商品的目标描述文本。
具体地,参见图8,图8示出了本说明书一个实施例提供的一种文本生成方法中客户端的显示界面示意图。客户端显示界面中包括的目标描述文本为“这是一款大容量带勺子的咖啡杯,其容量有500ml。这款咖啡杯样式丰富,有白色、暖调棕色,有条纹款、无条纹款。色彩柔和,简约大方。咖啡杯采用高釉烧制,安全放心,为您带来不一样的生活体验。”。
步骤610:在客户端显示目标描述文本,以使虚拟主播根据目标描述文本对目标商品进行介绍。
应用本说明书实施例的方案,获取目标商品的详情页数据,将详情页数据中的图像数 据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标商品的视觉属性信息,合并详情页数据中的文本数据和视觉属性信息,确定目标商品的商品属性集,将商品属性集输入预先训练的文本处理模型中,经文本处理模型生成目标商品的目标描述文本,在客户端显示目标描述文本,以使虚拟主播根据目标描述文本对目标商品进行介绍,将多模态数据与算法结合,应用到虚拟主播剧本构建流程中,用来指导符合直播场景特色的内容构建,并且支持多源文本数据与图像数据的输入,支持长文本生成,从而实现自动化生成的商品摘要。
与上述方法实施例相对应,本说明书还提供了文本生成装置实施例,图9示出了本说明书一个实施例提供的一种文本生成装置的结构示意图。如图9所示,该装置包括:
获取模块902,被配置为获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;
识别模块904,被配置为基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;
确定模块906,被配置为根据文本数据和视觉属性信息,确定目标对象的对象属性集;
生成模块908,被配置为基于对象属性集,生成目标对象的目标描述文本。
可选地,获取模块902,进一步被配置为对目标对象的图文数据进行监测;在图文数据更新的情况下,获取目标对象的图文数据。
可选地,该装置还包括:调用模块,被配置为在客户端当前展示的对象为目标对象的情况下,从预设数据库中调用目标描述文本,其中,预设数据库用于存储生成的目标描述文本;在客户端显示目标描述文本;或者,对目标描述文本进行音频转换,生成并播放目标描述文本对应的音频数据。
可选地,目标对象包括目标商品;确定模块906,进一步被配置为根据文本数据和视觉属性信息,确定目标商品的商品属性集,其中,文本数据包括目标商品的标题、简介、产品参数中的至少一种;
生成模块908,进一步被配置为基于商品属性集,生成目标商品的目标描述文本。
可选地,生成模块908,进一步被配置为将对象属性集输入预先训练的文本处理模型中,经文本处理模型生成目标对象的目标描述文本;
该装置还包括:文本处理模型训练模块,被配置为获取第一样本集,其中,第一样本集中包括多个样本对象,每个样本对象携带样本文本数据和样本描述文本;识别每个样本描述文本,确定各样本对象的样本视觉属性信息;对每个样本文本数据进行数据增广,确定各样本对象的增广文本数据;基于多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得文本处理模型。
可选地,样本对象包括样本商品;文本处理模型训练模块,进一步被配置为从多个样本商品的直播间中提取各样本商品的直播数据,其中,直播数据包括视频数据和语音数据;对直播数据进行识别转换,生成各样本商品的样本描述文本;从多个样本商品的详情页中提取各样本商品的样本文本数据;根据多个样本商品的样本文本数据和样本描述文本,构建第一样本集。
可选地,文本处理模型训练模块,进一步被配置为合并每个样本对象的样本文本数据和样本视觉属性信息,确定各样本对象的初始训练样本;合并每个样本对象的增广文本数据和样本视觉属性信息,确定各样本对象的增广训练样本;利用多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得文本处理模型。
可选地,文本处理模型训练模块,进一步被配置为提取第一样本对象的第一初始训练样本和第一增广训练样本,其中,第一样本对象为第一样本集中的任一样本对象;将第一初始训练样本输入预设处理模型中,生成第一预测描述文本,并将第一增广训练样本输入预设处理模型中,生成第二预测描述文本;根据第一预测描述文本和第一样本描述文本计算第一损失值;根据第二预测描述文本和第一样本描述文本计算第二损失值;根据第一预测描述文本和第二预测描述文本计算第三损失值;基于第一损失值、第二损失值以及第三损失值,调整预设处理模型的模型参数,并返回执行提取第一样本对象的第一初始训练样本和第一增广训练样本的步骤;在达到第一训练停止条件的情况下,获得完成训练的文本处理模型。
可选地,预设处理模型包括编码器;该装置还包括:编码器训练模块,被配置为将第一初始训练样本输入编码器,生成第一特征向量;将第一样本描述文本输入编码器,生成第二特征向量;根据第一特征向量和第二特征向量,计算编码损失值;基于编码损失值,调整编码器的参数,并返回执行将第一初始训练样本输入编码器,生成第一特征向量的步骤;在达到第二训练停止条件的情况下,确定完成训练的编码器。
可选地,识别模块904,进一步被配置为将图像数据输入预先训练的图片分类模型中,经图片分类模型的分类识别,获得目标对象的视觉属性信息;
该装置还包括:图片分类模型训练模块,被配置为获取第二样本集,其中,第二样本集中包括多个样本对象,每个样本对象携带样本图像数据和样本描述文本;识别每个样本描述文本,确定各样本对象的样本视觉属性信息;利用多个样本对象的样本图像数据和样本视觉属性信息,训练预设分类模型,获得图片分类模型。
可选地,图片分类模型训练模块,进一步被配置为提取第二样本对象的第二样本图像数据和第二样本视觉属性信息,其中,第二样本对象为第二样本集中的任一样本对象;将第二样本图像数据输入预设分类模型中,获得第二样本对象的预测视觉属性信息;根据第二样本视觉属性信息和第二样本对象的预测视觉属性信息,计算预设分类模型的分类损失值;根据分类损失值,调整预设分类模型的模型参数,并返回执行提取第二样本对象的第二样本图像数据和第二样本视觉属性信息的步骤;在达到第三训练停止条件的情况下,获得完成训练的图片分类模型。
应用本说明书实施例的方案,获取目标对象的图文数据,其中,图文数据包括图像数据和文本数据;基于图像数据,识别目标对象的视觉属性信息,其中,视觉属性信息表征目标对象的显性特征;根据文本数据和视觉属性信息,确定目标对象的对象属性集;基于对象属性集,生成目标对象的目标描述文本。通过获取目标对象的多模态图文数据,确定目标对象的视觉属性信息,考虑了目标对象的显性特征,使得目标对象的对象属性更加全面,并且,根据文本数据和视觉属性信息,确定目标对象的对象属性集,综合了目标对象的文本数据以及视觉属性信息,使得生成的目标描述文本更加连贯,进一步提高了目标描 述文本的准确性。
上述为本实施例的一种文本生成装置的示意性方案。需要说明的是,该文本生成装置的技术方案与上述的文本生成方法的技术方案属于同一构思,文本生成装置的技术方案未详细描述的细节内容,均可以参见上述文本生成方法的技术方案的描述。
图10示出了本说明书一个实施例提供的一种计算设备的结构框图。该计算设备1000的部件包括但不限于存储器1010和处理器1020。处理器1020与存储器1010通过总线1030相连接,数据库1050用于保存数据。
计算设备1000还包括接入设备1040,接入设备1040使得计算设备1000能够经由一个或多个网络1060通信。这些网络的示例包括公用交换电话网(PSTN,Public Switched Telephone Network)、局域网(LAN,Local Area Network)、广域网(WAN,Wide Area Network)、个域网(PAN,Personal Area Network)或诸如因特网的通信网络的组合。接入设备1040可以包括有线或无线的任何类型的网络接口(例如,网络接口卡(NIC,Network Interface Card))中的一个或多个,诸如IEEE802.11无线局域网(WLAN,Wireless Local Area Networks)无线接口、全球微波互联接入(Wi-MAX,World Interoperability for Microwave Access)接口、以太网接口、通用串行总线(USB,Universal Serial Bus)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC,Near Field Communication)接口,等等。
在本说明书的一个实施例中,计算设备1000的上述部件以及图10中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图10所示的计算设备结构框图仅仅是出于示例的目的,而不是对本说明书范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备1000可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备1000还可以是移动式或静止式的服务器。
其中,处理器1020用于执行如下计算机可执行指令,该计算机可执行指令被处理器执行时实现上述文本生成方法的步骤。
上述为本实施例的一种计算设备的示意性方案。需要说明的是,该计算设备的技术方案与上述的文本生成方法的技术方案属于同一构思,计算设备的技术方案未详细描述的细节内容,均可以参见上述文本生成方法的技术方案的描述。
本说明书一实施例还提供一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现上述文本生成方法的步骤。
上述为本实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的文本生成方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述文本生成方法的技术方案的描述。
本说明书一实施例还提供一种计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行上述文本生成方法的步骤。
上述为本实施例的一种计算机程序的示意性方案。需要说明的是,该计算机程序的技术方案与上述的文本生成方法的技术方案属于同一构思,计算机程序的技术方案未详细描述的细节内容,均可以参见上述文本生成方法的技术方案的描述。
上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
所述计算机指令包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本说明书实施例并不受所描述的动作顺序的限制,因为依据本说明书实施例,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本说明书实施例所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上公开的本说明书优选实施例只是用于帮助阐述本说明书。可选实施例并没有详尽叙述所有的细节,也不限制该发明仅为所述的具体实施方式。显然,根据本说明书实施例的内容,可作很多的修改和变化。本说明书选取并具体描述这些实施例,是为了更好地解释本说明书实施例的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本说明书。本说明书仅受权利要求书及其全部范围和等效物的限制。

Claims (14)

  1. 一种文本生成方法,包括:
    获取目标对象的图文数据,其中,所述图文数据包括图像数据和文本数据;
    基于所述图像数据,识别所述目标对象的视觉属性信息,其中,所述视觉属性信息表征所述目标对象的显性特征;
    根据所述文本数据和所述视觉属性信息,确定所述目标对象的对象属性集;
    基于所述对象属性集,生成所述目标对象的目标描述文本。
  2. 根据权利要求1所述的方法,所述获取目标对象的图文数据的步骤,包括:
    对所述目标对象的图文数据进行监测;
    在所述图文数据更新的情况下,获取所述目标对象的图文数据。
  3. 根据权利要求1或2所述的方法,所述基于所述对象属性集,生成所述目标对象的目标描述文本的步骤之后,还包括:
    在客户端当前展示的对象为所述目标对象的情况下,从预设数据库中调用所述目标描述文本,其中,所述预设数据库用于存储生成的所述目标描述文本;
    在所述客户端显示所述目标描述文本;或者,对所述目标描述文本进行音频转换,生成并播放所述目标描述文本对应的音频数据。
  4. 根据权利要求1所述的方法,所述目标对象包括目标商品;所述根据所述文本数据和所述视觉属性信息,确定所述目标对象的对象属性集的步骤,包括:
    根据所述文本数据和所述视觉属性信息,确定所述目标商品的商品属性集,其中,所述文本数据包括所述目标商品的标题、简介、产品参数中的至少一种;
    所述基于所述对象属性集,生成所述目标对象的目标描述文本的步骤,包括:
    基于所述商品属性集,生成所述目标商品的目标描述文本。
  5. 根据权利要求1所述的方法,所述基于所述对象属性集,生成所述目标对象的目标描述文本的步骤,包括:
    将所述对象属性集输入预先训练的文本处理模型中,经所述文本处理模型生成所述目标对象的目标描述文本;
    其中,所述文本处理模型的训练方式,包括:
    获取第一样本集,其中,所述第一样本集中包括多个样本对象,每个样本对象携带样本文本数据和样本描述文本;
    识别每个样本描述文本,确定各样本对象的样本视觉属性信息;
    对每个样本文本数据进行数据增广,确定所述各样本对象的增广文本数据;
    基于所述多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得所述文本处理模型。
  6. 根据权利要求5所述的方法,所述样本对象包括样本商品;所述获取第一样本集的步骤,包括:
    从多个样本商品的直播间中提取各样本商品的直播数据,其中,所述直播数据包括视频数据和语音数据;
    对所述直播数据进行识别转换,生成所述各样本商品的样本描述文本;
    从所述多个样本商品的详情页中提取各样本商品的样本文本数据;
    根据所述多个样本商品的样本文本数据和样本描述文本,构建所述第一样本集。
  7. 根据权利要求5所述的方法,所述基于所述多个样本对象的样本视觉属性信息、样本文本数据以及增广文本数据,训练预设处理模型,获得所述文本处理模型的步骤,包括:
    合并每个样本对象的样本文本数据和样本视觉属性信息,确定各样本对象的初始训练样本;
    合并每个样本对象的增广文本数据和样本视觉属性信息,确定各样本对象的增广训练样本;
    利用所述多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得所述文本处理模型。
  8. 根据权利要求7所述的方法,所述利用所述多个样本对象的初始训练样本、增广训练样本以及样本描述文本,训练预设处理模型,获得所述文本处理模型的步骤,包括:
    提取第一样本对象的第一初始训练样本和第一增广训练样本,其中,所述第一样本对象为所述第一样本集中的任一样本对象;
    将所述第一初始训练样本输入预设处理模型中,生成第一预测描述文本,并将所述第一增广训练样本输入预设处理模型中,生成第二预测描述文本;
    根据所述第一预测描述文本和第一样本描述文本计算第一损失值;
    根据所述第二预测描述文本和所述第一样本描述文本计算第二损失值;
    根据所述第一预测描述文本和所述第二预测描述文本计算第三损失值;
    基于所述第一损失值、所述第二损失值以及所述第三损失值,调整所述预设处理模型的模型参数,并返回执行所述提取第一样本对象的第一初始训练样本和第一增广训练样本的步骤;
    在达到第一训练停止条件的情况下,获得完成训练的文本处理模型。
  9. 根据权利要求8所述的方法,所述预设处理模型包括编码器;所述将所述第一初始训练样本输入预设处理模型中,生成第一预测描述文本,并将所述第一增广训练样本输入预设处理模型中,生成第二预测描述文本的步骤之前,还包括:
    将所述第一初始训练样本输入所述编码器,生成第一特征向量;
    将所述第一样本描述文本输入所述编码器,生成第二特征向量;
    根据所述第一特征向量和所述第二特征向量,计算编码损失值;
    基于所述编码损失值,调整所述编码器的参数,并返回执行所述将所述第一初始训练样本输入所述编码器,生成第一特征向量的步骤;
    在达到第二训练停止条件的情况下,确定完成训练的所述编码器。
  10. 根据权利要求1所述的方法,所述基于所述图像数据,识别所述目标对象的视觉属性信息的步骤,包括:
    将所述图像数据输入预先训练的图片分类模型中,经所述图片分类模型的分类识别,获得所述目标对象的视觉属性信息;
    其中,所述图片分类模型的训练方式,包括:
    获取第二样本集,其中,所述第二样本集中包括多个样本对象,每个样本对象携带样本图像数据和样本描述文本;
    识别每个样本描述文本,确定各样本对象的样本视觉属性信息;
    利用所述多个样本对象的样本图像数据和样本视觉属性信息,训练预设分类模型,获得所述图片分类模型。
  11. 根据权利要求10所述的方法,所述利用所述多个样本对象的样本图像数据和样本视觉属性信息,训练预设分类模型,获得所述图片分类模型的步骤,包括:
    提取第二样本对象的第二样本图像数据和第二样本视觉属性信息,其中,所述第二样本对象为所述第二样本集中的任一样本对象;
    将所述第二样本图像数据输入预设分类模型中,获得所述第二样本对象的预测视觉属性信息;
    根据所述第二样本视觉属性信息和所述第二样本对象的预测视觉属性信息,计算所述预设分类模型的分类损失值;
    根据所述分类损失值,调整所述预设分类模型的模型参数,并返回执行所述提取第二样本对象的第二样本图像数据和第二样本视觉属性信息的步骤;
    在达到第三训练停止条件的情况下,获得完成训练的图片分类模型。
  12. 一种文本生成装置,包括:
    获取模块,被配置为获取目标对象的图文数据,其中,所述图文数据包括图像数据和文本数据;
    识别模块,被配置为基于所述图像数据,识别所述目标对象的视觉属性信息,其中,所述视觉属性信息表征所述目标对象的显性特征;
    确定模块,被配置为根据所述文本数据和所述视觉属性信息,确定所述目标对象的对象属性集;
    生成模块,被配置为基于所述对象属性集,生成所述目标对象的目标描述文本。
  13. 一种计算设备,包括:
    存储器和处理器;
    所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,该计算机可执行指令被处理器执行时实现权利要求1至11任意一项所述文本生成方法的步骤。
  14. 一种计算机可读存储介质,其存储有计算机可执行指令,该计算机可执行指令被处理器执行时实现权利要求1至11任意一项所述文本生成方法的步骤。
PCT/CN2023/114514 2022-08-30 2023-08-23 文本生成方法以及装置 WO2024046189A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211048016.0A CN115496550A (zh) 2022-08-30 2022-08-30 文本生成方法以及装置
CN202211048016.0 2022-08-30

Publications (1)

Publication Number Publication Date
WO2024046189A1 true WO2024046189A1 (zh) 2024-03-07

Family

ID=84466461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/114514 WO2024046189A1 (zh) 2022-08-30 2023-08-23 文本生成方法以及装置

Country Status (2)

Country Link
CN (1) CN115496550A (zh)
WO (1) WO2024046189A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496550A (zh) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 文本生成方法以及装置
CN116778011B (zh) * 2023-05-22 2024-05-24 阿里巴巴(中国)有限公司 图像生成方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005012A (zh) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 多模态预训练模型的训练方法、装置、设备及存储介质
CN114168777A (zh) * 2020-09-10 2022-03-11 阿里巴巴集团控股有限公司 图像数据的处理方法、装置、存储介质和处理器
CN115496550A (zh) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 文本生成方法以及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114168777A (zh) * 2020-09-10 2022-03-11 阿里巴巴集团控股有限公司 图像数据的处理方法、装置、存储介质和处理器
CN114005012A (zh) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 多模态预训练模型的训练方法、装置、设备及存储介质
CN115496550A (zh) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 文本生成方法以及装置

Also Published As

Publication number Publication date
CN115496550A (zh) 2022-12-20

Similar Documents

Publication Publication Date Title
WO2024046189A1 (zh) 文本生成方法以及装置
CN104735468B (zh) 一种基于语义分析将图像合成新视频的方法及系统
EP3612926B1 (en) Parsing electronic conversations for presentation in an alternative interface
WO2016197767A2 (zh) 一种表情输入方法、装置、终端和计算机可读存储介质
WO2022134701A1 (zh) 视频处理方法及装置
WO2023065617A1 (zh) 基于预训练模型和召回排序的跨模态检索系统及方法
US20220335079A1 (en) Method for generating virtual image, device and storage medium
US11158349B2 (en) Methods and systems of automatically generating video content from scripts/text
WO2022134698A1 (zh) 视频处理方法及装置
CN114880441B (zh) 视觉内容生成方法、装置、系统、设备和介质
CN112348111B (zh) 视频中的多模态特征融合方法、装置、电子设备及介质
WO2024045474A1 (zh) 图像文案的生成方法、设备及计算机存储介质
CN112632244A (zh) 一种人机通话的优化方法、装置、计算机设备及存储介质
US20230107213A1 (en) Method of generating virtual character, electronic device, and storage medium
US20230419716A1 (en) Image processing method, apparatus, and device, storage medium, and computer program product
WO2019085625A1 (zh) 表情图片推荐方法及设备
CN113705315A (zh) 视频处理方法、装置、设备及存储介质
CN116611496A (zh) 文本到图像的生成模型优化方法、装置、设备及存储介质
CN115687664A (zh) 中文图文检索方法及中文图文检索的数据处理方法
CN114373028A (zh) 生成图片的方法及装置、电子设备
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN117011875A (zh) 多媒体页面的生成方法、装置、设备、介质和程序产品
US20220375223A1 (en) Information generation method and apparatus
US20230111633A1 (en) Lead conversion using conversational virtual avatar
CN117009577A (zh) 一种视频数据处理方法、装置、设备及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859222

Country of ref document: EP

Kind code of ref document: A1