CN115496550A - Text generation method and device

Text generation method and device

Info

Publication number
CN115496550A
Authority
CN
China
Prior art keywords
sample
text
target
data
attribute information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211048016.0A
Other languages
Chinese (zh)
Inventor
赵中州
宋雪萌
聂礼强
井立强
刘萌
关惟俐
周伟
陈海青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211048016.0A priority Critical patent/CN115496550A/en
Publication of CN115496550A publication Critical patent/CN115496550A/en
Priority to PCT/CN2023/114514 priority patent/WO2024046189A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G06Q 30/0601 Electronic shopping [e-shopping]
    • G06Q 30/0623 Item investigation
    • G06Q 30/0625 Directed, with specific intent or strategy
    • G06Q 30/0627 Directed, with specific intent or strategy using item specifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of this specification provide a text generation method and a text generation device. The text generation method comprises the following steps: acquiring image-text data of a target object, wherein the image-text data comprises image data and text data; identifying visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes explicit features of the target object; determining an object attribute set of the target object according to the text data and the visual attribute information; and generating a target description text of the target object based on the object attribute set. By acquiring the multi-modal image-text data of the target object and determining its visual attribute information, the explicit features of the target object are taken into account, so that the object attributes of the target object are more comprehensive; and by determining the object attribute set of the target object from both the text data and the visual attribute information, the text data and the visual attribute information of the target object are integrated, so that the generated target description text is more coherent and its accuracy is further improved.

Description

Text generation method and device
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a text generation method. One or more embodiments of this specification also relate to a text generation apparatus, a computing device, and a computer-readable storage medium.
Background
With the development of computer technology, the generation of text summaries gradually becomes a hot topic in the field of natural language processing. Taking an e-commerce scene as an example, in the e-commerce scene, the description of each commodity is generally composed of abundant and diverse data, and in order to better describe the characteristics of the commodity and attract users to purchase, a text abstract corresponding to the commodity needs to be generated so that the users can quickly and accurately know the information of the commodity.
Currently, a live-stream anchor typically has to study the merchandise information and summarize the salient features of the merchandise manually. However, because the number of goods in the e-commerce field is massive, obtaining text summaries of goods by manual summarization requires a great deal of manpower and is costly, and manual work inevitably introduces many uncertain factors, so the accuracy of the generated text summaries is poor. Therefore, an accurate text generation scheme is urgently needed.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a text generation method. One or more embodiments of the present specification also relate to a text generation apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical deficiencies in the prior art.
According to a first aspect of embodiments of the present specification, there is provided a text generation method including:
acquiring image-text data of a target object, wherein the image-text data comprises image data and text data;
identifying visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object;
determining an object attribute set of the target object according to the text data and the visual attribute information;
and generating a target description text of the target object based on the object attribute set.
According to a second aspect of embodiments of the present specification, there is provided a text generation apparatus including:
the image-text data acquisition module is configured to acquire image-text data of a target object, wherein the image-text data comprises image data and text data;
an identification module configured to identify visual attribute information of a target object based on image data, wherein the visual attribute information characterizes an explicit feature of the target object;
a determination module configured to determine an object property set of the target object based on the text data and the visual property information;
a generating module configured to generate a target description text of the target object based on the object attribute set.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, which when executed by the processor, implement the steps of the text generation method described above.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text generation method described above.
According to a fifth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the text generation method described above.
In a text generation method provided in an embodiment of this specification, image-text data of a target object is obtained, where the image-text data includes image data and text data; visual attribute information of the target object is identified based on the image data, where the visual attribute information characterizes explicit features of the target object; an object attribute set of the target object is determined according to the text data and the visual attribute information; and a target description text of the target object is generated based on the object attribute set. By acquiring the multi-modal image-text data of the target object and determining its visual attribute information, the explicit features of the target object are taken into account, so that the object attributes of the target object are more comprehensive; and by determining the object attribute set of the target object from both the text data and the visual attribute information, the text data and the visual attribute information of the target object are integrated, so that the generated target description text is more coherent and its accuracy is further improved.
Drawings
FIG. 1 is a block diagram of a text generation system provided in one embodiment of the present specification;
FIG. 2 is a block diagram of another text generation system provided by one embodiment of the present specification;
FIG. 3 is a flow diagram of a method for generating text according to one embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating training of a text processing model in a method for generating a text according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating training of an image classification model in a text generation method according to an embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating a process of a text generation method according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a detail page of a target product in a text generation method according to an embodiment of the present specification;
fig. 8 is a schematic diagram of a display interface of a client in a text generation method according to an embodiment of the present specification;
fig. 9 is a schematic structural diagram of a text generation apparatus according to an embodiment of the present specification;
fig. 10 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be implemented in many ways other than those specifically set forth herein, and those skilled in the art will appreciate that the present description is susceptible to similar generalizations without departing from the scope of the description, and thus is not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Modality: the form in which data exists, such as natural language, pictures, and the like.
Commodity summarization: generating, based on information about a commodity such as its description and appearance, a short text abstract containing the salient information of the commodity.
Natural language generation: giving the computer human-like abilities of expression and writing, that is, automatically generating a piece of high-quality natural language text through a planning process, based on some key information and its representation inside the machine.
BART (Bidirectional and Auto-Regressive Transformers): a model with both contextual information and autoregressive characteristics, which takes natural language as input and generates natural language.
Automatic Speech Recognition (ASR): a technology for converting human speech into the corresponding text.
Part-of-speech tagging: a technique for tagging the part of speech of each word in a sentence.
Mutual information: a measure of the dependency between two random variables.
In the present specification, a text generation method is provided, and the present specification relates to a text generation apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
With the development of computer technology, the generation of text summaries gradually becomes a hot topic in the field of natural language processing. Taking an e-commerce scenario as an example, in the e-commerce scenario, the description of each commodity is generally composed of abundant and diverse data, such as a title, a detailed text description, and an image of the commodity. In order to better describe the characteristics of the commodity and attract the user to purchase, a text abstract corresponding to the commodity needs to be generated so that the user can quickly and accurately know the information of the commodity.
Currently, a live-stream anchor typically has to study the merchandise information and summarize the salient features of the merchandise manually. However, because the number of goods in the e-commerce field is massive, obtaining text abstracts of goods by manual compilation requires a great deal of manpower and is costly; manual compilation inevitably introduces many uncertain factors, and most such text abstracts are simply spliced together, so the generated text abstracts are poor in accuracy and costly to revise. Therefore, an accurate text generation scheme is urgently needed.
In order to improve the efficiency and accuracy of text generation, this specification provides a scheme for generating description text based on multi-modal data: given the multi-modal image-text data of a target object, a description text that accurately summarizes the characteristics of the target object and highlights its advantages is generated automatically, end to end.
In specific implementation, the text generation method provided in the embodiments of this specification obtains image-text data of a target object, where the image-text data includes image data and text data; identifies visual attribute information of the target object based on the image data, where the visual attribute information characterizes explicit features of the target object; determines an object attribute set of the target object according to the text data and the visual attribute information; and generates a target description text of the target object based on the object attribute set. By acquiring the multi-modal image-text data of the target object and determining its visual attribute information, the explicit features of the target object are taken into account, so that the object attributes of the target object are more comprehensive; and by determining the object attribute set of the target object from both the text data and the visual attribute information, the text data and the visual attribute information of the target object are integrated, so that the generated target description text is more coherent and its accuracy is further improved.
Referring to fig. 1, fig. 1 shows a framework diagram of a text generation system provided in an embodiment of the present specification, where the text generation system includes a server and a client:
The client: sends the image-text data of a target object to the server, wherein the image-text data comprises image data and text data.
The server: acquires the image-text data of the target object; identifies visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes explicit features of the target object; determines an object attribute set of the target object according to the text data and the visual attribute information; generates a target description text of the target object based on the object attribute set; and sends the target description text to the client so that the client can display it.
The client: receives and displays the target description text sent by the server, so that the user can introduce the target object according to the target description text.
It should be noted that the text generation method provided in the embodiment of the present specification is generally executed by the server, but in other embodiments of the present specification, the client may also have a similar function to the server, so as to execute the text generation method provided in the embodiment of the present specification. In other embodiments, the text generation method provided by the embodiments of the present specification may also be executed by the client and the server together.
By applying the scheme of the embodiments of this specification, image-text data of a target object is obtained, where the image-text data comprises image data and text data; visual attribute information of the target object is identified based on the image data, where the visual attribute information characterizes explicit features of the target object; an object attribute set of the target object is determined according to the text data and the visual attribute information; and a target description text of the target object is generated based on the object attribute set. By acquiring the multi-modal image-text data of the target object and determining its visual attribute information, the explicit features of the target object are taken into account, so that the object attributes of the target object are more comprehensive; and by determining the object attribute set of the target object from both the text data and the visual attribute information, the text data and the visual attribute information of the target object are integrated, so that the generated target description text is more coherent and its accuracy is further improved.
The solutions provided in one or more embodiments of the present specification may be applied to a text generation scenario, such as an e-commerce live broadcast scenario, a conference scenario, an education scenario, and the like, and are specifically selected according to actual situations, and the embodiments of the present specification do not limit this.
Referring to fig. 2, fig. 2 shows a frame diagram of another text generation system provided by an embodiment of the present specification, which may include a server 100 and a plurality of clients 200. Communication connection can be established among a plurality of clients 200 through the server 100, in a text generation scene, the server 100 is used for providing text generation service among the plurality of clients 200, and the plurality of clients 200 can be respectively used as a sending end or a receiving end to realize real-time communication through the server 100.
Through the client 200, a user can interact with the server 100 to receive data sent by other clients 200, to send data to other clients 200, and so on. In the text generation scenario, a user may publish a data stream to the server 100 through the client 200, and the server 100 pushes the data stream to the clients subscribing to it. The data stream may be, for example, image-text data. In an e-commerce live broadcast scenario, a user can acquire the image-text data of a target commodity in real time through a client and send it to the server; the server generates a corresponding commodity description text from the image-text data sent by the client and pushes the commodity description text to all live broadcast rooms featuring the commodity, so that the anchor can introduce the target commodity according to the commodity description text. In a conference scenario, for example, a participating user can acquire image-text data in real time through a client and send it to the server, and the server can process the image-text data to generate an abstract text and push it to the clients of the other participants.
Wherein, the connection between the client 200 and the server 100 is established through a network. The network provides a medium for communication links between clients and servers. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The data transmitted by the client 200 may need to be encoded, transcoded, compressed, etc. before being distributed to the server 100.
The client 200 may be a browser, an APP (application), a web application such as an H5 (HyperText Markup Language version 5) application, a light application (also referred to as an applet), or a cloud application, and may be developed based on a Software Development Kit (SDK) of a corresponding service provided by the server, such as an RTC (Real-Time Communication) SDK. The client 200 may be deployed in an electronic device, where it runs on the device itself or inside certain apps on the device. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer, or a personal computer. Various other types of applications may also be deployed in the electronic device, such as human-machine conversation applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and so forth.
The server 100 may include a server providing various services, such as a server providing communication services for a plurality of clients, a server for background training that provides support for models used on the clients, a server that processes data sent by the clients, and the like.
It should be noted that the server 100 may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server may also be a cloud server of basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
Referring to fig. 3, fig. 3 shows a flowchart of a text generation method provided in an embodiment of the present specification, which specifically includes the following steps:
step 302: and acquiring image-text data of the target object, wherein the image-text data comprises image data and text data.
In one or more embodiments of the present specification, with the development of computer technology, the description form for the target object is more and more enriched, for example, the description of the commodity includes a title, a detailed text description, a commodity display image, and the like. In order to accurately generate the description text of the target object, multi-modal data of the target object may be obtained, the multi-modal data may include image data and text data, and the target description text of the target object is further generated according to the multi-modal image-text data.
Specifically, the target object refers to an object that needs to generate the target description text, and may also be understood as an object waiting for generation of the target description text, including but not limited to a commodity, a person, a landscape, a scenic spot, and the like. The graphic data of the target object refers to image data and text data including information related to the target object. The image data may be a layout, photograph, design drawing, etc. of the target object, and the text data may be a name, structural attributes, detail information, process information, etc. of the target object.
In practical applications, there are various ways to obtain the image-text data of the target object, which are specifically selected according to actual situations, and this is not limited in this embodiment of the present specification.
In an optional implementation manner of this specification, the image-text data of the target object may be obtained in a case where a text generation instruction is received. In one possible mode, the text generation instruction carries image-text data which is input by a user and covers the target object information; in another possible mode, the text generation instruction includes a unique identifier of the target object, and according to the unique identifier, the target object can be determined, and the image-text data of the target object is further acquired.
Illustratively, taking a target commodity as the target object: since the detail page of a commodity contains a large amount of commodity detail information and the whole detail page is contextually and semantically consistent, it covers the target commodity's information fairly completely. Therefore, after a text generation instruction is received, the image-text data of the target object can be acquired from the detail page of the target commodity according to the unique identifier of the target object carried in the instruction.
In another optional implementation manner of this specification, since the image-text data of the target object usually changes, the image-text data of the target object may be monitored, and when the image-text data changes, the image-text data of the target object is obtained in real time to generate a target description text of the target object, so that a user may immediately query the target description text when the user needs the target description text. That is, the step of obtaining the image-text data of the target object may include the following steps:
monitoring image-text data of a target object;
and acquiring the image-text data of the target object under the condition of updating the image-text data.
In this embodiment of the present specification, the updating of the graphics and text data includes addition, deletion, replacement, change, and the like, and in this embodiment of the present specification, the graphics and text data of the target object may be considered to be updated as long as the graphics and text data of the target object changes.
Further, since generating the target description text takes a certain amount of time, in the embodiments of this specification the target description text of the target object may be generated in an offline timing manner, that is, the target description text of the target object is updated on a regular schedule.
It should be noted that before the target description text is updated regularly, whether the image-text data of the target object is changed or not can be detected, that is, when the timing task is started, the image-text data of the current target object is compared with the image-text data of the target object updated last time. If the image-text data are changed, triggering a timing task, acquiring the image-text data of the target object, and generating a target description text based on the image-text data; and if the image-text data is not changed, the description text of the target object is not updated.
By applying the scheme of the embodiment of the specification, the image-text data of the target object is monitored, and the image-text data of the target object is obtained under the condition of updating the image-text data, so that the target description text of the target object is actively generated, the time for a user to obtain the target description text is saved, and the user experience is improved.
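For illustration only, the following Python sketch shows one way the monitoring and offline timing check described above could be implemented; the function names and the fingerprint-based comparison are assumptions of this sketch, not requirements of this specification.

    import hashlib
    import json

    def fingerprint(image_text_data: dict) -> str:
        # Hash a stable serialisation of the image-text data so that any
        # addition, deletion, replacement or change alters the fingerprint.
        payload = json.dumps(image_text_data, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def timed_task(current_data: dict, last_fingerprint: str) -> tuple:
        # Compare the current image-text data with the data used at the last
        # update; only signal regeneration of the description text if it changed.
        current = fingerprint(current_data)
        if current != last_fingerprint:
            return current, True        # updated: regenerate the target description text
        return last_fingerprint, False  # unchanged: keep the existing description text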
Step 304: visual attribute information of the target object is identified based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object.
In one or more embodiments of the present disclosure, after the image-text data of the target object is obtained, the visual attribute information of the target object may be further identified based on the image data included in the image-text data, and by generating the visual attribute information, which is equivalent to converting the image data into text data, the multi-modal data of the target object is unified, and modal heterogeneity among multiple modalities is reduced.
Specifically, the visual attribute information represents explicit features of the target object, where an explicit feature refers to a feature visibly exhibited by the target object; it may be a noun-type feature such as the colour or shape of the target object, or an adjective-type feature such as beauty or versatility, selected according to the actual situation, which is not limited in the embodiments of this specification.
In practical applications, there are various ways of identifying visual attribute information of a target object based on image data, which are specifically selected according to actual situations, and this is not limited in this embodiment of the present specification.
In an alternative implementation manner of this specification, since the image data may include text data of the target object, the text data in the image data may be obtained by using Optical Character Recognition (OCR). Visual attribute information in the image data may also be obtained using an image color recognition tool.
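As an illustration of this optional implementation, the sketch below uses the pytesseract OCR library to read text printed on an image of the target object; the choice of library and language pack is an assumption, since the specification only requires some OCR technology.

    from PIL import Image
    import pytesseract  # assumes the Tesseract OCR engine is installed locally

    def extract_text_from_image(image_path: str) -> str:
        # Read any text printed on the image (for example, on a product picture);
        # the Simplified Chinese language pack is chosen purely as an example.
        return pytesseract.image_to_string(Image.open(image_path), lang="chi_sim")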
In another optional implementation manner of this specification, the visual attribute information of the target object may be identified by using a pre-trained image classification model, that is, the step of identifying the visual attribute information of the target object based on the image data may include the following steps:
and inputting the image data into a pre-trained image classification model, and obtaining the visual attribute information of the target object through the classification and identification of the image classification model.
Specifically, the pre-trained image classification model is a model generated by training a preset classification model, where the preset classification model refers to a model capable of realizing classification, such as a Swin Transformer model, a Residual neural Network (ResNet), and an image classification transformation model (Vit), and is specifically selected according to an actual situation, and this is not limited in this embodiment of the present specification.
Taking the image classification transformation model (ViT) as an example: the image data is input into the model, but unlike a traditional convolutional neural network, which takes the picture as a whole, the image data is first divided into blocks (patches), for example into 9 patches. The size of each patch may be specified, for example 16 × 16. Each patch is then passed through an embedding layer to obtain a corresponding vector (token), so the 9 patches yield 9 vectors; an additional vector used for classification, with the same dimension as the other 9 vectors, is prepended to them, and position information is added. All vectors are then input into a Transformer encoder, the encoder blocks being stacked L times, and the output corresponding to the classification token is fed into a Multilayer Perceptron (MLP) head to obtain the final classification result.
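The following Python sketch illustrates this classification step with a ViT model from the transformers library; the checkpoint name and its label space are placeholders, whereas the model of this embodiment would be trained on visual attribute labels (colour, shape, style, and so on).

    import torch
    from PIL import Image
    from transformers import ViTImageProcessor, ViTForImageClassification

    # Placeholder checkpoint; a model fine-tuned on visual-attribute labels would be used in practice.
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

    def classify_visual_attribute(image_path: str) -> str:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")  # split into patches and normalise
        with torch.no_grad():
            logits = model(**inputs).logits                    # classification-token output through the MLP head
        return model.config.id2label[int(logits.argmax(-1))]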
By applying the scheme of the embodiment of the specification, the image data is input into the pre-trained image classification model, and the visual attribute information of the target object is obtained through the classification and identification of the image classification model, so that the efficiency and the accuracy of obtaining the visual attribute information of the target object are improved, and the subsequently generated target description text is more accurate.
It should be noted that after the visual attribute information of the target object is obtained, the visual attribute information of the target object and the text data may be compared, and the text data of the target object may be modified according to the comparison result.
Illustratively, the text data of the target object is "red women's clothing, showing youth", and the visual attribute information of the target object is "rose red, makes the complexion look fair". The text data and the visual attribute information are compared, the "red" in the text data of the target object is replaced by "rose red", and the modified text data is "rose-red women's clothing, showing youth".
Step 306: and determining an object attribute set of the target object according to the text data and the visual attribute information.
In one or more embodiments of the present specification, after the image-text data of the target object is obtained and the visual attribute information of the target object is identified based on the image data, an object attribute set of the target object may be further determined according to the text data and the visual attribute information, and the object attributes of the target object are enriched by integrating the text data and the visual attribute information, so that the generated target description text is more coherent and accurate.
Specifically, the object attribute set refers to a set formed by object attribute information of a plurality of target objects, the object attribute information includes text data of the target objects and visual attribute information, and the object attribute information can be understood as text information completely describing the attributes of the target objects.
In practical application, the text data and the visual attribute information can be merged and spliced to determine an object attribute set of the target object. For example, the text data of the target object is "orange cat sofa headrest", the visual attribute information is "orange high-grade feeling", and the text data of the target object and the visual attribute information are spliced, so that the content included in the object attribute set of the target object can be determined to be "orange cat sofa headrest orange high-grade feeling".
Further, in order to reduce the data processing amount and improve the text generation efficiency, when the text data and the visual attribute information are spliced, a union of the text data and the visual attribute information may be taken, and referring to the above example, the determined object attribute set is "orange cat sofa headrest high-grade feeling".
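A minimal sketch of this splicing-with-union step, matching the example above; whitespace tokenisation is a simplification, and in practice a word-segmentation tool would supply the tokens.

    def build_attribute_set(text_data: str, visual_attributes: str) -> str:
        # Splice text data and visual attribute information, keeping the union of
        # tokens so that a duplicated attribute such as "orange" appears only once.
        seen, merged = set(), []
        for token in text_data.split() + visual_attributes.split():
            if token not in seen:
                seen.add(token)
                merged.append(token)
        return " ".join(merged)

    # build_attribute_set("orange cat sofa headrest", "orange high-grade feeling")
    # -> "orange cat sofa headrest high-grade feeling"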
In an optional implementation manner of this specification, taking a target commodity as the target object, the step of determining an object attribute set of the target object according to the text data and the visual attribute information may include the following step:
determining a commodity attribute set of the target commodity according to the text data and the visual attribute information, wherein the text data comprises at least one of a title, a brief introduction and product parameters of the target commodity.
Specifically, the title of the commodity generally includes a brand name of the commodity, the product profile generally includes a place of production, a function, and the like of the commodity, and the product parameters of the commodity generally include a size, a material, a product number, and the like of the commodity, which are selected according to actual conditions, and this is not limited in any way in the embodiments of the present specification.
For example, taking a throw pillow as the target commodity, the title of the target commodity is "bear hug pillow, plush giant backrest, bedhead pillow, birthday gift", the brief introduction of the target commodity is "panda-shaped hug pillow, lovely and soft to the touch, a good companion for browsing a mobile phone or reading", and the product parameters of the target commodity are "product number: 00001, material: other, size: 70cm × 90cm".
By applying the scheme of the embodiment of the specification, the commodity attribute set of the target commodity is determined according to the text data and the visual attribute information, wherein the text data comprises at least one of the title, the brief introduction and the product parameters of the target commodity, the object attribute of the target commodity is enriched, and the generated commodity description text is more coherent and accurate.
Step 308: and generating a target description text of the target object based on the object attribute set.
In one or more embodiments of the present disclosure, the image-text data of the target object is obtained, the visual attribute information of the target object is identified based on the image data, and after the object attribute set of the target object is determined according to the text data and the visual attribute information, the target description text of the target object may be further generated based on the object attribute set.
Specifically, the target description text refers to text that can concisely and exactly describe the target object. In the embodiments of this specification, the description text may also be understood as an abstract, a script, a summary, a synopsis, or an outline.
It should be noted that, taking the target commodity as the target object, the target description text of the target commodity is a commodity description text, and the step of generating the target description text of the target object based on the object attribute set may include the following step:
and generating a target description text of the target commodity based on the commodity attribute set.
By applying the scheme of the embodiments of this specification, image-text data of a target object is obtained, where the image-text data comprises image data and text data; visual attribute information of the target object is identified based on the image data, where the visual attribute information characterizes explicit features of the target object; an object attribute set of the target object is determined according to the text data and the visual attribute information; and a target description text of the target object is generated based on the object attribute set. By acquiring the multi-modal image-text data of the target object and determining its visual attribute information, the explicit features of the target object are taken into account, so that the object attributes of the target object are more comprehensive; and by determining the object attribute set of the target object from both the text data and the visual attribute information, the text data and the visual attribute information of the target object are integrated, so that the generated target description text is more coherent and its accuracy is further improved.
In practical applications, there are various ways of generating a target description text of a target object based on an object attribute set, and the method is specifically selected according to actual situations, which is not limited in this embodiment of the present specification.
In an optional implementation manner of this specification, word segmentation may be performed on text contents in the object attribute set, and each word obtained by word segmentation is processed by using a preset description text generation template to generate a target description text of the target object. The word segmentation processing mode may be to perform word segmentation processing by using a word segmentation tool, or may be to obtain a word segmentation result by using a preset word list matching, and the word segmentation result is specifically selected according to an actual situation, which is not limited in this embodiment of the present specification.
Illustratively, taking the text content in the object attribute set as "orange cat sofa headrest high-grade feeling", the text content is segmented to obtain the segmentation result "orange, cat, sofa headrest, high-grade feeling"; the preset description text generation template "XX is XX shaped to give people an XX feeling" is obtained; the segmentation result is filled into the description text generation template; and the target description text "the sofa headrest is orange cat shaped to give people a high-grade feeling" is obtained.
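For illustration, the sketch below segments the attribute-set text with the jieba word-segmentation tool and fills a preset description template; the template wording and the toy slot-assignment heuristic are assumptions of this sketch rather than part of this specification.

    import jieba  # an example word-segmentation tool; preset-vocabulary matching would also work

    TEMPLATE = "{product} is {appearance} shaped to give people a {impression} feeling"

    def template_describe(attribute_text: str) -> str:
        # Segment the attribute-set text and drop whitespace-only tokens.
        words = [w for w in jieba.lcut(attribute_text) if w.strip()]
        # Toy heuristic: first word = appearance, second word = product, last word = impression.
        appearance, product, impression = words[0], words[1], words[-1]
        return TEMPLATE.format(product=product, appearance=appearance, impression=impression)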
In another optional implementation manner of this specification, the target description text may be generated by using a pre-trained text processing model, that is, the step of generating the target description text of the target object based on the object attribute set may include the following steps:
and inputting the object attribute set into a pre-trained text processing model, and generating a target description text of the target object through the text processing model.
Specifically, the pre-trained text processing model is a model generated by training a preset processing model, where the preset processing model refers to a model capable of implementing text processing and having both contextual information and autoregressive characteristics, such as BART (Bidirectional and Auto-Regressive Transformers), the Text-to-Text Transfer Transformer (T5), or a Generative Pre-Training (GPT) model, selected according to the actual situation, which is not limited in the embodiments of this specification.
Taking the BART model as an example, BART has an encoder-decoder structure: the encoder input is a sequence with added noise, the decoder input is the original sequence shifted right (prefixed with a start token), and the decoder target is the original sequence.
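A minimal sketch of the generation step with a BART model from the transformers library; the public checkpoint name is a placeholder, since this embodiment would use a BART model fine-tuned on pairs of object attribute sets and description texts.

    from transformers import BartTokenizer, BartForConditionalGeneration

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")            # placeholder checkpoint
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    def generate_description(attribute_set_text: str) -> str:
        # Encode the object attribute set and decode the description text autoregressively.
        inputs = tokenizer(attribute_set_text, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, num_beams=4, max_length=64)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)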
By applying the scheme of the embodiment of the specification, the object attribute set is input into the pre-trained text processing model, and the target description text of the target object is generated through the text processing model, so that the efficiency of obtaining the target description text and the accuracy of the generated target description text are improved.
It should be noted that after the target description text of the target object is generated, the target description text may be displayed directly on the client. Alternatively, the target description text may be stored in a preset database and called from that database when the object currently displayed by a client is the target object; that is, after the step of generating the target description text of the target object based on the object attribute set, the method may further include the following steps:
under the condition that an object currently displayed by a client is a target object, calling a target description text from a preset database, wherein the preset database is used for storing the generated target description text;
displaying the target description text on the client; or, performing audio conversion on the target description text, and generating and playing audio data corresponding to the target description text.
Specifically, if the object currently displayed by the client is a target object, it indicates that a target description text of the target object needs to be acquired. At this time, the target description text may be searched in the preset database, and it is determined whether the preset database has a pre-generated target description text. If the target description text exists, the target description text is directly called from a preset database, and the target description text is displayed on the client. If the preset database does not have the target description text, the target description text can be generated in real time by using the text generation method provided by the embodiment of the specification, and the generated target description text is displayed at the client.
Further, since the client displays the target description text, the user can introduce the target object according to the target description text. In order to reduce the workload of the user, the text-to-audio conversion tool can be used for performing audio conversion on the target description text to generate audio data corresponding to the target description text, and after the audio data is generated, the audio data is actively played to introduce the target object.
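As a sketch of this audio-conversion step, the code below uses the pyttsx3 text-to-speech library; the library choice and the output path are assumptions, since the specification only requires some text-to-audio conversion tool.

    import pyttsx3  # an offline text-to-speech library, used here only as an example

    def describe_as_audio(description_text: str, out_path: str = "description.wav") -> str:
        # Convert the target description text into an audio file that the client can play.
        engine = pyttsx3.init()
        engine.save_to_file(description_text, out_path)
        engine.runAndWait()
        return out_path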
By applying the scheme of the embodiment of the specification, the target description text is called from the preset database under the condition that the object currently displayed by the client is the target object, so that the time for obtaining the target description text by a user is saved, and the user experience is improved; the target description text is displayed on the client side, a user does not need to carefully know the target object, and the target object can be introduced directly according to the target description text; and audio data corresponding to the target description text is generated and played without introduction by a user, so that a large amount of labor cost is saved.
The following describes in detail the way in which the text processing model is trained in the embodiment shown in fig. 1.
In one or more embodiments of the present description, a training method of a text processing model may include the following steps:
acquiring a first sample set, wherein the first sample set comprises a plurality of sample objects, and each sample object carries sample text data and a sample description text;
identifying each sample description text, and determining sample visual attribute information of each sample object;
performing data augmentation on each sample text data, and determining augmented text data of each sample object;
and training a preset processing model based on the sample visual attribute information, the sample text data and the augmented text data of the plurality of sample objects to obtain a text processing model.
Specifically, sample objects are the objects used to train the text processing model, including but not limited to goods, people, landscapes, scenic spots, and the like. The sample text data carried by a sample object is text data describing the sample object, such as its name, attributes, detail information, or process information. The sample description text is the description text corresponding to the sample object; it can also be understood as a sample abstract, a sample script, a sample summary, a sample synopsis, or a sample outline. In general, the first sample set may be formed from a large amount of manually input sample text data and sample description texts, or by reading a large amount of sample text data and sample description texts from other data acquisition devices or databases; this is selected according to the actual situation and is not limited in the embodiments of this specification.
In practical application, the sample visual attribute information of each sample object may be determined from its sample description text in several ways: the sample description text may be segmented into words and each segmentation result matched against a preset visual attribute vocabulary to obtain the sample visual attribute information; alternatively, part-of-speech tagging may be performed directly on the sample description text, keeping the resulting nouns and adjectives as the sample visual attribute information.
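The part-of-speech-tagging variant can be sketched as follows, using jieba's POS tagger purely as an example; the tag prefixes that are kept ("n" for nouns, "a" for adjectives) are an assumption of this sketch.

    import jieba.posseg as pseg  # example POS tagger; any part-of-speech tagging tool works

    def extract_sample_visual_attributes(sample_description: str) -> list:
        # Keep the nouns and adjectives of the sample description text as the
        # sample visual attribute information.
        return [pair.word for pair in pseg.lcut(sample_description)
                if pair.flag.startswith(("n", "a"))]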
In the embodiments of this specification, it is taken into account that the same meaning can be expressed by several different words; for example, "good-looking" can also be expressed as "beautiful", "high in colour value", and so on. Data augmentation can therefore be performed on the sample text data of a sample object: the sample text data is expanded and diversified, and a certain amount of noise is added to the sample data, so that the trained model has stronger generalization capability.
For example, if the sample text data of a sample object is "this piece of clothing is really good-looking", the word "good-looking" in the sample text data can be replaced by a similar word, realising data augmentation on the sample text data; the augmented text data is then "this piece of clothing is really beautiful", "this piece of clothing is really pretty", and so on. There may be one or more pieces of augmented text data, selected according to the actual situation, which is not limited in the embodiments of this specification.
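A toy sketch of this synonym-replacement augmentation; the synonym table is illustrative and would in practice come from a synonym lexicon or an embedding-based lookup.

    import random

    SYNONYMS = {"good-looking": ["beautiful", "pretty", "high in colour value"]}  # illustrative table

    def augment(sample_text: str, n: int = 2) -> list:
        # Produce augmented text data by swapping known words for synonyms,
        # which diversifies the sample text data and adds mild noise.
        augmented = []
        for _ in range(n):
            words = [random.choice(SYNONYMS[w]) if w in SYNONYMS else w
                     for w in sample_text.split()]
            augmented.append(" ".join(words))
        return augmented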
By applying the scheme of the embodiment of the specification, a first sample set is obtained, wherein the first sample set comprises a plurality of sample objects, each sample object carries sample text data and a sample description text, each sample description text is identified, sample visual attribute information of each sample object is determined, data augmentation is performed on each sample text data, augmented text data of each sample object is determined, a preset processing model is trained on the basis of the sample visual attribute information, the sample text data and the augmented text data of the plurality of sample objects, a text processing model is obtained, the explicit characteristics of the sample objects are considered, the object attributes of the sample objects are more comprehensive, the sample text data of the sample objects are augmented, the sample text data are more diversified, the trained model has stronger generalization capability, and the accuracy of the trained model is improved.
For example, taking a sample object as a sample commodity, sample text data and a sample description text may be obtained from a live broadcast of the sample commodity and a commodity detail page, and a first sample set is further constructed, that is, the step of obtaining the first sample set may include the following steps:
extracting live broadcast data of each sample commodity from a live broadcast room of a plurality of sample commodities, wherein the live broadcast data comprises video data and voice data;
identifying and converting the live broadcast data to generate a sample description text of each sample commodity;
extracting sample text data of each sample commodity from detail pages of a plurality of sample commodities;
and constructing a first sample set according to the sample text data and the sample description texts of the plurality of sample commodities.
Specifically, because a commodity detail page contains a large amount of commodity detail information and the whole detail page is contextually and semantically consistent, it covers the commodity's image-text data fairly completely. Thus, the sample text data of each sample commodity may be extracted from its detail page, using, for example, OCR technology. In addition, live broadcast data of each sample commodity, comprising video data and voice data, can be collected from the live broadcast room of the sample commodity; the live broadcast data is recognized and converted using ASR technology to generate the sample description text of the sample commodity. After the sample text data and the sample description texts are obtained, the first sample set may be constructed, where a sample description text can be understood as the sample label carried by a sample object, and the sample label represents the result that the preset processing model is really intended to output.
By applying the scheme of the embodiment of the specification, live broadcast data of each sample commodity is extracted from a live broadcast room of the sample commodities, wherein the live broadcast data comprises video data and voice data, the live broadcast data is subjected to identification conversion to generate a sample description text of each sample commodity, the sample text data of each sample commodity is extracted from a detail page of the sample commodities, and a first sample set is constructed according to the sample text data and the sample description text of the sample commodities, so that the sample text data in the sample set is consistent in context and semantics, and the accuracy of the trained model is further improved.
Further, after obtaining the sample visual attribute information, the sample text data, and the augmented text data of the plurality of sample objects, the method may respectively process the sample text data and the augmented text data based on the sample visual attribute information to determine the initial training sample and the augmented training sample of each sample object, that is, the step of training the preset processing model based on the sample visual attribute information, the sample text data, and the augmented text data of the plurality of sample objects to obtain the text processing model may include the following steps:
combining the sample text data and the sample visual attribute information of each sample object, and determining an initial training sample of each sample object;
combining the augmented text data and the sample visual attribute information of each sample object, and determining an augmented training sample of each sample object;
and training a preset processing model by using the initial training samples, the augmentation training samples and the sample description texts of the plurality of sample objects to obtain a text processing model.
Specifically, the sample text data and the sample visual attribute information of each sample object are combined to determine the initial training sample of that sample object, and the augmented text data and the sample visual attribute information of each sample object are combined to determine its augmented training sample. The combination may be plain text splicing, or the text data may be de-duplicated before splicing.
By applying the scheme of the embodiment of the specification, the sample text data and the sample visual attribute information of each sample object are combined, the initial training sample of each sample object is determined, the augmented text data and the sample visual attribute information of each sample object are combined, the augmented training sample of each sample object is determined, the preset processing model is trained by using the initial training samples, the augmented training samples and the sample description texts of a plurality of sample objects, and the text processing model is obtained. By integrating text data and sample visual attribute information, the object attributes of the sample objects are enriched, and the generalization of the trained model is improved.
Further, after obtaining the initial training samples and the augmented training samples of the sample objects, the step of training the preset processing model based on the initial training samples and the augmented training samples, that is, training the preset processing model by using the initial training samples, the augmented training samples and the sample description texts of the plurality of sample objects to obtain the text processing model, may include the following steps:
extracting a first initial training sample and a first augmented training sample of a first sample object, wherein the first sample object is any sample object in a first sample set;
inputting the first initial training sample into a preset processing model to generate a first prediction description text, and inputting the first augmented training sample into the preset processing model to generate a second prediction description text;
calculating a first loss value according to the first prediction description text and the first sample description text;
calculating a second loss value according to the second prediction description text and the first sample description text;
calculating a third loss value according to the first prediction description text and the second prediction description text;
adjusting model parameters of a preset processing model based on the first loss value, the second loss value and the third loss value, and returning to the step of extracting a first initial training sample and a first augmented training sample of the first sample object;
and under the condition that a first training stopping condition is reached, obtaining the text processing model which completes training.
Specifically, the first sample description text is the result that the preset processing model is really intended to output, that is, the first sample description text is the real result. The first prediction description text generated by inputting the first initial training sample into the preset processing model and the second prediction description text generated by inputting the first augmented training sample into the preset processing model are prediction results of the preset processing model; when the difference between the prediction results and the real result is small enough, that is, when the first loss value and the second loss value are small enough, the prediction results are close enough to the real result.
In particular, since the first augmented training sample is the first initial training sample with noise added, the third loss value may be calculated from the first prediction description text and the second prediction description text in order to make the predictions of the preset processing model on the first initial training sample and the first augmented training sample approximate each other and to improve the anti-noise capability of the preset processing model. Finally, after the first loss value, the second loss value, and the third loss value are obtained, the model parameters of the preset processing model may be adjusted based on the three loss values, and the step of extracting the first initial training sample and the first augmented training sample of the first sample object is returned to and executed; the text processing model that has completed training is obtained when the first training stop condition is reached.
It should be noted that the first loss value and the second loss value may be calculated by using a cross entropy loss function, and the third loss value may be calculated by using a relative entropy loss function (KLD), where the first training stop condition includes, but is not limited to, a first preset threshold and a first preset iteration number, and is specifically selected according to an actual situation, and this is not limited in this embodiment of the specification.
By applying the scheme of the embodiment of the specification, the efficiency and the accuracy of calculating the first loss value and the second loss value are improved by using the cross entropy loss function, the efficiency and the accuracy of calculating the third loss value are improved by using the relative entropy loss function, and the trained text processing model is further more accurate.
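As an illustration of how the three loss values can be computed in one training step, the following is a minimal PyTorch sketch. It assumes the preset processing model is callable on token ids and returns token-level logits aligned with the label sequence, and the equal weighting of the three losses is an assumption, since the specification does not state how they are combined.

import torch
import torch.nn.functional as F

def training_step(model, optimizer, initial_ids, augmented_ids, label_ids):
    """One parameter update of the preset processing model for one sample object."""
    logits_a = model(initial_ids)    # logits of the first prediction description text
    logits_b = model(augmented_ids)  # logits of the second prediction description text

    # First and second loss values: cross entropy against the first sample description text.
    loss1 = F.cross_entropy(logits_a.view(-1, logits_a.size(-1)), label_ids.view(-1))
    loss2 = F.cross_entropy(logits_b.view(-1, logits_b.size(-1)), label_ids.view(-1))

    # Third loss value: relative entropy (KL divergence) between the two predictions,
    # encouraging consistent outputs on the original and noise-augmented inputs.
    loss3 = F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a, dim=-1),
                     reduction="batchmean")

    loss = loss1 + loss2 + loss3  # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()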
In an optional implementation manner of this specification, in order to learn better text features, the encoder in the preset processing model may be constrained with a mutual information maximization loss function using the initial training sample and the sample description text of each sample object. That is, the preset processing model includes an encoder, and before the step of inputting the first initial training sample into the preset processing model to generate the first prediction description text and inputting the first augmented training sample into the preset processing model to generate the second prediction description text, the method may further include the following steps:
inputting the first initial training sample into an encoder to generate a first feature vector;
inputting the first sample description text into an encoder to generate a second feature vector;
calculating a coding loss value according to the first eigenvector and the second eigenvector;
adjusting parameters of the encoder based on the encoding loss value, and returning to execute the step of inputting the first initial training sample into the encoder to generate a first feature vector;
in the case where the second training stop condition is reached, the encoder that completed training is determined.
Specifically, the coding loss value may be calculated using formula (1). The formula itself is given as an image in the original filing (Figures BDA0003822810610000141 and BDA0003822810610000142) and is not reproduced here; its quantities are defined as follows: B is the batch size during training (B per-sample losses are calculated each time the parameters are updated); z_i = avg(Z_i), where avg denotes the average pooling operation and Z_i is the feature vector obtained after the i-th initial training sample is input into the encoder; and z_y = avg(Z_y), where Z_y is the feature vector obtained after the i-th sample description text is input into the encoder.
It should be noted that the second training stopping condition includes, but is not limited to, a second preset threshold and a second preset iteration number, which are specifically selected according to an actual situation, and this is not limited in this embodiment of the present specification.
By applying the scheme of the embodiment of the specification, the first initial training sample is input into the encoder to generate the first feature vector, the first sample description text is input into the encoder to generate the second feature vector, the coding loss value is calculated from the first feature vector and the second feature vector, the parameters of the encoder are adjusted based on the coding loss value, and the step of inputting the first initial training sample into the encoder to generate the first feature vector is returned to and executed; the encoder that has completed training is determined when the second training stop condition is reached. Constraining the encoder in the preset processing model with a mutual information maximization loss function allows the preset processing model to learn better text features, so that the trained text processing model is more accurate.
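The exact mutual information maximization loss is given as formula (1) in the filing (as an image); the sketch below uses an InfoNCE-style contrastive loss, which is one common way of realizing such an objective, purely as an assumption. It applies the average pooling described above to the encoder outputs of the initial training samples and the sample description texts, treating matching pairs within a batch of size B as positives.

import torch
import torch.nn.functional as F

def coding_loss(Z_x, Z_y, tau=0.1):
    """InfoNCE-style mutual information objective between encoder features.

    Z_x: encoder outputs for the B initial training samples, shape (B, T, H)
    Z_y: encoder outputs for the B sample description texts, shape (B, T_y, H)
    """
    z_x = F.normalize(Z_x.mean(dim=1), dim=-1)  # average pooling, then length normalization
    z_y = F.normalize(Z_y.mean(dim=1), dim=-1)
    sim = z_x @ z_y.t() / tau                   # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    # The i-th training sample and the i-th description text form the positive pair;
    # all other pairings in the batch act as negatives.
    return F.cross_entropy(sim, targets)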
Referring to fig. 4, fig. 4 shows a flowchart of training a text processing model in a text generation method provided in an embodiment of the present specification, which specifically includes:
obtaining a plurality of sample objects, wherein each sample object carries sample text data and a sample description text; identifying each sample description text, and determining sample visual attribute information of each sample object; performing data augmentation on each sample text data, and determining augmented text data of each sample object; combining the sample text data and the sample visual attribute information of each sample object, and generating a first prediction description text with the encoder and decoder of the preset processing model from the combined result; combining the augmented text data and the sample visual attribute information of each sample object, and generating a second prediction description text with the encoder and decoder of the preset processing model from the combined result; calculating a first loss value according to the first prediction description text and the sample description text; calculating a second loss value according to the second prediction description text and the sample description text; calculating a third loss value according to the first prediction description text and the second prediction description text; and adjusting the model parameters of the preset processing model based on the first loss value, the second loss value and the third loss value, and obtaining the trained text processing model under the condition that the first training stopping condition is reached.
The preset processing model comprises an encoder and a decoder. The combined sample text data and sample visual attribute information of each sample object are input into the encoder to generate a first feature vector; the sample description text of each sample object is input into the encoder to generate a second feature vector; a coding loss value is calculated from the first feature vector and the second feature vector; and the parameters of the encoder are adjusted based on the coding loss value, with the encoder that has completed training being determined when the second training stop condition is reached.
The following describes in detail the training method of the picture classification model in the embodiment shown in fig. 1.
In one or more embodiments of the present specification, a training method of a picture classification model may include the following steps:
acquiring a second sample set, wherein the second sample set comprises a plurality of sample objects, and each sample object carries sample image data and a sample description text;
identifying each sample description text, and determining sample visual attribute information of each sample object;
and training a preset classification model by using the sample image data and the sample visual attribute information of the plurality of sample objects to obtain a picture classification model.
Specifically, for the specific manner of obtaining the second sample set, identifying each sample description text, and determining the sample visual attribute information of each sample object, reference may be made to the training manner of the text processing model, and the details are not repeated in the embodiments of this specification. The dominant characteristics of the sample objects are considered when determining the sample visual attribute information of each sample object, so that the object attributes of the sample objects are more comprehensive, which improves the accuracy of the trained model.
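The specification does not state how the sample description texts are identified to obtain the visual attribute labels; one simple possibility, shown purely as an assumption, is to match a predefined attribute vocabulary against each description. The vocabulary and helper below are hypothetical.

# Hypothetical attribute vocabulary; in practice it would be curated per commodity category.
VISUAL_ATTRIBUTE_VOCAB = ("white", "brown", "striped", "warm-adjusted", "soft colors")

def extract_visual_attributes(description_text):
    """Label a sample object with the visual attributes mentioned in its description."""
    text = description_text.lower()
    return [attr for attr in VISUAL_ATTRIBUTE_VOCAB if attr in text]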
Further, the step of training a preset classification model by using the sample image data and the sample visual attribute information of the plurality of sample objects to obtain the picture classification model may include the following steps:
extracting second sample image data and second sample visual attribute information of a second sample object, wherein the second sample object is any sample object in a second sample set;
inputting the second sample image data into a preset classification model to obtain the predicted visual attribute information of the second sample object;
calculating a classification loss value of a preset classification model according to the second sample visual attribute information and the predicted visual attribute information of the second sample object;
adjusting the model parameters of a preset classification model according to the classification loss value, and returning to execute the step of extracting the second sample image data and the second sample visual attribute information of the second sample object;
and under the condition that a third training stopping condition is reached, obtaining the trained picture classification model.
It should be noted that the classification loss value may be calculated based on the predicted visual attribute information of the second sample object and the second sample visual attribute information, where the second sample visual attribute information represents the result that the preset classification model is really intended to output. The second sample image data is input into the preset classification model, and the output predicted visual attribute information is the prediction result of the preset classification model; when the difference between the prediction result and the real result is small enough, that is, when the classification loss value is small enough, the prediction result is close enough to the real result, the training of the preset classification model is completed, and the trained picture classification model is obtained.
In the embodiment of the present specification, the difference between the prediction result and the real result of the preset classification model can be shown intuitively by calculating the classification loss value; the preset classification model can then be trained in a targeted manner based on that difference and its parameters adjusted, which effectively improves both the training speed and the training effect of the preset classification model.
It should be noted that the third training stopping condition includes, but is not limited to, a third preset threshold and a third preset iteration number, which are specifically selected according to an actual situation, and this is not limited in this embodiment of the present specification.
In one possible implementation, whether to stop training may be determined only from the relationship between the classification loss value and the third preset threshold. Specifically, if the classification loss value is greater than the third preset threshold, the difference between the second sample visual attribute information and the predicted visual attribute information of the second sample object is large and the classification recognition capability of the preset classification model is still poor; in this case the model parameters of the preset classification model may be adjusted, the step of extracting the second sample image data and the second sample visual attribute information of the second sample object is returned to, and the preset classification model continues to be trained until the classification loss value is less than or equal to the third preset threshold, which indicates that the difference between the second sample visual attribute information and the predicted visual attribute information of the second sample object is small; training then stops and the trained picture classification model is obtained.
The third preset threshold is the critical value of the classification loss value. When the classification loss value is greater than the third preset threshold, a certain deviation still exists between the prediction result and the real result of the preset classification model, so the model parameters of the preset classification model still need to be adjusted and the preset classification model trained further; when the classification loss value is less than or equal to the third preset threshold, the prediction result is close enough to the real result and the training can be stopped.
In another possible implementation manner, in addition to comparing the classification loss value with the third preset threshold, whether the current preset classification model has finished training may be determined in combination with the number of iterations. Specifically, if the classification loss value is less than or equal to the third preset threshold, the difference between the second sample visual attribute information and the predicted visual attribute information of the second sample object is small; training is stopped and the trained picture classification model is obtained, that is, when the classification loss value is less than or equal to the third preset threshold, training stops without considering the iteration count. If the classification loss value is greater than the third preset threshold, it is determined whether the current number of iterations has reached the third preset iteration number; if it has not, the model parameters of the preset classification model are adjusted, the step of extracting the second sample image data and the second sample visual attribute information of the second sample object is returned to, and the preset classification model continues to be trained, with iteration stopping once the third preset iteration number is reached, at which point the trained picture classification model is obtained.
The third preset threshold and the third preset iteration number are specifically selected according to actual conditions, and this is not limited in the embodiments of the present specification. When the iteration times reach a third preset iteration time, the training times of the preset classification model are enough, the approximation degree of the prediction result and the real result of the preset classification model is enough, and the training can be stopped.
In practical applications, there are many functions for calculating the classification loss value, such as a cross entropy loss function, an L1 norm loss function, a maximum loss function, a mean square error loss function, a logarithmic loss function, and the like, which are specifically selected according to practical situations; the embodiments of the present specification do not limit this in any way.
By applying the scheme of the embodiment of the specification, the specific training condition of the preset classification model can be judged according to the classification loss value, and the model parameters of the preset classification model are reversely adjusted according to the classification loss value under the condition that the training is not qualified, so that the classification recognition capability of the model is improved, the training speed is high, and the training effect is good.
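For illustration, the following sketch trains a preset classification model with the dual stopping criterion described above (a loss threshold or a maximum iteration number). Treating the visual attributes as a multi-label target and using a binary cross-entropy loss is an assumption; any of the loss functions listed above could be substituted, and the threshold and iteration values are placeholders.

import itertools
import torch
import torch.nn.functional as F

def train_classifier(model, optimizer, loader,
                     loss_threshold=0.05,     # third preset threshold (placeholder value)
                     max_iterations=10_000):  # third preset iteration number (placeholder value)
    """Train the preset classification model until either stop condition is met."""
    for iteration, (images, attribute_labels) in enumerate(itertools.cycle(loader), start=1):
        logits = model(images)  # predicted visual attribute information of the sample object
        # Visual attributes are treated here as a multi-label target, hence a binary
        # cross-entropy classification loss; cross entropy, L1, MSE or log loss could
        # be used instead, as noted in the description.
        loss = F.binary_cross_entropy_with_logits(logits, attribute_labels.float())

        if loss.item() <= loss_threshold:
            break                             # prediction close enough to the real result
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if iteration >= max_iterations:
            break                             # enough training iterations have been run
    return model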
Referring to fig. 5, fig. 5 shows a training flowchart of an image classification model in a text generation method provided in an embodiment of the present specification, specifically including:
obtaining a plurality of sample objects, wherein each sample object carries sample image data and a sample description text; identifying each sample description text, and determining sample visual attribute information of each sample object; inputting the sample image data of each sample object into a preset classification model to obtain predicted visual attribute information; calculating a classification loss value of a preset classification model according to the sample visual attribute information and the predicted visual attribute information; and adjusting parameters of the preset classification model according to the classification loss value, and obtaining the trained picture classification model under the condition of reaching a third training stop condition.
In the following, with reference to fig. 6, the text generation method provided in this specification is further described by taking an application of the text generation method in an e-commerce live broadcast scenario as an example. Fig. 6 shows a flowchart of a processing procedure of a text generation method provided in an embodiment of the present specification, which specifically includes the following steps:
step 602: acquiring detail page data of the target commodity, wherein the detail page data comprises image data and text data, and the text data comprises at least one of a title, a brief introduction and product parameters of the target commodity.
Referring to fig. 7, fig. 7 is a schematic diagram illustrating a detail page of a target product in a text generation method according to an embodiment of the present specification.
The detail page of the target commodity includes image data of coffee cups, such as the two coffee cups in the figure, and also includes the title of the target commodity: large-capacity coffee cup with a spoon; the introduction of the target commodity: fired with high glaze, safe and reassuring, warm-adjusted, bringing a different experience to life; and the product parameters of the target commodity: rich styles, 500 ml.
Step 604: inputting image data into a pre-trained picture classification model, and obtaining visual attribute information of the target commodity through classification and identification of the picture classification model, wherein the visual attribute information represents the dominant characteristics of the target commodity.
Specifically, the image data is input into a pre-trained image classification model, and the visual attribute information of the target commodity is white, warm-adjusted brown, striped, soft in color, simple and generous through classification and identification of the image classification model.
Step 606: and combining the text data and the visual attribute information to determine a commodity attribute set of the target commodity.
Specifically, text data and visual attribute information are combined, and a commodity attribute set of a target commodity is determined as 'coffee cup high-capacity roasting with a spoon and high glaze, safety and reassurance, warm adjustment, different experience and abundant styles for life, 500ml, white, warm adjustment brown, striped, soft in color, simple and generous'.
Step 608: and inputting the commodity attribute set into a pre-trained text processing model, and generating a target description text of the target commodity through the text processing model.
Specifically, referring to fig. 8, fig. 8 shows a schematic display interface of a client in a text generation method provided in an embodiment of the present specification. The target description text included in the client display interface is: "This is a high-capacity coffee cup with a spoon, and its capacity is 500 ml. The coffee cup comes in rich styles: white and warm-adjusted brown, striped and non-striped. The colors are soft, simple and generous. The coffee cup is fired with high glaze, which ensures safety and reassurance and brings a different experience to life."
Step 610: and displaying the target description text at the client so that the virtual anchor introduces the target commodity according to the target description text.
By applying the scheme of the embodiment of the specification, detail page data of a target commodity is obtained; the image data in the detail page data is input into a pre-trained picture classification model, and the visual attribute information of the target commodity is obtained through the classification and recognition of the picture classification model; the text data in the detail page data and the visual attribute information are combined to determine the commodity attribute set of the target commodity; the commodity attribute set is input into a pre-trained text processing model, and the target description text of the target commodity is generated by the text processing model; and the target description text is displayed at the client so that the virtual anchor can introduce the target commodity according to the target description text. Multi-modal data and algorithms are thus combined and applied to the construction flow of the virtual anchor script to guide the construction of content that matches the characteristics of the live broadcast scene; input of multi-source text data and image data is supported, and generation of long texts is supported, thereby achieving automatic generation of commodity summaries.
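The end-to-end flow of steps 602 to 608 can be summarized by the sketch below; it reuses the merge_attributes helper sketched earlier, and picture_classifier, text_model and the dictionary layout of the detail page data are hypothetical stand-ins for the trained models and the actual data format.

def generate_item_script(detail_page, picture_classifier, text_model):
    """End-to-end flow of steps 602 to 608 for one target commodity."""
    # Step 602: detail page data = image data + text data (title, introduction, parameters).
    image_data = detail_page["images"]
    text_data = ", ".join(filter(None, [detail_page.get("title"),
                                        detail_page.get("introduction"),
                                        detail_page.get("parameters")]))

    # Step 604: classify the images to obtain the explicit visual attributes.
    visual_attributes = picture_classifier(image_data)

    # Step 606: combine text data and visual attributes into the commodity attribute set.
    attribute_set = merge_attributes(text_data, visual_attributes)

    # Step 608: the text processing model turns the attribute set into the target description text.
    return text_model.generate(attribute_set)

# Step 610: the returned text would then be displayed at the client so that the
# virtual anchor can introduce the commodity from it.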
Corresponding to the above method embodiment, this specification further provides a text generation apparatus embodiment, and fig. 9 shows a schematic structural diagram of a text generation apparatus provided in an embodiment of this specification. As shown in fig. 9, the apparatus includes:
an obtaining module 902 configured to obtain image-text data of a target object, wherein the image-text data comprises image data and text data;
an identifying module 904 configured to identify visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object;
a determining module 906 configured to determine an object property set of the target object according to the text data and the visual property information;
a generating module 908 configured to generate a target description text of the target object based on the set of object properties.
Optionally, the obtaining module 902 is further configured to monitor the image-text data of the target object, and acquire the image-text data of the target object in the case that the image-text data is updated.
Optionally, the apparatus further comprises: the calling module is configured to call the target description text from a preset database under the condition that the object currently displayed by the client is the target object, wherein the preset database is used for storing the generated target description text; displaying the target description text on the client; or, performing audio conversion on the target description text, and generating and playing audio data corresponding to the target description text.
Optionally, the target object comprises a target commodity; a determining module 906, further configured to determine a product attribute set of the target product according to the text data and the visual attribute information, wherein the text data includes at least one of a title, a brief introduction, and a product parameter of the target product;
the generating module 908 is further configured to generate a target description text of the target item based on the item attribute set.
Optionally, the generating module 908 is further configured to input the object attribute set into a pre-trained text processing model, and generate a target description text of the target object through the text processing model;
the device also includes: the text processing model training module is configured to obtain a first sample set, wherein the first sample set comprises a plurality of sample objects, and each sample object carries sample text data and a sample description text; identifying each sample description text, and determining sample visual attribute information of each sample object; performing data augmentation on each sample text data, and determining augmented text data of each sample object; and training a preset processing model based on the sample visual attribute information, the sample text data and the augmented text data of the plurality of sample objects to obtain a text processing model.
Optionally, the sample object comprises a sample commodity; the text processing model training module is further configured to extract live broadcast data of each sample commodity from a live broadcast room of the sample commodities, wherein the live broadcast data comprises video data and voice data; identifying and converting the live broadcast data to generate a sample description text of each sample commodity; extracting sample text data of each sample commodity from detail pages of a plurality of sample commodities; and constructing a first sample set according to the sample text data and the sample description texts of the plurality of sample commodities.
Optionally, the text processing model training module is further configured to combine the sample text data and the sample visual attribute information of each sample object, and determine an initial training sample of each sample object; combining the augmented text data and the sample visual attribute information of each sample object, and determining an augmented training sample of each sample object; and training a preset processing model by using the initial training samples, the augmentation training samples and the sample description texts of the plurality of sample objects to obtain a text processing model.
Optionally, the text processing model training module is further configured to extract a first initial training sample and a first augmented training sample of a first sample object, where the first sample object is any sample object in the first sample set; inputting the first initial training sample into a preset processing model to generate a first prediction description text, and inputting the first augmented training sample into the preset processing model to generate a second prediction description text; calculating a first loss value according to the first prediction description text and the first sample description text; calculating a second loss value according to the second prediction description text and the first sample description text; calculating a third loss value according to the first prediction description text and the second prediction description text; adjusting model parameters of a preset processing model based on the first loss value, the second loss value and the third loss value, and returning to the step of extracting a first initial training sample and a first augmented training sample of the first sample object; and under the condition that a first training stopping condition is reached, obtaining the text processing model which completes training.
Optionally, the predetermined processing model comprises an encoder; the device also includes: an encoder training module configured to input a first initial training sample into an encoder, generating a first feature vector; inputting the first sample description text into an encoder to generate a second feature vector; calculating a coding loss value according to the first eigenvector and the second eigenvector; adjusting parameters of an encoder based on the encoding loss value, and returning to execute the step of inputting the first initial training sample into the encoder to generate a first feature vector; in the case where the second training stop condition is reached, the encoder that completed training is determined.
Optionally, the identifying module 904 is further configured to input the image data into a pre-trained image classification model, and obtain the visual attribute information of the target object through classification and identification of the image classification model;
the device also includes: the image classification model training module is configured to acquire a second sample set, wherein the second sample set comprises a plurality of sample objects, and each sample object carries sample image data and a sample description text; identifying each sample description text, and determining sample visual attribute information of each sample object; and training a preset classification model by using the sample image data and the sample visual attribute information of the plurality of sample objects to obtain a picture classification model.
Optionally, the image classification model training module is further configured to extract second sample image data and second sample visual attribute information of a second sample object, where the second sample object is any sample object in the second sample set; inputting the second sample image data into a preset classification model to obtain the predicted visual attribute information of the second sample object; calculating a classification loss value of a preset classification model according to the second sample visual attribute information and the predicted visual attribute information of the second sample object; adjusting the model parameters of a preset classification model according to the classification loss value, and returning to execute the step of extracting the second sample image data and the second sample visual attribute information of the second sample object; and under the condition that a third training stopping condition is reached, obtaining the image classification model after training is finished.
By applying the scheme of the embodiment of the specification, image-text data of a target object are obtained, wherein the image-text data comprise image data and text data; identifying visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object; determining an object attribute set of the target object according to the text data and the visual attribute information; and generating a target description text of the target object based on the object attribute set. The method comprises the steps of obtaining multi-mode image-text data of a target object, determining visual attribute information of the target object, considering the dominant characteristic of the target object, enabling the object attribute of the target object to be more comprehensive, determining an object attribute set of the target object according to the text data and the visual attribute information, integrating the text data and the visual attribute information of the target object, enabling a generated target description text to be more coherent, and further improving the accuracy of the target description text.
The above is a schematic scheme of a text generating apparatus of the present embodiment. It should be noted that the technical solution of the text generation apparatus and the technical solution of the text generation method belong to the same concept, and details that are not described in detail in the technical solution of the text generation apparatus can be referred to the description of the technical solution of the text generation method.
Fig. 10 shows a block diagram of a computing device according to an embodiment of the present specification. The components of the computing device 1000 include, but are not limited to, a memory 1010 and a processor 1020. The processor 1020 is coupled to the memory 1010 via a bus 1030 and the database 1050 is used to store data.
Computing device 1000 also includes an access device 1040, which enables computing device 1000 to communicate via one or more networks 1060. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 1040 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1000 and other components not shown in FIG. 10 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 10 is for purposes of example only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1000 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1000 may also be a mobile or stationary server.
Wherein the processor 1020 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the text generation method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text generation method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text generation method.
An embodiment of the present specification also provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the text generation method described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text generation method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text generation method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the text generation method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the text generation method belong to the same concept, and details that are not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the text generation method.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of actions, but those skilled in the art should understand that the present embodiment is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the present embodiment. Furthermore, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily all required by this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, and to thereby enable others skilled in the art to best understand the specification and utilize the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A text generation method, comprising:
acquiring image-text data of a target object, wherein the image-text data comprises image data and text data;
identifying visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object;
determining an object attribute set of the target object according to the text data and the visual attribute information;
and generating a target description text of the target object based on the object attribute set.
2. The method of claim 1, wherein the step of acquiring image-text data of the target object comprises:
monitoring the image-text data of the target object;
and acquiring the image-text data of the target object in the case that the image-text data is updated.
3. The method according to claim 1 or 2, further comprising, after the step of generating the target description text of the target object based on the object attribute set:
under the condition that the object currently displayed by the client is the target object, calling the target description text from a preset database, wherein the preset database is used for storing the generated target description text;
displaying the target description text on the client; or, performing audio conversion on the target description text, and generating and playing audio data corresponding to the target description text.
4. The method of claim 1, wherein the target object comprises a target commodity; and the step of determining an object attribute set of the target object according to the text data and the visual attribute information comprises:
determining a commodity attribute set of the target commodity according to the text data and the visual attribute information, wherein the text data comprises at least one of a title, a brief introduction, and a product parameter of the target commodity;
the step of generating a target description text of the target object based on the object attribute set comprises:
and generating a target description text of the target commodity based on the commodity attribute set.
5. The method of claim 1, wherein the step of generating a target description text of the target object based on the object attribute set comprises:
inputting the object attribute set into a pre-trained text processing model, and generating a target description text of the target object through the text processing model;
the training mode of the text processing model comprises the following steps:
obtaining a first sample set, wherein the first sample set comprises a plurality of sample objects, and each sample object carries sample text data and a sample description text;
identifying each sample description text, and determining sample visual attribute information of each sample object;
performing data augmentation on each sample text data, and determining augmented text data of each sample object;
and training a preset processing model based on the sample visual attribute information, the sample text data and the augmented text data of the plurality of sample objects to obtain the text processing model.
6. The method of claim 5, the sample object comprising a sample commodity; the step of obtaining a first set of samples comprises:
extracting live broadcast data of each sample commodity from a live broadcast room of a plurality of sample commodities, wherein the live broadcast data comprises video data and voice data;
identifying and converting the live broadcast data to generate sample description texts of the sample commodities;
extracting sample text data of each sample commodity from the detail pages of the plurality of sample commodities;
and constructing the first sample set according to the sample text data and the sample description texts of the plurality of sample commodities.
7. The method of claim 5, wherein the step of training a pre-set processing model based on the sample visual attribute information, the sample text data, and the augmented text data of the plurality of sample objects to obtain the text processing model comprises:
combining the sample text data and the sample visual attribute information of each sample object, and determining an initial training sample of each sample object;
combining the augmented text data and the sample visual attribute information of each sample object, and determining an augmented training sample of each sample object;
and training a preset processing model by using the initial training samples, the augmentation training samples and the sample description texts of the plurality of sample objects to obtain the text processing model.
8. The method of claim 7, wherein the step of training a predetermined processing model using the initial training samples, the augmented training samples, and the sample description texts of the plurality of sample objects to obtain the text processing model comprises:
extracting a first initial training sample and a first augmented training sample of a first sample object, wherein the first sample object is any sample object in the first sample set;
inputting the first initial training sample into a preset processing model to generate a first prediction description text, and inputting the first augmented training sample into the preset processing model to generate a second prediction description text;
calculating a first loss value according to the first prediction description text and the first sample description text;
calculating a second loss value from the second prediction description text and the first sample description text;
calculating a third loss value according to the first prediction description text and the second prediction description text;
adjusting model parameters of the preset processing model based on the first loss value, the second loss value and the third loss value, and returning to execute the step of extracting the first initial training sample and the first augmented training sample of the first sample object;
and under the condition that a first training stopping condition is reached, obtaining the text processing model which completes training.
9. The method of claim 8, the pre-set processing model comprising an encoder; before the step of inputting the first initial training sample into a preset processing model to generate a first prediction description text, and inputting the first augmented training sample into the preset processing model to generate a second prediction description text, the method further includes:
inputting the first initial training sample into the encoder to generate a first feature vector;
inputting the first sample description text into the encoder to generate a second feature vector;
calculating a coding loss value according to the first eigenvector and the second eigenvector;
adjusting parameters of the encoder based on the encoding loss value, and returning to execute the step of inputting the first initial training sample into the encoder to generate a first feature vector;
in case a second training stop condition is reached, the encoder completing training is determined.
10. The method of claim 1, the step of identifying visual attribute information of the target object based on the image data, comprising:
inputting the image data into a pre-trained picture classification model, and obtaining visual attribute information of the target object through classification and identification of the picture classification model;
the training mode of the image classification model comprises the following steps:
acquiring a second sample set, wherein the second sample set comprises a plurality of sample objects, and each sample object carries sample image data and a sample description text;
identifying each sample description text, and determining sample visual attribute information of each sample object;
and training a preset classification model by using the sample image data and the sample visual attribute information of the plurality of sample objects to obtain the picture classification model.
11. The method of claim 10, wherein the step of obtaining the picture classification model by training a preset classification model using sample image data and sample visual attribute information of the plurality of sample objects comprises:
extracting second sample image data and second sample visual attribute information of a second sample object, wherein the second sample object is any sample object in the second sample set;
inputting the second sample image data into a preset classification model to obtain the predicted visual attribute information of the second sample object;
calculating a classification loss value of the preset classification model according to the second sample visual attribute information and the predicted visual attribute information of the second sample object;
adjusting the model parameters of the preset classification model according to the classification loss value, and returning to execute the step of extracting the second sample image data and the second sample visual attribute information of the second sample object;
and under the condition that a third training stopping condition is reached, obtaining the image classification model after training is finished.
12. A text generation apparatus comprising:
the image-text data acquisition module is configured to acquire image-text data of a target object, wherein the image-text data comprises image data and text data;
an identification module configured to identify visual attribute information of the target object based on the image data, wherein the visual attribute information characterizes an explicit feature of the target object;
a determination module configured to determine an object attribute set of the target object according to the text data and the visual attribute information;
a generating module configured to generate a target description text of the target object based on the set of object attributes.
13. A computing device, comprising:
a memory and a processor;
the memory is for storing computer-executable instructions and the processor is for executing the computer-executable instructions which, when executed by the processor, implement the steps of the text generation method of any one of claims 1 to 11.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the text generation method of any one of claims 1 to 11.
CN202211048016.0A 2022-08-30 2022-08-30 Text generation method and device Pending CN115496550A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211048016.0A CN115496550A (en) 2022-08-30 2022-08-30 Text generation method and device
PCT/CN2023/114514 WO2024046189A1 (en) 2022-08-30 2023-08-23 Text generation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211048016.0A CN115496550A (en) 2022-08-30 2022-08-30 Text generation method and device

Publications (1)

Publication Number Publication Date
CN115496550A true CN115496550A (en) 2022-12-20

Family

ID=84466461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211048016.0A Pending CN115496550A (en) 2022-08-30 2022-08-30 Text generation method and device

Country Status (2)

Country Link
CN (1) CN115496550A (en)
WO (1) WO2024046189A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778011A (en) * 2023-05-22 2023-09-19 阿里巴巴(中国)有限公司 Image generating method
WO2024046189A1 (en) * 2022-08-30 2024-03-07 阿里巴巴(中国)有限公司 Text generation method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114168777A (en) * 2020-09-10 2022-03-11 阿里巴巴集团控股有限公司 Image data processing method and device, storage medium and processor
CN114005012A (en) * 2021-11-05 2022-02-01 北京市商汤科技开发有限公司 Training method, device, equipment and storage medium of multi-mode pre-training model
CN115496550A (en) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 Text generation method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024046189A1 (en) * 2022-08-30 2024-03-07 阿里巴巴(中国)有限公司 Text generation method and apparatus
CN116778011A (en) * 2023-05-22 2023-09-19 阿里巴巴(中国)有限公司 Image generating method
CN116778011B (en) * 2023-05-22 2024-05-24 阿里巴巴(中国)有限公司 Image generating method

Also Published As

Publication number Publication date
WO2024046189A1 (en) 2024-03-07

Similar Documents

Publication Publication Date Title
CN115496550A (en) Text generation method and device
Cosatto et al. Lifelike talking faces for interactive services
KR102509666B1 (en) Real-time face replay based on text and audio
CN111325571B (en) Automatic generation method, device and system for commodity comment labels for multitask learning
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
US11582519B1 (en) Person replacement utilizing deferred neural rendering
CN112738557A (en) Video processing method and device
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN114913303A (en) Virtual image generation method and related device, electronic equipment and storage medium
CN114880441A (en) Visual content generation method, device, system, equipment and medium
CN108550173A (en) Method based on speech production shape of the mouth as one speaks video
CN116704085B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN114241558A (en) Model training method, video generation method, device, equipment and medium
CN115661829A (en) Image-text recognition method and data processing method of image-text recognition model
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
WO2022166840A1 (en) Face attribute editing model training method, face attribute editing method and device
US20220375223A1 (en) Information generation method and apparatus
CN116737150A (en) Page generation method and device
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
KR102120936B1 (en) System for providing customized character doll including smart phone
Cakir et al. Audio to video: Generating a talking fake agent
Song et al. Virtual Human Talking-Head Generation
Kanakia et al. Designing a User-Friendly and Responsive AI based Image Generation Website and Performing Diversity Assessment of the Generated Images
CN117373455B (en) Audio and video generation method, device, equipment and storage medium
WO2024066549A1 (en) Data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination