CN118152617A - Method, device, electronic equipment and storage medium for configuring image for text

Info

Publication number
CN118152617A
Authority
CN
China
Prior art keywords
text
image
target image
user
matching
Prior art date
Legal status
Pending
Application number
CN202410198650.5A
Other languages
Chinese (zh)
Inventor
刘浩朋
杨秋歌
杜惠中
王�琦
张东亮
李耀鹏
张明月
刘世娇
刘鹏翔
杜玮宁
陆明月
刘勇
张昱峰
杜垚
王永佳
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202410198650.5A priority Critical patent/CN118152617A/en
Publication of CN118152617A publication Critical patent/CN118152617A/en
Pending legal-status Critical Current

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a method, an apparatus, an electronic device and a storage medium for configuring an image for text, and relates to the field of artificial intelligence, in particular to natural language processing, computer vision, deep learning, and the like. The method for configuring an image for text includes: acquiring a first text input by a user; in response to determining that the user has an intention to configure an image for the first text, outputting an image acquisition interface for acquiring a target image that matches the first text; determining the target image that matches the first text using a preset text-image matching policy; and outputting the target image in response to an operation of the image acquisition interface by the user.

Description

Method, device, electronic equipment and storage medium for configuring image for text
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the technical fields of natural language processing, computer vision, deep learning, and the like, and more particularly, to a method and apparatus for configuring an image for text, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technologies.
A large language model (Large Language Model, LLM, also known as a large model) is a deep learning model trained using large amounts of text data that can generate natural language text or understand the meaning of natural language text. Large language models can handle a variety of natural language tasks, such as text generation, text classification, question-answering, etc., and are an important approach to artificial intelligence.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product for configuring an image for text.
According to an aspect of the present disclosure, there is provided a method of configuring an image for text, including: acquiring a first text input by a user; in response to determining that the user has an intention to configure an image for the first text, outputting an image acquisition interface for acquiring a target image that matches the first text; determining the target image that matches the first text using a preset text-image matching policy; and outputting the target image in response to an operation of the image acquisition interface by the user.
According to an aspect of the present disclosure, there is provided an apparatus for configuring an image for text, including: an acquisition module configured to acquire a first text input by a user; an interface output module configured to output an image acquisition interface for acquiring a target image that matches the first text in response to determining that the user has an intention to configure an image for the first text; a matching module configured to determine the target image that matches the first text using a preset text-image matching policy; and an image output module configured to output the target image in response to an operation of the image acquisition interface by the user.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer program product comprising computer program instructions which, when executed by a processor, implement the above-described method.
According to one or more embodiments of the present disclosure, when a user edits text content, the user's image-matching needs can be automatically identified and a suitable image can be configured for the user, which reduces the complexity of the user's operations and improves the efficiency and quality of content editing.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a method of configuring an image for text in accordance with an embodiment of the present disclosure;
FIGS. 3A-3D illustrate schematic diagrams of a content editing interface according to embodiments of the present disclosure;
FIG. 4 shows a schematic diagram of a content editing process according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a text-composition image policy according to an embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of an AI text-to-image generation policy in accordance with an embodiment of the disclosure;
FIG. 7 shows a block diagram of an apparatus for configuring an image for text according to an embodiment of the present disclosure; and
Fig. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items. "plurality" means two or more.
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
Users can create content and distribute their self-created content to other users through the Internet, publications and the like. Text and images are common data types in user-created content.
In some cases, a user may wish to configure an image for text he has edited himself. For example, some information and social applications (apps) integrate a content posting function, i.e., a publisher. A user may edit text in the publisher, pair the text with an appropriate image to describe interesting things he has encountered, his current mood and so on, and publish the text together with the image. For another example, a user may wish to insert images at certain positions of an article while composing it, to support the textual description at the corresponding positions.
In the related art, a user has to configure an image for his own text by searching for image material or taking photographs himself. Finding or shooting pictures is cumbersome, the quality of the pictures is hard to guarantee, and the pictures may not match the text, which reduces the efficiency and quality of content editing.
In view of the foregoing, embodiments of the present disclosure provide a method for configuring an image for text. The method can automatically identify the user's image-matching needs while the user edits text content and configure a suitable image for the user, thereby reducing the user's operational complexity, ensuring that the configured image matches the text input by the user, and improving the efficiency and quality of content editing.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, client devices 101, 102, 103, 104, 105, and 106, and server 120 may run one or more services or software applications that enable execution of methods of configuring images for text.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The client devices 101, 102, 103, 104, 105, and/or 106 may provide interfaces that enable a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, vehicle-mounted devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux or Linux-like operating systems; or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs) and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. Gaming systems may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications) and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service expansibility in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to some embodiments, the client devices 101-106 may perform the methods of configuring images for text of embodiments of the present disclosure, automatically configuring appropriate images for text entered by a user. Specifically, the user may enter text by operating the client devices 101-106 (e.g., operating an input device such as a mouse, touch screen, etc.). The client devices 101-106 generate images that match text by performing the methods of configuring images for the text of embodiments of the present disclosure and output the images (e.g., via a display) to a user for selection by the user. It should be noted that the client devices 101-106 may implement the method for configuring images for text according to the embodiments of the present disclosure by calling a service interface provided by a server. These interfaces include, for example, a keyword extraction interface, an emotion recognition interface, a large language model interface, an image search interface, an AI image generation interface, and the like.
According to some embodiments, server 120 may also perform methods of configuring images for text according to embodiments of the present disclosure. Specifically, the user may enter text by operating the client devices 101-106 (e.g., operating an input device such as a mouse, touch screen, etc.). The client devices 101-106 send text entered by the user to the server 120. The server 120 generates an image that matches text by performing the method of configuring an image for the text of an embodiment of the present disclosure and outputs the image to the client devices 101-106. The client devices 101-106 further output the images (e.g., via a display) to a user for selection by the user.
Fig. 2 illustrates a flow chart of a method 200 of configuring an image for text in accordance with an embodiment of the present disclosure. As described above, the subject of execution of method 200 may be a client device, such as client devices 101-106 shown in FIG. 1; or may be a server, such as server 120 shown in fig. 1.
As shown in fig. 2, the method 200 includes steps S210-S240.
In step S210, a first text input by a user is acquired.
In step S220, in response to determining that the user has an intention to configure the image for the first text, an image acquisition interface for acquiring a target image that matches the first text is output.
In step S230, a target image that matches the first text is determined using a preset text-image matching policy.
In step S240, a target image is output in response to the user' S operation of the image acquisition interface.
According to the embodiments of the present disclosure, when the user edits text content, the user's image-matching need is automatically identified and a suitable image is configured for the user, which reduces the complexity of the user's operations, ensures that the configured image matches the text input by the user, and improves the efficiency and quality of content editing.
The steps of method 200 are described in detail below.
According to some embodiments, in step S210, a first text input by a user in a content editing interface may be acquired. The content editing interface may be any Graphical User Interface (GUI) for content editing, including but not limited to a publisher interface in information or social applications, an editing interface in a notepad application, and the like.
Fig. 3A shows a schematic diagram of a content editing interface according to an embodiment of the present disclosure. The content editing interface is a publisher interface in an application that includes a text editing area 302, an image editing control 304, and function controls 306, 308 and 310. The user inputs the first text "Autumn is so beautiful" by operating within the text editing area 302. The user may also select an image from a local library or take a photograph and add it to the content editing interface by operating the image editing control 304. By clicking the preview button 308, the user can combine the first text and the image in the content editing interface into authored content and simulate the display effect of the published content for preview. The user may publish the authored content for presentation to other users by clicking the publish button 306. The user may cancel the currently edited content and exit the content editing interface by clicking the cancel button 310.
According to some embodiments, after the first text input by the user is acquired through step S210, it may be further determined whether the user has an intention to configure an image (i.e., a map) for the first text.
According to some embodiments, it may be determined whether the user has an intention to configure an image for the first text based on the input behavior data of the user. In this way, the user's potential image-matching need can be accurately identified.
The input behavior data of the user includes feature data of the input behavior of the user in the content editing interface and feature data of a result (i.e., first text) generated by the input behavior, such as a time period after the user enters the content editing interface (i.e., a difference between a current time and a time when the user enters the content editing interface), a time period when the user edits the first text (i.e., a difference between a time when the user stops editing the first text and a time when the user starts editing the first text), a stay time period after the user inputs the first text (i.e., a difference between a current time and a time when the user stops editing the first text), a length of the first text (i.e., a number of characters included in the first text), a content of the first text, and the like.
According to some embodiments, it may be determined whether the user has an intention to configure an image for the first text based on the dwell time after the user inputs the first text. For example, in response to the dwell time after the user inputs the first text being greater than a first threshold, it is determined that the user has an intention to configure an image for the first text. The first threshold may be, for example, 3 seconds, 5 seconds, etc. A longer dwell time after the user inputs the first text may occur because the user is considering how to phrase the first text, or is thinking about or looking for an image that matches the first text. The dwell time after the user inputs the first text can therefore reflect the user's image-matching need, so the user's potential need can be identified quickly and accurately.
According to some embodiments, it may be determined whether the user has an intention to configure an image for the first text based on both the dwell time after the user inputs the first text and the length of the first text. For example, in response to the dwell time after the user inputs the first text being greater than a first threshold and the length of the first text being greater than a second threshold, it is determined that the user has an intention to configure an image for the first text. The first threshold may be, for example, 3 seconds, 5 seconds, etc. The second threshold may be, for example, 4, 5, etc. A longer first text generally means that the semantics of the first text are relatively complete and precise, which helps ensure that the semantics of the target image determined in the subsequent step S230 match the semantics of the first text and improves the quality of the target image.
According to the above embodiment, combining the dwell time after the user inputs the first text with the length of the first text to determine whether the user has an intention to configure an image for the first text makes it possible to identify the image-matching need quickly and accurately while ensuring the degree of text-image matching and improving the quality of the configured target image.
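As an illustration of the dwell-time and length check described above, the following Python sketch shows one possible implementation; the function and threshold names are assumptions for illustration and are not part of the disclosure.

```python
import time

# Illustrative values; the disclosure mentions thresholds such as 3-5 seconds
# for dwell time and 4-5 characters for text length.
DWELL_THRESHOLD_S = 3.0   # first threshold
MIN_TEXT_LENGTH = 4       # second threshold

def has_image_intent(first_text: str, last_edit_time: float, now: float | None = None) -> bool:
    """Return True if the user is assumed to intend to configure an image for the text."""
    now = time.time() if now is None else now
    dwell_time = now - last_edit_time  # dwell time after the user stopped editing
    return dwell_time > DWELL_THRESHOLD_S and len(first_text) > MIN_TEXT_LENGTH
```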
According to some embodiments, it may be determined whether the user has an intention to configure an image for the first text based on the content of the first text. For example, the first text may be input into a trained intent recognition model to obtain a determination of the intent recognition model output. The intent recognition model may be any neural network model, such as a text classification model, a large language model, and the like.
In step S220, in response to determining that the user has an intention to configure the image for the first text, an image acquisition interface for acquiring a target image that matches the first text is output.
The image acquisition interface is a control for acquiring a target image, and may be, for example, a button, a pop-up window, or the like. The image acquisition interface can interact with the user. When the user performs a preset operation (such as clicking, dragging or long pressing) on the image acquisition interface, the content editing interface pops up an image-matching panel in which one or more target images matching the first text are displayed. It will be appreciated that the image acquisition interface is an entry to the image-matching panel, i.e., an image-matching entry.
Still taking FIG. 3A as an example, the first threshold corresponding to the content editing interface shown in FIG. 3A is 3 seconds, and the second threshold is 4. The first text entered by the user, "Autumn is so beautiful" (five characters in the original), has a length greater than the second threshold 4. If the user does not make any further input within 3 seconds after entering the first text, the user is considered to have an intention to configure an image for the first text, and an image acquisition interface 312 for acquiring a target image matching the first text is displayed in the content editing interface, as shown in FIG. 3B. The image acquisition interface 312 includes a button 314. The user may obtain a target image that matches the first text by clicking the button 314.
In step S230, a target image that matches the first text is determined using a preset text-image matching policy.
It should be noted that the present disclosure does not limit the execution order of step S230 and step S220. Step S220 (outputting the image acquisition interface) and step S230 (determining the target image) may be performed sequentially or in parallel. That is, in embodiments of the present disclosure, the image acquisition interface may be output first and the target image determined afterwards; the target image may be determined first and the image acquisition interface output afterwards; or the target image may be determined while the image acquisition interface is being output.
Matching the target image with the first text means that the target image has the same or similar semantics as the first text.
There may be one or more text-image matching policies. The text-image matching policies include, for example, an image search policy, an image generation policy, and the like. The image search policy is used to search a preset gallery for a target image that matches the first text. There may be one or more image search policies, and different image search policies may use different galleries and search algorithms. The image generation policy is used to generate a new image as the target image, conditioned on the first text. There may be one or more image generation policies, and different image generation policies may use different image generation algorithms, such as rendering the first text itself as an image according to certain rules (i.e., a text-composition image) or generating an image with a trained text-to-image model (i.e., an AI-generated image).
According to some embodiments, where there are multiple text-image matching policies, different policies may be invoked at different times. For example, the text-image matching policies may include a first matching policy and a second matching policy. The first matching policy may be invoked before or while the image acquisition interface is output, and the second matching policy may be invoked after the image acquisition interface is output or, further, after the user operates the image acquisition interface. In this way, flexible and efficient text-image matching can be achieved, which improves computing efficiency and saves computing resources.
According to some embodiments, step S230 may include steps S232 and S234.
In step S232, a first target image that matches the first text is determined using a first matching policy before the user operates the image acquisition interface.
In step S234, in response to the user' S operation of the image acquisition interface, a second target image that matches the first text is determined using a second matching policy. The resource consumption of the second matching policy is higher than the resource consumption of the first matching policy.
The user operating the image acquisition interface means that the user has issued an explicit instruction to configure an image for the first text (i.e., an image-matching instruction). According to this embodiment, when the user's potential image-matching need is detected but no explicit instruction has yet been received, the lower-cost first matching policy (AI image search) is used to match images to the user's text, so that a quick response is possible once the user's instruction is later received. Only after the user's explicit image-matching instruction is obtained is the higher-cost second matching policy (AI image generation) invoked, which avoids unnecessary computation and saves computing resources.
For example, after it is determined that the user has an intention to configure an image for the first text, an image acquisition interface for acquiring a target image that matches the first text is output. At the same time, a first target image that matches the first text is determined using the first matching policy. At this point, the first target image is determined but not yet output. Then, in response to a user operation (e.g., a single click, a double click, a long press, etc.) on the image acquisition interface, an image-matching panel is displayed, the first target image is output in the panel, and, at the same time, a second target image that matches the first text is determined using the second matching policy and output into the panel.
The first matching policy may be, for example, an image search policy. The second matching policy may be, for example, an image generation policy, including a text-composition image policy, an AI text-to-image generation policy, and the like. The first matching policy and the second matching policy may be implemented by invoking corresponding service interfaces. The resource consumption of the first and second matching policies can be characterized by metrics such as the response time of a single request to the corresponding service interface and the number of requests processed per unit time. It will be appreciated that the higher the resource consumption of a matching policy, the greater its cost and the longer its computation time.
There may be one or more first matching policies and one or more second matching policies.
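A minimal sketch of how the two policies might be scheduled around the user's interaction is given below; the ui object, search_policy and generation_policy callables are placeholder assumptions, not APIs defined by the disclosure.

```python
import threading

def on_intent_detected(first_text, ui, search_policy):
    """Show the image acquisition interface and pre-compute the cheaper search result."""
    ui.show_image_entry()
    # The first (lower-cost) policy runs before the user clicks the entry,
    # so its results can be shown immediately later.
    ui.cached_search_images = search_policy(first_text)

def on_entry_clicked(first_text, ui, generation_policy):
    """Open the image-matching panel, show cached results, then start the costlier generation."""
    ui.open_image_panel(ui.cached_search_images)
    # The second (higher-cost) policy runs only after an explicit user operation.
    worker = threading.Thread(target=lambda: ui.append_images(generation_policy(first_text)))
    worker.start()
```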
Fig. 4 shows a schematic diagram of a content editing process according to an embodiment of the present disclosure. As shown in FIG. 4, the text-image matching policies are implemented by invoking the image-matching interfaces 410 of the server. The image-matching interfaces 410 include an AI image search interface 412, an AI image generation interface 414, and an AI image generation result polling interface 416. The AI image search interface 412 is used to implement the image search policy (the first matching policy). The AI image generation interface 414 is used to implement the image generation policy (the second matching policy). The AI image generation interface 414 can be invoked multiple times, with each invocation generating the same or a different target image. The AI image generation result polling interface 416 is used to obtain the target image generated by each invocation of the AI image generation interface 414.
In step S422, the user inputs text (i.e., the first text). If the user pauses input for 3 s after entering the text, the AI image search interface 412 is invoked to search the preset gallery for a first target image that matches the text, i.e., searched-image data.
Subsequently, in step S424, it is determined whether an AI image-matching entry is already displayed in the current content editing interface. If not, the user is inputting text in the content editing interface for the first time, and step S432 is executed to display the AI image-matching entry and associate the searched-image data with it, so that once the user later clicks the entry and the AI image-matching panel pops up, the searched-image data can be displayed in the panel. If yes, the user is inputting text in the content editing interface again, and step S426 is executed to play an animation of updating the data, indicating that the target image is being updated.
In step S428, the user clicks the AI image-matching entry.
Subsequently, in step S430, the AI image-matching panel is displayed in the content editing interface. By invoking the AI image generation interface 414 and the AI image generation result polling interface 416, a second target image that matches the first text, i.e., generated-image data, is produced.
Subsequently, in step S432, the searched-image data and the generated-image data are displayed in the AI image-matching panel.
According to some embodiments, the first matching policy is an image search policy. Accordingly, step S232 may include steps S2322 and S2324.
In step S2322, keywords of the first text are extracted.
In step S2324, a preset gallery is searched based on the keywords to obtain a first target image.
According to some embodiments, in step S2322, keywords of the first text may be extracted using a trained large language model or keyword extraction model. It is also possible to segment the first text and remove stop words (e.g., prepositions, conjunctions, punctuations, etc.), with the remaining words being used as keywords.
According to some embodiments, in step S2324, each image in the gallery may have corresponding descriptive text. The similarity between the keywords and the descriptive text of each image is calculated, and one or more images with the highest similarity are taken as the first target image.
According to some embodiments, in step S2324, each image in the gallery may have a respective vector representation. The vector representation of the keywords is calculated, the similarity between the vector representation of the keywords and the vector representation of each image is then calculated, and one or more images with the highest similarity are taken as the first target image.
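The vector-based variant described above can be sketched as follows; the embed function and the precomputed "vector" field on each gallery entry are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def search_gallery(keywords: list[str], gallery: list[dict], embed, top_k: int = 2) -> list[dict]:
    """Rank gallery images by similarity between the keyword text and each image's
    precomputed vector representation, and return the top-k images."""
    query_vec = embed(" ".join(keywords))  # vector representation of the keywords
    scored = [(cosine_similarity(query_vec, item["vector"]), item) for item in gallery]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]
```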
According to some embodiments, the second matching policy comprises at least one image generation policy. Accordingly, step S234 may include: for each of the at least one image generation policy, a second target image is generated that matches the first text using the image generation policy. Each image generation policy may generate one or more second target images.
The image generation policy may be, for example, a text-composition image policy, an AI text-to-image generation policy, or the like.
According to some embodiments, the image generation policy may be a text-composition image policy. Accordingly, step S234 may include steps S2341-S2343.
In step S2341, a visual style of the text element in the second target image is determined based on the number of characters included in the first text.
In step S2342, a visual style of the non-text element in the second target image is determined based on the emotion type of the first text.
In step S2343, the text element and the non-text element are rendered based on the respective visual styles, respectively, to generate a second target image.
According to the above embodiment, the first text can be presented as the second target image. Since the second target image contains the first text in image form, the semantics of the second target image are guaranteed to be the same as those of the first text, so a target image that matches the first text can be generated quickly.
According to some embodiments, the text elements in the second target image include second text generated based on the first text. In the case where the length of the first text is less than or equal to a third threshold (e.g., 20, 30, etc.), the first text may be regarded as the second text. In the case that the length of the first text is greater than the third threshold, the first text may be refined, and the refined text is used as the second text. According to some embodiments, the first text may be refined using the trained large language model to obtain the second text, and keywords in the first text may also be used as the second text.
According to some embodiments, the visual style of the text element includes the font and font size (i.e., the size of the characters) of the second text. In step S2341, a preset correspondence between the number of characters and the visual style may be acquired, and the visual style of the second text is then determined based on the correspondence. For example, a table mapping ranges of character counts to fonts and a functional expression relating font size to character count may be acquired; the font of the second text (e.g., boldface, Song, regular script, etc.) is determined based on the table, and the font size of the second text is calculated based on the functional expression. Generally, the larger the number of characters in the second text, the smaller the font size.
According to some embodiments, the non-text elements in the second target image include at least one of a background and an emoticon. The visual styles of the non-text elements include the color and pattern of the background, and the type, size and position of the emoticon. In step S2342, the emotion type of the first text may be identified using a trained large language model or emotion classification model, and the visual style of the non-text elements in the second target image may be determined based on a preset correspondence between emotion types and visual styles of non-text elements. Specifically, by inputting the first text into the large language model or the emotion classification model, the emotion type output by the model can be obtained.
For example, through emotion recognition on the first text, it is determined that the emotion type of the first text is "happy". The background color set corresponding to "happy" is {red, pink, orange}, and the corresponding emoticon type set is a set of happy-themed emoticons (shown as images in the original). The size of the emoticon is proportional to the font size of the second text, and the set of candidate emoticon positions is {around the second text, in the middle of the second text, after the word with the most explicit emotional meaning}. Accordingly, one element of the background color set may be selected as the background color of the second target image, one of the emoticon types may be selected as the emoticon used in the second target image, the size of the emoticon may be determined based on the font size of the second text, and one of the candidate positions may be selected as the position of the emoticon.
In step S2343, a second target image is generated by combined rendering of the text element and the non-text element.
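A minimal sketch of such a text-composition rendering step, using the Pillow imaging library, is shown below; the style table, font choice and sizing rule are assumptions for illustration rather than the disclosure's preset correspondences.

```python
from PIL import Image, ImageDraw, ImageFont

# Assumed emotion-to-style table; the disclosure only states that such correspondences are preset.
EMOTION_STYLES = {"happy": {"background": (255, 160, 122), "emoticon": ":)"}}

def compose_text_image(second_text: str, emotion: str, size=(600, 600)) -> Image.Image:
    style = EMOTION_STYLES.get(emotion, {"background": (240, 240, 240), "emoticon": ""})
    image = Image.new("RGB", size, style["background"])      # background color from emotion type
    draw = ImageDraw.Draw(image)
    font_size = max(24, 120 - 2 * len(second_text))          # more characters -> smaller font
    font = ImageFont.truetype("DejaVuSans.ttf", font_size)   # font file path is an assumption
    draw.text((size[0] // 2, size[1] // 2),
              second_text + " " + style["emoticon"],         # emoticon placed after the text
              font=font, anchor="mm", fill=(30, 30, 30))
    return image
```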
Fig. 5 shows a schematic diagram of a text-composition image policy according to an embodiment of the present disclosure.
As shown in fig. 5, in step S502, a first text input by a user is acquired.
In step S504, it is determined whether the number of characters included in the first text is greater than a third threshold. If yes, step S506 is executed to refine the first text, take the refined text as the second text, and determine the visual style thereof. If not, step S508 is executed to directly take the first text as the second text and determine the visual style thereof.
In step S510, emotion recognition is performed on the first text to determine an emotion type of the first text.
In step S512, based on the emotion type of the first text, a visual style of the background and the emoticon is determined.
In step S514, the second text, the background and the emoticon are combined based on their respective visual styles to obtain the second target image.
According to some embodiments, the image generation policy may be an AI text-to-image generation policy. Accordingly, step S234 may include steps S2344 and S2345.
In step S2344, a first prompt text is generated based on the first text for guiding a trained text-to-image model to generate the second target image.
In step S2345, the first prompt text is input into the text-to-image model to obtain the second target image output by the text-to-image model.
According to this embodiment, generating the second target image with a text-to-image model makes the content of the second target image more flexible and diverse while ensuring that its semantics are the same as or similar to those of the first text.
According to some embodiments, in step S2344, the first text may be rewritten with a trained large language model to obtain the first prompt text. In this embodiment, the language understanding and generation capabilities of the large language model enable intelligent rewriting that deeply understands the user's image generation need, so the rewritten first prompt text matches the user's need; this ensures that the generated second target image matches the first text and improves the accuracy of image generation.
According to some embodiments, in step S2344, the first text may be directly input into the large language model to obtain the first prompt text output by the large language model.
According to some embodiments, in step S2344, a preset prompt template may be acquired, and the first text is filled into a blank slot of the prompt template to generate a query text. The query text is then input into the large language model to obtain the first prompt text output by the large language model. In this embodiment, the prompt template guides the large language model in rewriting the first text, which improves the rewriting effect of the large language model, ensures that the rewritten first prompt text matches the user's need, and improves the accuracy of the rewriting.
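For illustration, a template-based rewrite of this kind could look like the following sketch; the template wording and the llm callable are assumptions, not text defined by the disclosure.

```python
# Hypothetical prompt template with a blank slot for the user's text.
PROMPT_TEMPLATE = (
    "You are writing a prompt for a text-to-image model. "
    "Rewrite the following user text into a detailed image description: {first_text}"
)

def build_first_prompt(first_text: str, llm) -> str:
    """Fill the template's slot with the first text and let a large language model
    rewrite it into the first prompt text for the text-to-image model."""
    query_text = PROMPT_TEMPLATE.format(first_text=first_text)
    return llm(query_text)  # llm is any callable wrapping a large language model service
```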
According to some embodiments, step S2344 may further include steps S23441-S23444.
In step S23441, keywords of the first text are extracted.
In step S23442, a style type of the second target image is determined based on the keywords.
In step S23443, a second prompt text to be input into the large language model is generated based on the keywords and the style type, using a preset prompt template.
In step S23444, the second prompt text is input into the large language model to obtain the first prompt text, output by the large language model, that is to be input into the text-to-image model.
According to this embodiment, the first prompt text is generated by combining the text content with the image style, so the user's image generation need can be deeply understood and the first prompt text matches that need, thereby ensuring that the second target image generated based on the first prompt text matches the semantics of the first text and improving the accuracy of image generation.
The specific embodiment of step S23441 may refer to step S2322 described above, and will not be described herein.
The style type of the second target image may be, for example, fresh, humorous, aesthetic, sentimental, and the like.
According to some embodiments, in step S23442, the style type of the second target image may be determined using a trained large language model or style classification model. For example, the keywords are input into the large language model or the style classification model to obtain the style type output by the model.
According to some embodiments, in step S23442, a preset set of style types may be obtained. The similarity between the keywords and each style type in the set is calculated using a text classification model (e.g., a BERT model), and the style type with the highest similarity is used as the style type of the second target image.
According to some embodiments, the prompt template obtained in step S23443 includes guide text for guiding the large language model to generate the first prompt text, together with slots to be filled in. The keywords and the style type are filled into the corresponding slots to generate the second prompt text. The second prompt text is then input into the large language model to obtain the first prompt text, which is to be input into the text-to-image model.
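A sketch of the keyword-plus-style variant follows; the similarity-based style selection mirrors the description above, while the second-prompt template wording, the embed function and the llm callable are assumptions.

```python
import numpy as np

SECOND_PROMPT_TEMPLATE = ("Write a prompt for a text-to-image model about '{keywords}' "
                          "in a {style} visual style.")

def pick_style(keywords: list[str], style_types: list[str], embed) -> str:
    """Choose the preset style type most similar to the keywords."""
    query = embed(" ".join(keywords))
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(style_types, key=lambda s: cosine(query, embed(s)))

def build_first_prompt_with_style(keywords: list[str], style_types: list[str], embed, llm) -> str:
    style = pick_style(keywords, style_types, embed)
    second_prompt = SECOND_PROMPT_TEMPLATE.format(keywords=", ".join(keywords), style=style)
    return llm(second_prompt)  # the large language model returns the first prompt text
```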
After the first prompt text is obtained in step S2344, step S2345 is performed: the first prompt text is input into the text-to-image model to obtain the second target image output by the text-to-image model.
The text-to-image model may be, for example, a Diffusion Model (DM), a Generative Adversarial Network (GAN), a Variational Autoencoder (VAE), or the like.
There may be one or more text-to-image models, and each text-to-image model may be invoked one or more times. The second target images produced by different text-to-image models, or by different invocations of the same model, are typically different.
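As one concrete possibility (not the implementation claimed by the disclosure), a diffusion-based text-to-image model can be invoked through the open-source diffusers library; the model checkpoint named below is only an example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Example only: any diffusion model, GAN or VAE-based text-to-image model could be used here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_second_target_image(first_prompt_text: str):
    # Each invocation may yield a different image, consistent with the note that
    # repeated calls to the same model typically produce different results.
    return pipe(first_prompt_text).images[0]
```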
According to some embodiments, step S234 may further include steps S2346 and S2347.
In step S2346, the semantic type of the first text is identified.
In step S2347, a second target image that matches the first text is determined using a second matching policy in response to the semantic type satisfying the preset condition.
The second matching policy may be, for example, an AI text-to-image generation policy implemented with a text-to-image model. For certain kinds of text, such as abstract text that contains no entity words (e.g., "philosophical thinking") or text of a particular semantic type (e.g., the "entertainment" type), it is difficult for an AI text-to-image model to accurately generate images that match the semantics of the text, so the generated images do not meet the user's needs. According to this embodiment, the matching requests can be filtered by the semantic type of the text, which ensures the accuracy of image generation, avoids unnecessary generation requests, and saves computing resources.
The semantic type of the first text may be, for example, scenery, weather, entertainment, sports, etc.
According to some embodiments, in step S2346, the semantic type of the first text may be determined using a trained large language model or semantic classification model. For example, the first text or its keywords are input into the large language model or the semantic classification model to obtain the semantic type output by the model.
According to some embodiments, in step S2346, a preset set of semantic types may be obtained. The similarity between the first text (or its keywords) and each semantic type in the set is calculated using a text classification model (e.g., a BERT model), and the semantic type with the highest similarity is used as the semantic type of the first text.
According to some embodiments, in step S2347, a preset whitelist or blacklist of semantic types may be obtained. In response to the semantic type of the first text being in the whitelist, or not being in the blacklist, a second target image that matches the first text is determined using the second matching policy (e.g., the AI text-to-image generation policy). For the specific way of determining the second target image with the second matching policy, reference may be made to step S234 described above, which is not repeated here.
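A minimal sketch of the whitelist check, assuming a placeholder semantic classifier and an illustrative whitelist:

```python
SEMANTIC_WHITELIST = {"scenery", "weather", "sports"}  # illustrative; the actual list is preset

def should_invoke_generation(first_text: str, classify_semantic_type) -> bool:
    """Invoke the costlier AI text-to-image policy only when the semantic type qualifies."""
    semantic_type = classify_semantic_type(first_text)
    return semantic_type in SEMANTIC_WHITELIST
```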
Fig. 6 shows a schematic diagram of an AI text-to-image generation policy according to an embodiment of the disclosure.
As shown in fig. 6, in step S602, a first text input by a user is acquired.
In step S604, the first text is semantically understood to determine its semantic type. If the semantic type of the first text is in the white list or not in the black list, step S606 is performed.
In step S606, keywords of the first text are extracted.
In step S608, the style type of the second target image to be generated is determined based on the keyword.
In step S610, a prompt text (i.e., the first prompt text) to be input into the text-to-image model is generated using the large language model, based on the keywords and the style type.
In step S612, the text-to-image model is invoked based on the prompt text to generate the second target image.
In step S240, a target image is output in response to the user' S operation of the image acquisition interface.
The image acquisition interface may be, for example, a button, a pop-up window, or the like. The image acquisition interface can interact with the user. The user operating the image acquisition interface means that the user has issued an explicit instruction to configure an image for the first text, i.e., an image-matching instruction. Upon detecting a specific operation (e.g., clicking, dragging, long pressing, etc.) on the image acquisition interface by the user, the content editing interface pops up an image-matching panel in which one or more target images matching the first text are displayed.
As described above, the target image may include a first target image generated using a first matching policy and a second target image generated using a second matching policy.
The first target image is generated before the user's operation of the image acquisition interface is detected (even before the image acquisition interface is displayed), so that once the user's operation is detected, the image-matching panel can be displayed and the first target image shown in it directly.
Because the second matching policy consumes more computing resources, the second target image is generated with the second matching policy and output only after the user's explicit image-matching instruction is obtained (i.e., after the user's operation of the image acquisition interface is detected and the image-matching panel is entered).
According to some embodiments, a user may select a target image (including a first target image and a second target image). The selected target image is combined with the first text entered by the user to form user authored content.
FIG. 3C shows a schematic diagram of the content editing interface after a user has operated the image acquisition interface. As shown in fig. 3C, after the user clicks the button 314 in the image acquisition interface 312, the map panel 316 is popped up in the content editing interface. The map panel 316 includes target images 318, 320, 322, 324. The target images 318 and 320 are first target images searched for using an image search policy (first matching policy) before the graphic panel 316 is popped up. Thus, in FIG. 3C, the target images 318 and 320 are displayed in their entirety. The target images 322 and 324 are second target images generated using the image generation policy after the click operation of the button 314 by the user is monitored. The target image 322 may be, for example, an image generated using a text group map policy, and the target image 324 may be, for example, an image generated using an AI text generation map policy. The target images 322 and 324 do not display image contents in an initial state, but display progress of image generation, for example, 20%, 50%, or the like. After the progress of image generation reaches 100%, the target images 322 and 324 will be displayed in full.
As shown in fig. 3C, the image panel 316 also includes a refresh button 328. The user may update the target images 318-324, i.e., regenerate the target images 318-324, by clicking the refresh button 328.
A selection box 326 is displayed on each target image. The user may select the corresponding target image by clicking its selection box 326, and the order in which the target images were selected by the user (e.g., 1, 2, 3, etc.) is displayed in the selection box. As shown in fig. 3C, the user has selected the target image 324. Further, by clicking the OK button 330 in the image panel 316, the image panel 316 can be closed and the target image 324 displayed in-line in the image editing control 304 as part of the user-authored content, as shown in FIG. 3D.
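A minimal sketch of this selection behavior is given below; the class name, data structures, and returned labels are illustrative assumptions rather than the disclosed implementation.

```python
class ImagePanelState:
    """Track which target images the user has selected and in which order."""

    def __init__(self, target_image_ids):
        self.target_image_ids = list(target_image_ids)
        self.selected = []                       # selection order: first pick, second pick, ...

    def toggle(self, image_id):
        # Clicking a selection box selects (or deselects) the image; the label shown
        # in each box is the 1-based position of that image in the selection order.
        if image_id in self.selected:
            self.selected.remove(image_id)
        else:
            self.selected.append(image_id)
        return {img: index + 1 for index, img in enumerate(self.selected)}

    def confirm(self, first_text):
        # Clicking OK closes the panel and combines the selected images with the
        # first text to form the user-authored content.
        return {"text": first_text, "images": list(self.selected)}
```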
According to an embodiment of the present disclosure, there is also provided an apparatus for configuring an image for text. Fig. 7 shows a block diagram of an apparatus 700 for configuring an image for text according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 includes an acquisition module 710, an interface output module 720, a matching module 730, and an image output module 740.
The acquisition module 710 is configured to acquire a first text entered by a user.
The interface output module 720 is configured to output an image acquisition interface for acquiring a target image matching the first text in response to determining that the user has an intention to configure an image for the first text.
The matching module 730 is configured to determine a target image that matches the first text using a preset image-text matching policy.
The image output module 740 is configured to output the target image in response to an operation of the image acquisition interface by the user.
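By way of illustration only, the cooperation of the four modules may be sketched as follows; the class, the callable signatures, and the intent_check hook are assumptions introduced for the example.

```python
class ConfigureImageApparatus:
    """Sketch of apparatus 700 as four cooperating callables."""

    def __init__(self, acquisition, interface_output, matching, image_output, intent_check):
        self.acquisition = acquisition            # module 710: acquire the first text
        self.interface_output = interface_output  # module 720: show the acquisition interface
        self.matching = matching                  # module 730: image-text matching
        self.image_output = image_output          # module 740: output the target images
        self.intent_check = intent_check          # e.g. a dwell-time / text-length heuristic

    def run(self):
        first_text = self.acquisition()
        if self.intent_check(first_text):
            self.interface_output(first_text)
            target_images = self.matching(first_text)
            self.image_output(target_images)
```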
According to embodiments of the present disclosure, when the user edits text content, the user's image-matching need is automatically identified and a suitable image is configured for the user. This reduces the user's operational complexity, ensures that the configured image matches the text input by the user, and improves content editing efficiency and quality.
According to some embodiments, the matching module comprises: a first matching unit configured to determine a first target image matching the first text using a first matching policy before the user operates the image acquisition interface; and a second matching unit configured to determine a second target image matching the first text using a second matching policy in response to an operation of the image acquisition interface by the user, wherein a resource consumption of the second matching policy is higher than a resource consumption of the first matching policy.
According to some embodiments, the first matching policy is an image search policy, and the first matching unit includes: an extraction subunit configured to extract keywords of the first text; and a searching subunit configured to search a preset gallery based on the keyword to obtain the first target image.
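A minimal sketch of such an image search policy is given below, assuming a toy keyword extractor based on word frequency and a preset gallery represented as a mapping from image identifiers to tag sets; a real implementation would use the keyword extraction described above rather than this frequency count.

```python
import re
from collections import Counter

def search_first_target_images(first_text, gallery, top_k=2):
    """Return ids of gallery images whose tags overlap the extracted keywords."""
    tokens = re.findall(r"\w+", first_text.lower())
    keywords = {word for word, _ in Counter(tokens).most_common(5)}   # toy keyword extraction
    scored = [(len(keywords & tags), image_id) for image_id, tags in gallery.items()]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)   # best tag overlap first
    return [image_id for score, image_id in ranked[:top_k] if score > 0]

# e.g. search_first_target_images("sunset over the sea",
#                                 {"img1": {"sunset", "sea"}, "img2": {"city"}})
```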
According to some embodiments, the second matching policy comprises at least one image generation policy, the second matching unit being further configured to: for each of the at least one image generation policy, generating a second target image matching the first text using the image generation policy.
According to some embodiments, the second matching unit comprises: a first determination subunit configured to determine a visual style of a text element in the second target image based on a number of characters included in the first text; a second determination subunit configured to determine a visual style of a non-text element in the second target image based on the emotion type of the first text; and a rendering subunit configured to render the text element and the non-text element based on the respective visual styles, respectively, to generate the second target image.
According to some embodiments, the text element includes a second text generated based on the first text; the non-text element includes at least one of a background and an emoticon.
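As a non-limiting illustration, the rendering-based generation policy may be sketched as follows; the character-count threshold, font sizes, backgrounds, and emoticons are invented values for the example only.

```python
def render_second_target_image(first_text, emotion_type):
    """Choose visual styles, then return a render description for the second target image."""
    n_chars = len(first_text)
    # Visual style of the text element is driven by the number of characters.
    text_style = {"font_size": 48 if n_chars < 20 else 28,
                  "layout": "centered" if n_chars < 20 else "paragraph"}
    # Visual style of the non-text elements is driven by the emotion type.
    non_text_style = {
        "positive": {"background": "warm", "emoticon": "smile"},
        "negative": {"background": "muted", "emoticon": "rain"},
    }.get(emotion_type, {"background": "neutral", "emoticon": None})
    return {"text_element": {"content": first_text, **text_style},
            "non_text_elements": non_text_style}
```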
According to some embodiments, the second matching unit comprises: a first generation subunit configured to generate, based on the first text, a first prompt text for guiding a trained text-to-image model to generate the second target image; and a second generation subunit configured to input the first prompt text into the text-to-image model to obtain the second target image output by the text-to-image model.
According to some embodiments, the first generation subunit is further configured to: rewrite the first text by using the trained language model to obtain the first prompt text.
According to some embodiments, the first generating subunit comprises: an extraction subunit configured to extract keywords of the first text; a third determination subunit configured to determine a style type of the second target image based on the keyword; a third generation subunit configured to generate a second prompt text to be input into the language model by using a preset prompt template based on the keyword and the style type; and a fourth generation subunit configured to input the second prompt text into the language model to obtain the first prompt text output by the language model.
According to some embodiments, the second matching unit comprises: an identification subunit configured to identify a semantic type of the first text; and a fifth determining subunit configured to determine, in response to the semantic type satisfying a preset condition, a second target image matching the first text using a second matching policy.
According to some embodiments, the apparatus 700 further comprises: a determination module configured to determine, based on the user's input behavior data, whether the user has an intention to configure an image for the first text.
According to some embodiments, the determination module is further configured to: determine that the user has an intent to configure an image for the first text in response to the dwell time after the user inputs the first text being greater than a first threshold and the length of the first text being greater than a second threshold.
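A minimal sketch of this intent determination is given below; the 2-second dwell threshold and 20-character length threshold are illustrative values, not values taken from the disclosure.

```python
import time

class IntentDetector:
    """Dwell time above a first threshold and text length above a second threshold."""

    def __init__(self, dwell_threshold_s=2.0, length_threshold=20):
        self.dwell_threshold_s = dwell_threshold_s   # first threshold (seconds)
        self.length_threshold = length_threshold     # second threshold (characters)
        self._last_input_ts = time.monotonic()

    def on_text_input(self):
        # Reset the dwell timer whenever the user keeps typing.
        self._last_input_ts = time.monotonic()

    def has_image_intent(self, first_text):
        dwell = time.monotonic() - self._last_input_ts
        return dwell > self.dwell_threshold_s and len(first_text) > self.length_threshold
```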
It should be appreciated that the various modules and units of the apparatus 700 shown in fig. 7 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to method 200 are equally applicable to apparatus 700 and the modules and units comprising the same. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module.
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various units described above with respect to fig. 7 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the units may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these units may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the modules 710-740 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a central processing unit (CPU), a microcontroller, a microprocessor, a digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform a method of configuring an image for text in accordance with an embodiment of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method of configuring an image for text of an embodiment of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product comprising computer program instructions which, when executed by a processor, implement a method of configuring an image for text of an embodiment of the present disclosure.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 808 may include, but is not limited to, magnetic disks and optical disks. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices over computer networks, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely illustrative embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is to be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (27)

1. A method of configuring an image for text, comprising:
acquiring a first text input by a user;
responsive to determining that the user has an intent to configure an image for the first text, outputting an image acquisition interface for acquiring a target image that matches the first text;
determining a target image that matches the first text by using a preset image-text matching policy; and
outputting the target image in response to the operation of the image acquisition interface by the user.
2. The method of claim 1, wherein the determining, using a preset image-text matching policy, a target image that matches the first text comprises:
determining a first target image that matches the first text by using a first matching policy before the user operates the image acquisition interface; and
determining, in response to the operation of the image acquisition interface by the user, a second target image that matches the first text by using a second matching policy, wherein a resource consumption of the second matching policy is higher than a resource consumption of the first matching policy.
3. The method of claim 2, wherein the first matching policy is an image search policy, and the determining a first target image that matches the first text using the first matching policy comprises:
extracting keywords of the first text; and
searching a preset gallery based on the keywords to obtain the first target image.
4. The method of claim 2 or 3, wherein the second matching policy comprises at least one image generation policy, and the determining a second target image matching the first text using the second matching policy comprises:
For each of the at least one image generation policy, generating a second target image matching the first text using the image generation policy.
5. The method of claim 4, wherein the generating a second target image that matches the first text using the image generation policy comprises:
determining a visual style of a text element in the second target image based on a number of characters included in the first text;
determining a visual style of a non-text element in the second target image based on an emotion type of the first text; and
rendering the text element and the non-text element based on the respective visual styles, respectively, to generate the second target image.
6. The method of claim 5, wherein:
the text element includes a second text generated based on the first text; and
the non-text element includes at least one of a background and an emoticon.
7. The method of claim 4, wherein the generating a second target image that matches the first text using the image generation policy comprises:
generating, based on the first text, a first prompt text for directing a trained text-to-image model to generate the second target image; and
inputting the first prompt text into the text-to-image model to obtain the second target image output by the text-to-image model.
8. The method of claim 7, wherein the generating, based on the first text, a first prompt text for directing a trained text-to-image model to generate the second target image comprises:
rewriting the first text by using a trained language model to obtain the first prompt text.
9. The method of claim 8, wherein the rewriting the first text by using the trained language model to obtain the first prompt text comprises:
extracting keywords of the first text;
determining a style type of the second target image based on the keywords;
generating, based on the keywords and the style type, a second prompt text to be input into the language model by using a preset prompt template; and
inputting the second prompt text into the language model to obtain the first prompt text output by the language model.
10. The method of any of claims 2-9, wherein the determining a second target image that matches the first text using a second matching policy comprises:
Identifying a semantic type of the first text; and
determining, in response to the semantic type meeting a preset condition, a second target image that matches the first text by using the second matching policy.
11. The method of any of claims 1-10, further comprising:
determining, based on input behavior data of the user, whether the user has an intent to configure an image for the first text.
12. The method of claim 11, wherein the determining whether the user has an intent to configure an image for the first text based on the user's input behavior data comprises:
determining that the user has an intent to configure an image for the first text in response to a dwell time after the user inputs the first text being greater than a first threshold and a length of the first text being greater than a second threshold.
13. An apparatus for configuring an image for text, comprising:
an acquisition module configured to acquire a first text input by a user;
an interface output module configured to output an image acquisition interface for acquiring a target image matching the first text in response to determining that the user has an intention to configure an image for the first text;
a matching module configured to determine a target image that matches the first text by using a preset image-text matching policy; and
an image output module configured to output the target image in response to an operation of the image acquisition interface by the user.
14. The apparatus of claim 13, wherein the matching module comprises:
a first matching unit configured to determine a first target image that matches the first text by using a first matching policy before the user operates the image acquisition interface; and
a second matching unit configured to determine, in response to an operation of the image acquisition interface by the user, a second target image that matches the first text by using a second matching policy, wherein a resource consumption of the second matching policy is higher than a resource consumption of the first matching policy.
15. The apparatus of claim 14, wherein the first matching policy is an image search policy, the first matching unit comprising:
an extraction subunit configured to extract keywords of the first text; and
a searching subunit configured to search a preset gallery based on the keywords to obtain the first target image.
16. The apparatus of claim 14 or 15, wherein the second matching policy comprises at least one image generation policy, the second matching unit further configured to:
For each of the at least one image generation policy, generating a second target image matching the first text using the image generation policy.
17. The apparatus of claim 16, wherein the second matching unit comprises:
a first determination subunit configured to determine a visual style of a text element in the second target image based on a number of characters included in the first text;
a second determination subunit configured to determine a visual style of a non-text element in the second target image based on an emotion type of the first text; and
a rendering subunit configured to render the text element and the non-text element based on the respective visual styles, respectively, to generate the second target image.
18. The apparatus of claim 17, wherein:
the text element includes a second text generated based on the first text; and
the non-text element includes at least one of a background and an emoticon.
19. The apparatus of claim 16, wherein the second matching unit comprises:
a first generation subunit configured to generate, based on the first text, a first prompt text for guiding a trained text-to-image model to generate the second target image; and
a second generation subunit configured to input the first prompt text into the text-to-image model to obtain the second target image output by the text-to-image model.
20. The apparatus of claim 19, wherein the first generation subunit is further configured to:
rewrite the first text by using the trained language model to obtain the first prompt text.
21. The apparatus of claim 20, wherein the first generation subunit comprises:
an extraction subunit configured to extract keywords of the first text;
a third determination subunit configured to determine a style type of the second target image based on the keywords;
a third generation subunit configured to generate, based on the keywords and the style type, a second prompt text to be input into the language model by using a preset prompt template; and
a fourth generation subunit configured to input the second prompt text into the language model to obtain the first prompt text output by the language model.
22. The apparatus of any of claims 14-21, wherein the second matching unit comprises:
an identification subunit configured to identify a semantic type of the first text; and
a fifth determining subunit configured to determine, in response to the semantic type meeting a preset condition, a second target image matching the first text using the second matching policy.
23. The apparatus of any of claims 13-22, further comprising:
a determination module configured to determine, based on the user's input behavior data, whether the user has an intention to configure an image for the first text.
24. The apparatus of claim 23, wherein the determination module is further configured to:
determine that the user has an intent to configure an image for the first text in response to a dwell time after the user inputs the first text being greater than a first threshold and a length of the first text being greater than a second threshold.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-12.
27. A computer program product comprising computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1-12.