CN113205569A - Image drawing method and device, computer readable medium and electronic device - Google Patents

Image drawing method and device, computer readable medium and electronic device

Info

Publication number
CN113205569A
Authority
CN
China
Prior art keywords
image
candidate
voice
candidate image
voice instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110448969.5A
Other languages
Chinese (zh)
Inventor
董岩岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202110448969.5A
Publication of CN113205569A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/20 - Drawing from basic elements, e.g. lines or circles
    • G06T 11/206 - Drawing of charts or graphs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04845 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0487 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F 3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 - Hidden Markov Models [HMMs]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Abstract

The disclosure provides an image drawing method and device, a computer readable medium and an electronic device, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring an input voice instruction, and determining key feature information in the voice instruction; performing image drawing processing according to the key feature information to generate a candidate image list; and in response to a selection operation on the displayed candidate image list, taking the selected candidate image as a target image corresponding to the voice instruction to finish drawing the target image. According to the method and the device, image drawing can be completed with the assistance of the user's voice instruction, a new interaction mode is added, the interestingness of voice interaction is improved, and user experience is improved.

Description

Image drawing method and device, computer readable medium and electronic device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image rendering method, an image rendering apparatus, a computer-readable medium, and an electronic device.
Background
As living standards continue to improve, Natural Language Understanding (NLU) technology is receiving more and more attention. A voice assistant is an intelligent application built on natural language understanding technology, which helps users solve problems through the intelligent interaction of smart dialogue and instant question-and-answer.
At present, in the related technical solutions, either the voice assistant cannot complete the task of drawing an image according to the user's voice instruction at all, or an image supplied in advance must be modified by voice instructions to complete the drawing. On the one hand, when the voice assistant cannot complete image drawing according to the user's voice instruction, the user can only draw manually, and the interaction mode is single; on the other hand, when a user-provided image must be used for drawing, the flexibility of image drawing is low and the images that can be generated are limited.
Disclosure of Invention
The present disclosure is directed to an image drawing method, an image drawing apparatus, a computer-readable medium, and an electronic device, so as to provide, at least to a certain extent, an interaction mode in which image drawing can be completed through voice commands alone, thereby improving the interestingness of voice interaction and the flexibility and diversity of image drawing.
According to a first aspect of the present disclosure, there is provided an image drawing method including:
acquiring an input voice instruction, and determining key characteristic information in the voice instruction;
performing image drawing processing according to the key characteristic information to generate a candidate image list;
and in response to the selection operation of the displayed candidate image list, taking the selected candidate image as a target image corresponding to the voice instruction to finish drawing the target image.
According to a second aspect of the present disclosure, there is provided an image drawing apparatus comprising:
the characteristic information determining module is used for acquiring an input voice command and determining key characteristic information in the voice command;
the candidate image list generating module is used for performing image drawing processing according to the key characteristic information to generate a candidate image list;
and the target image determining module is used for responding to the selection operation of the displayed candidate image list, and taking the selected candidate image as the target image corresponding to the voice instruction so as to finish the drawing of the target image.
According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the above-mentioned method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising:
a processor; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
According to the image drawing method provided by the embodiment of the disclosure, the key feature information included in the voice command input by the user is determined, then the image drawing processing is performed according to the key feature information to generate the candidate image list, and finally the selected candidate image is used as the target image corresponding to the voice command according to the selection operation of the displayed candidate image list, so that the drawing of the target image is completed. On one hand, the key characteristic information contained in the voice command input by the user is determined, and the candidate image is generated according to the key characteristic information, so that the image can be drawn through the voice command, the interestingness of voice interaction is improved, and the application range of the voice interaction is enlarged; on the other hand, different candidate images are generated for the user to select, so that the user can select the desired candidate image independently, and the flexibility and diversity of the voice drawing image are improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically illustrates a flow chart of a method of image rendering in an exemplary embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart for determining key feature information in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart for determining an intent type corresponding to a voice instruction in an exemplary embodiment of the disclosure;
FIG. 6 schematically illustrates a flow chart for modifying candidate images in an exemplary embodiment of the disclosure;
FIG. 7 is a schematic diagram illustrating an application of modifying candidate images according to an exemplary embodiment of the disclosure;
FIG. 8 schematically illustrates another flow chart for modifying candidate images in an exemplary embodiment of the disclosure;
FIG. 9 is a schematic diagram illustrating another application of modifying candidate images in an exemplary embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart for implementing target image rendering in an exemplary embodiment of the present disclosure;
fig. 11 schematically shows a composition diagram of an image drawing apparatus in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which an image rendering method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having an image processing function, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The image drawing method provided by the embodiment of the present disclosure is generally executed by the terminal devices 101, 102, 103, and accordingly, the image drawing apparatus is generally provided in the terminal devices 101, 102, 103. However, it is easily understood by those skilled in the art that the image drawing method provided in the embodiment of the present disclosure may also be executed by the server 105, and accordingly, the image drawing apparatus may also be disposed in the server 105, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, the user may collect a voice instruction through a voice acquisition unit (e.g., a microphone) included in the terminal device 101, 102, 103 for acquiring voice information, and then upload the voice instruction to the server 105, and after the server generates a candidate image by using the image drawing method provided by the embodiment of the present disclosure, the server transmits the candidate image to the terminal device 101, 102, 103, etc. in the form of a candidate image list for presentation.
An exemplary embodiment of the present disclosure provides an electronic device for implementing an image drawing method, which may be the terminal device 101, 102, 103 or the server 105 in fig. 1. The electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the image rendering method via execution of the executable instructions.
The following takes the mobile terminal 200 in fig. 2 as an example, and exemplifies the configuration of the electronic device. It will be appreciated by those skilled in the art that the configuration of figure 2 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes. In other embodiments, mobile terminal 200 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is only schematically illustrated and does not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also interface differently than shown in fig. 2, or a combination of multiple interfaces.
As shown in fig. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display 290, a camera module 291, an indicator 292, a motor 293, a button 294, and a Subscriber Identity Module (SIM) card interface 295. Wherein the sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyroscope sensor 2803, and the like.
Processor 210 may include one or more processing units, such as: the Processor 210 may include an Application Processor (AP), a modem Processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband Processor, and/or a Neural-Network Processing Unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors.
The NPU is a Neural-Network (NN) computing processor, which processes input information quickly by using a biological Neural Network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the mobile terminal 200, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and execution is controlled by processor 210.
The charge management module 240 is configured to receive a charging input from a charger. The power management module 241 is used for connecting the battery 242, the charging management module 240 and the processor 210. The power management module 241 receives the input of the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the display screen 290, the camera module 291, the wireless communication module 260, and the like.
The wireless communication function of the mobile terminal 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, a modem processor, a baseband processor, and the like. Wherein, the antenna 1 and the antenna 2 are used for transmitting and receiving electromagnetic wave signals; the mobile communication module 250 may provide a solution including wireless communication of 2G/3G/4G/5G, etc. applied to the mobile terminal 200; the modem processor may include a modulator and a demodulator; the Wireless communication module 260 may provide a solution for Wireless communication including a Wireless Local Area Network (WLAN) (e.g., a Wireless Fidelity (Wi-Fi) network), Bluetooth (BT), and the like, applied to the mobile terminal 200. In some embodiments, antenna 1 of the mobile terminal 200 is coupled to the mobile communication module 250 and antenna 2 is coupled to the wireless communication module 260, such that the mobile terminal 200 may communicate with networks and other devices via wireless communication techniques.
The mobile terminal 200 implements a display function through the GPU, the display screen 290, the application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 290 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
The mobile terminal 200 may implement a photographing function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. The ISP is used for processing data fed back by the camera module 291; the camera module 291 is used for capturing still images or videos; the digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals; the video codec is used to compress or decompress digital video, and the mobile terminal 200 may also support one or more video codecs.
The external memory interface 222 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the mobile terminal 200. The external memory card communicates with the processor 210 through the external memory interface 222 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
Internal memory 221 may be used to store computer-executable program code, which includes instructions. The internal memory 221 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (e.g., audio data, a phonebook, etc.) created during use of the mobile terminal 200, and the like. In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk Storage device, a Flash memory device, a Universal Flash Storage (UFS), and the like. The processor 210 executes various functional applications of the mobile terminal 200 and data processing by executing instructions stored in the internal memory 221 and/or instructions stored in a memory provided in the processor.
The mobile terminal 200 may implement an audio function through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the earphone interface 274, the application processor, and the like. Such as music playing, recording, etc.
The depth sensor 2801 is used to acquire depth information of a scene. In some embodiments, a depth sensor may be provided to the camera module 291.
The pressure sensor 2802 is used to sense a pressure signal and convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 2802 may be disposed on the display screen 290. Pressure sensor 2802 can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like.
The gyro sensor 2803 may be used to determine a motion gesture of the mobile terminal 200. In some embodiments, the angular velocity of the mobile terminal 200 about three axes (i.e., x, y, and z axes) may be determined by the gyroscope sensor 2803. The gyro sensor 2803 can be used to photograph anti-shake, navigation, body-feel game scenes, and the like.
In addition, other functional sensors, such as an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., may be provided in the sensor module 280 according to actual needs.
Other devices for providing auxiliary functions may also be included in mobile terminal 200. For example, the keys 294 include a power-on key, a volume key, and the like, and a user can generate key signal inputs related to user settings and function control of the mobile terminal 200 through key inputs. Further examples include indicator 292, motor 293, SIM card interface 295, etc.
In a related voice interaction technical scheme, a drawing vector diagram and drawing semantic information are first fused on the basis of collected images and voice commands, and whether they belong to the same drawing task is judged; if so, a new drawing vector is generated after repeated regions are merged on the basis of the drawing vector diagram and the drawing semantics, and a robot executes drawing actions to complete the drawing task. However, in this technical solution, the painting scene is limited to robot devices: the painting operation is completed through the hardware facilities of the robot, and image painting cannot be completed through software alone. Moreover, the quality of the robot painting can only be improved by combining the collected image information with the user instruction for multi-modal information supplement, so that high-quality image painting cannot be completed through a voice instruction alone, which reduces the flexibility of image painting.
The following describes an image rendering method according to an exemplary embodiment of the present disclosure in detail by taking an example of execution by a terminal device.
Fig. 3 shows a flow of an image drawing method in the present exemplary embodiment, which may include the following steps S310 to S330:
in step S310, an input voice command is acquired, and key feature information in the voice command is determined.
In an exemplary embodiment, the voice instruction refers to a control instruction issued by the user to the terminal device in the form of voice; for example, the voice instruction may be "I want to draw a goldfish with a white head and a black tail" or "please help me play a song", which is not particularly limited in this exemplary embodiment. The voice instruction issued by the user may be acquired through a voice assistant installed in the terminal device, or through a voice acquisition device, such as a smart speaker, communicatively connected to the terminal device.
The key feature information is information contained in the text data corresponding to a voice instruction that characterizes its semantic features, and is extracted from the text data based on natural language understanding technology. The key feature information may be a single independent piece of information or multiple pieces of information having an association relationship, which is not particularly limited in this example embodiment. For example, for the voice instruction "I want to draw a chair", the key feature information corresponding to this image drawing voice instruction may be "chair"; for the voice instruction "I want to draw a chair in the shape of an avocado", the key feature information corresponding to this image drawing voice instruction may be "avocado shape", "avocado color" and "chair". Of course, these are only examples, and this example embodiment is not limited thereto.
Specifically, the voice command may be converted into text data by an automatic speech recognition (ASR) technique, for example through a Dynamic Time Warping (DTW) model, a Vector Quantization (VQ) model or a Hidden Markov Model (HMM); the text data may then be input into a pre-trained feature extraction model (such as a bag-of-words model or a term frequency-inverse document frequency (TF-IDF) model), which outputs the key feature information corresponding to the voice command.
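As a non-limiting illustration of this step, the following Python sketch assumes the transcription is produced by some ASR engine (the speech_to_text stub stands in for any DTW, VQ or HMM based recognizer) and uses a plain TF-IDF weighting from scikit-learn as one possible feature extraction model; the reference corpus and parameter choices are assumptions made purely for the example.

# Minimal sketch: voice instruction -> text -> key feature information.
# speech_to_text is a hypothetical stand-in for an ASR engine; the TF-IDF keyword
# scoring stands in for the pre-trained feature extraction model.
from sklearn.feature_extraction.text import TfidfVectorizer

def speech_to_text(audio_bytes: bytes) -> str:
    """Placeholder for an ASR model (DTW, VQ or HMM based)."""
    raise NotImplementedError

def extract_key_features(text: str, corpus: list[str], top_k: int = 3) -> list[str]:
    # Fit TF-IDF on a reference corpus plus the current instruction, then keep
    # the highest-weighted terms of the instruction as its key feature information.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(corpus + [text])
    weights = vectorizer.transform([text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, weights), key=lambda kv: kv[1], reverse=True)
    return [term for term, weight in ranked[:top_k] if weight > 0]

print(extract_key_features("I want to draw an avocado shaped chair",
                           corpus=["I want to listen to a song",
                                   "I want to draw a goldfish"]))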
Of course, the embodiment is not limited to the voice command, and the input text command may also be directly obtained, for example, the text command input by the user may be obtained through a text input interface of a voice assistant provided by the terminal device, so as to determine the key feature information in the text command.
In step S320, an image drawing process is performed according to the key feature information, and a candidate image list is generated.
In an exemplary embodiment, the image drawing process refers to a process of generating an image associated with the key feature information. For example, for the voice instruction "I want to draw a chair in an avocado shape", the key feature information corresponding to this image drawing voice instruction may be "avocado shape", "avocado color" and "chair"; the image drawing process may then match, or learn and predict, the image information associated with "avocado shape", "avocado color" and "chair" respectively, perform processing such as relative positioning, object stacking and attribute control on the three types of obtained image information based on the key feature information, and perform fusion drawing to obtain a plurality of candidate images, from which a candidate image list is constructed.
In step S330, in response to a selection operation on the presented candidate image list, the selected candidate image is taken as a target image corresponding to the voice instruction to complete drawing of the target image.
In an exemplary embodiment, after obtaining the candidate image list, a plurality of candidate images in the candidate image list may be displayed on a visualization platform provided to the user, for example, the visualization platform may be a human-computer interaction interface of the terminal device, and may also be a wearable device such as smart glasses connected to the terminal device, which is not limited to this exemplary embodiment.
The selection operation may refer to an operation for selecting and determining a target image in the candidate image list. For example, the selection operation may be a touch operation performed by the user on the provided visualization platform (e.g., a click operation, a double-click operation, a slide operation, a press operation, a long-press operation, or the like), a key operation (e.g., a key operation input through a volume key, a confirmation key of an external device, or the like), or a gravity sensing operation (e.g., an operation of moving the selection across the images through the gravity sensing function provided by a gyroscope to implement selection of an image). The selection operation may also be an operation in which the user selects the target image through a voice instruction; for example, each candidate image in the candidate image list may be numbered, and the user selects the candidate image numbered XX as the target image by issuing a voice instruction of "select image XX". Of course, the above selection operations are only schematic illustrations, and the selection operation for selecting the target image is not limited in any way in this exemplary embodiment.
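As a non-limiting illustration of how such a selection operation may be resolved, the sketch below maps either a touch index or a transcribed "select image XX" instruction to a candidate image; the phrasing it parses and the 1-based numbering are assumptions for the example only.

import re

def resolve_selection(candidates: list[str], selection) -> str:
    # An integer is treated as a 0-based index coming from a touch or key event;
    # a string is treated as a transcribed voice instruction such as "select image 2"
    # (with 1-based numbering as displayed in the candidate image list).
    if isinstance(selection, int):
        return candidates[selection]
    match = re.search(r"select image\s*(\d+)", selection.lower())
    if match is None:
        raise ValueError("unrecognized selection operation")
    return candidates[int(match.group(1)) - 1]

images = ["candidate_1.png", "candidate_2.png", "candidate_3.png"]
# A tap on the third thumbnail and the instruction "select image 3" pick the same candidate.
assert resolve_selection(images, 2) == resolve_selection(images, "select image 3")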
Next, step S310 to step S330 will be further described.
In an exemplary embodiment, since the obtained voice instruction may not only be used for image drawing, but also may have a voice instruction with other intentions, before extracting key feature information in the voice instruction, it is first necessary to judge an intention of the voice instruction to improve an execution accuracy of the voice instruction, as shown in fig. 4, specifically, the method may include:
step S410, acquiring an input voice command and determining the intention type of the voice command;
step S420, if the intention type is an image drawing intention, determining key feature information in the voice instruction.
The intention type refers to the classification of the control command. For example, for the voice instruction "I want to listen to a song", the corresponding intention type may be a music playing intention; for the voice instruction "I want to draw a chair", the corresponding intention type may be an image drawing intention, which is not particularly limited in this example embodiment.
Specifically, the determination of the intended type of the voice command may be implemented by the steps in fig. 5, and as shown in fig. 5, the method specifically includes:
step S510, carrying out voice recognition on the voice command to obtain text data corresponding to the voice command;
step S520, inputting the text data into a pre-trained intention classification model, and outputting a plurality of intention types and confidence data of the intention types;
step S530, sorting the confidence level data, and using the intention type with the maximum confidence level data as the intention type of the voice instruction.
For example, after the audio in the voice command is subjected to signal processing, the audio is split into frames (at the millisecond level), and each small segment of waveform is converted into multi-dimensional vector information according to the characteristics of human hearing. The vector information is recognized into states (an intermediate unit smaller than a phoneme), the obtained states are combined to form phonemes (usually 3 states form 1 phoneme), and finally the obtained phonemes are combined into words and connected in series to form sentences, so that the voice command can be converted into the corresponding text data.
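A toy illustration of this state-to-phoneme-to-word assembly is sketched below; the state identifiers, phoneme table and lexicon are invented for the example, whereas a real recognizer derives them from acoustic and language models.

# Toy decode: frames have already been turned into state ids; 3 states form 1 phoneme,
# and phonemes are looked up in a lexicon to form a word. All tables are illustrative.
STATE_TO_PHONEME = {("s1", "s2", "s3"): "d", ("s4", "s5", "s6"): "r", ("s7", "s8", "s9"): "aw"}
LEXICON = {("d", "r", "aw"): "draw"}

def decode(states: list[str]) -> str:
    phonemes = tuple(STATE_TO_PHONEME[tuple(states[i:i + 3])]
                     for i in range(0, len(states), 3))
    return LEXICON.get(phonemes, "<unk>")

print(decode(["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]))  # -> 'draw'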
The intention classification model refers to a classification model trained in advance on text-intention training samples, and the confidence data refers to the score data corresponding to each intention type. For example, the text data "I want to draw a chair" is input into the intention classification model, which outputs "intention type: image drawing, confidence data: 8; intention type: music play, confidence data: 2; ……", where the confidence data may depend on the number of hits in the text data for key features under a certain intention type.
The sorting may be performed according to the confidence data of the intention types output by the intention classification model, for example from large to small or from small to large, which is not limited in this example embodiment. The intention type with the maximum confidence data is determined from the sorted intention types and taken as the intention type of the voice instruction. For example, for the voice instruction "I want to draw a chair", the instruction is input into the intention classification model, which outputs "intention type: image drawing, confidence data: 8; intention type: music play, confidence data: 2; ……"; after sorting, the confidence data of the image drawing intention is determined to be the largest, so the image drawing intention is taken as the intention type of the voice instruction.
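As a non-limiting sketch of this classification-and-sorting step, the toy scorer below counts key-feature hits per intention type and sorts the results by confidence; the feature lists are assumptions standing in for a trained intention classification model.

# Toy stand-in for the intention classification model: score each intention type by
# counting hits of its key features in the text, then sort by confidence (descending).
INTENT_FEATURES = {
    "image drawing": ["draw", "paint", "sketch", "picture"],
    "music play": ["play", "song", "music", "listen"],
}

def classify_intent(text: str) -> list[tuple[str, int]]:
    words = text.lower().split()
    scores = {intent: sum(words.count(feature) for feature in features)
              for intent, features in INTENT_FEATURES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = classify_intent("I want to draw a chair")
print(ranked)        # e.g. [('image drawing', 1), ('music play', 0)]
print(ranked[0][0])  # the intention type with the maximum confidence data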
In an exemplary embodiment, the target image information may be matched from a preset text image pair database according to the key feature information; the target image information may then be image fused to generate a plurality of candidate images, and a candidate image list may be constructed from the plurality of candidate images.
The text image pair database refers to a database of a matching relationship between pre-constructed feature information and image information, and the target image information refers to a basic image obtained by matching key feature information from the text image pair database based on the matching relationship, for example, if the key feature information is "goldfish," the target image information that can be obtained by matching from the text image pair database is an image containing various "goldfish" elements.
After the target image information is obtained by matching from the preset text image pair database according to the key feature information, the target image information corresponding to different pieces of key feature information can be freely combined, and the differently combined target image information is then subjected to processing such as relative positioning, object stacking and attribute control to fuse the target image information, so that a plurality of candidate images associated with the key feature information can be obtained. For example, for the voice instruction "I want to draw a chair with an avocado shape", the key feature information corresponding to this image drawing voice instruction may be "avocado shape", "avocado color" and "chair"; image information related to a plurality of avocado shapes associated with the key feature information "avocado shape", image information related to a plurality of avocado colors associated with the key feature information "avocado color", and image information related to a plurality of chairs associated with the key feature information "chair" may then be matched in the text image pair database, and the image information under the different pieces of key feature information is freely combined and image-fused to obtain different kinds of candidate images corresponding to the voice instruction "I want to draw a chair with an avocado shape". Of course, this is only an illustrative example, and the present exemplary embodiment is not limited thereto.
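As a non-limiting illustration of the matching and free-combination step, the following sketch models the text image pair database as a simple mapping and uses a placeholder fuse function in place of the relative positioning, object stacking and attribute control processing; the entries and file names are assumptions for the example.

import itertools

# Illustrative stand-in for the pre-built text image pair database: each piece of key
# feature information maps to several pieces of basic target image information.
TEXT_IMAGE_DB = {
    "avocado shape": ["shape_a.png", "shape_b.png"],
    "avocado color": ["color_a.png", "color_b.png"],
    "chair": ["chair_a.png", "chair_b.png"],
}

def fuse(parts: tuple[str, ...]) -> str:
    """Placeholder for fusion (relative positioning, object stacking, attribute control)."""
    return "fused(" + "+".join(parts) + ")"

def build_candidate_list(key_features: list[str]) -> list[str]:
    matched = [TEXT_IMAGE_DB[k] for k in key_features if k in TEXT_IMAGE_DB]
    # Free combination of the matched target image information; one candidate per combination.
    return [fuse(combination) for combination in itertools.product(*matched)]

candidates = build_candidate_list(["avocado shape", "avocado color", "chair"])
print(len(candidates))  # 2 * 2 * 2 = 8 candidate images in the candidate image list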
In an exemplary embodiment, since the target image desired by the user may not be generated in a single pass when the image is drawn, after the candidate image list is generated according to the user's first voice instruction, the candidate images in the candidate image list may be modified according to the user's modification voice instruction. As shown in fig. 6, this may specifically include:
step S610, acquiring an input voice modification instruction, and determining modification key characteristic information in the voice modification instruction;
step S620, modifying each candidate image in the candidate image list according to the modification key feature information, and generating a modified candidate image list.
The modification voice instruction refers to a voice instruction for modifying and adjusting each candidate image in the candidate image list. For example, suppose the first input voice command is "I want to draw an avocado-shaped chair" and a candidate image list corresponding to the voice command is generated; the modification voice command may then be "add an avocado-shaped table beside the avocado-shaped chair", and the corresponding modification key feature information may be "beside", "add", "avocado shape" and "table". New image information can thus be obtained by matching the modification key feature information in the text image pair database, and the new image information is added to each candidate image in the candidate image list for further image fusion, resulting in a modified candidate image list consisting of candidate images in which an "avocado-shaped table" has been added beside the "avocado-shaped chair".
Fig. 7 schematically illustrates an application diagram of modifying a candidate image in an exemplary embodiment of the disclosure.
Referring to fig. 7, a manner of modifying candidate images is described below by taking as an example that a voice assistant of a mobile terminal implements the image drawing method in the present exemplary embodiment:
step S710, acquiring a voice instruction of a user through a voice assistant, converting the voice instruction into text data, extracting key feature information in the text data when detecting that the intention type of the text data is an image drawing intention, further matching a plurality of target image information according to the key feature information, performing image fusion on the plurality of target image information to obtain a candidate image list consisting of a plurality of candidate images (such as candidate image 1, candidate image 2, candidate image 3 and candidate image 4 … …), and prompting the user to continue to modify through the voice instruction;
step S720, at least one round of modification voice commands issued by the user may be continuously obtained to modify the candidate images, after a valid modification voice command is detected, modified image information used for modifying the content of the candidate images in the modification voice command is extracted, and the modified image information is fused to each candidate image in the candidate image list to obtain a modified candidate image list.
In another exemplary embodiment, because the candidate image list generated according to the first voice instruction already contains many candidate images, modifying all of the candidate images in the list according to the image information matched from a modification voice instruction produces even more combinations, so the number of generated candidate images may grow exponentially as modification instructions accumulate, resulting in an excessive amount of computation. Therefore, after the candidate image list is generated according to the voice instruction, the user may be prompted to determine one candidate image from the generated candidate image list, and when a modification voice instruction input by the user is subsequently received, only the determined candidate image is modified to generate the modified candidate image list. This effectively reduces the number of candidate images in the candidate image list, improves computational efficiency, and helps the user quickly determine the desired target image. Referring to fig. 7, the modification of the candidate image may specifically be implemented by the steps in fig. 8, and may include:
step S810, acquiring an input voice modification instruction, and determining modification key characteristic information in the voice modification instruction;
step S820, responding to the selection operation of the displayed candidate image list, and determining a selected candidate image;
step S830, modifying the selected candidate image according to the modification key feature information, and generating a modified candidate image list.
The modification voice instruction here is a voice instruction for modifying and adjusting the candidate image selected from the candidate image list by the selection operation. For example, suppose the first input voice instruction is "I want to draw a chair in the shape of an avocado" and a candidate image list corresponding to the voice instruction is generated; the user is then prompted to select the most desirable candidate image from the candidate image list, and is further prompted to modify the selected image through a modification voice instruction. Assuming the modification voice instruction is "add an avocado-shaped table beside the avocado-shaped chair", the corresponding modification key feature information may be "beside", "add", "avocado shape" and "table", so that new image information can be obtained by matching the modification key feature information in the text image pair database. The new image information is added to the selected candidate image and image fusion (such as relative positioning, object stacking and attribute control) is performed to obtain different kinds of candidate images in which an "avocado-shaped table" has been added beside the "avocado-shaped chair", which form the modified candidate image list.
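A minimal sketch of this selected-image modification is given below; it assumes the same kind of mapping-style database and placeholder fusion as the earlier sketch, so that only the chosen candidate is expanded rather than the whole candidate image list.

def modify_selected(selected_image: str,
                    modification_features: list[str],
                    text_image_db: dict[str, list[str]]) -> list[str]:
    """Fuse newly matched image information into the single selected candidate only."""
    modified = []
    for feature in modification_features:
        for new_part in text_image_db.get(feature, []):
            # Placeholder fusion standing in for relative positioning / stacking / attribute control.
            modified.append(f"fused({selected_image}+{new_part})")
    return modified

# "Add an avocado-shaped table beside the avocado-shaped chair", applied to one chosen candidate.
db = {"table": ["avocado_table_a.png", "avocado_table_b.png"]}
print(modify_selected("selected_chair.png", ["table"], db))
# Two modified variants of the selected candidate, instead of re-expanding every candidate.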
Fig. 9 schematically illustrates another application diagram for modifying candidate images in an exemplary embodiment of the disclosure.
Referring to fig. 9, another way of modifying candidate images is described below by taking as an example that a voice assistant of a mobile terminal implements the image drawing method in the present exemplary embodiment:
step S910, obtaining a voice command of a user through a voice assistant, converting the voice command into text data, extracting key feature information in the text data when detecting that the intention type of the text data is an image drawing intention, further matching a plurality of target image information according to the key feature information, performing image fusion on the plurality of target image information to obtain a candidate image list consisting of a plurality of candidate images (such as candidate image 1, candidate image 2, candidate image 3, candidate image 4 … …), and prompting the user to select one candidate image in the candidate image list;
step S920, selecting a candidate image in various ways, for example, selecting a candidate image such as candidate image 1 by issuing a voice command, and prompting the user to modify the candidate image continuously through the voice command;
step S930, continuously obtaining at least one round of modification voice commands issued by the user, such as the modification voice command "add a triangle to the fish"; after a valid modification voice command is detected, extracting the modification image information used for modifying the candidate image content in the modification voice command (such as "fish", "up", "add" and "triangle"), and fusing the modification image information into the selected candidate image, such as candidate image 1, to obtain a modified candidate image list composed of different kinds of modified candidate images 1.
In an exemplary embodiment, after the first voice instruction, multiple voice modification instructions and the user's final determination, the main part of the image that the user desires to draw is obtained. At this time, a suitable scene (i.e., a background part) may be added to the image to obtain a more complete target image. Specifically, a background image list associated with the target image may be provided, and the target background image selected from the background image list is fused with the target image to obtain a target image with a scene.
For example, if the voice instruction for setting the image scene input by the user is "add a fish tank", a plurality of background images containing a "fish tank" may be matched from the text image pair database according to the key feature information, and a background image list may be constructed. Alternatively, by identifying the image elements in the image finally determined by the user, predefined background images associated with those image elements may be matched from the text image pair database; for example, when the image element in the finally determined image is "goldfish", a plurality of background images such as "fish tank", "ocean" and "stream" related to the image element "goldfish" may be matched in the text image pair database and constructed into a background image list to be provided for the user to select.
The fusion of the target background image and the target image refers to a process of fusing the main body part of the target image onto the target background image. Specifically, edge detection may be performed on the target image through an edge detection algorithm, the main body part image is separated from the target image, and the main body part image is fused onto the target background image to obtain the target image with a scene. For example, if the finally determined target image contains the main body part "goldfish" and the background image selected by the user from the background image list is "fish tank", the edges of the target image are detected, the main body part image corresponding to the "goldfish" is extracted, and the main body part image is fused with the background image to obtain the target image of the "goldfish swimming in the fish tank".
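As a non-limiting sketch of this subject-extraction-and-fusion step, the OpenCV snippet below composites the main body part onto a selected background; it assumes the drawn subject sits on a plain near-white canvas, so a simple threshold mask stands in for the unspecified edge detection algorithm, and the file names are illustrative.

import cv2
import numpy as np

def fuse_with_background(subject_path: str, background_path: str, out_path: str) -> None:
    subject = cv2.imread(subject_path)         # e.g. the drawn goldfish
    background = cv2.imread(background_path)   # e.g. the selected fish tank
    background = cv2.resize(background, (subject.shape[1], subject.shape[0]))

    gray = cv2.cvtColor(subject, cv2.COLOR_BGR2GRAY)
    # Pixels darker than the near-white canvas are treated as the main body part.
    _, mask = cv2.threshold(gray, 245, 255, cv2.THRESH_BINARY_INV)
    mask3 = cv2.merge([mask, mask, mask])

    composed = np.where(mask3 > 0, subject, background)
    cv2.imwrite(out_path, composed)

# fuse_with_background("goldfish.png", "fish_tank.png", "goldfish_in_fish_tank.png")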
In an exemplary embodiment, in order to improve the interest of the drawn target image, an interactive special effect animation may be added to the target image, so as to realize interaction between the user and the drawn target image. Specifically, the interactive special effect animation corresponding to the target image may be matched from a preset special effect database and associated with the target image, so that the interactive special effect animation is displayed when an interactive action triggering the target image is detected. For example, if the finally obtained target image is "goldfish in a fish tank", a "bubble spitting" special effect animation may be matched in the preset special effect database and relatively positioned with the target image so that it is located at the head of the "goldfish" in the target image. When the user triggers a voice instruction such as "spit bubbles" or clicks the target image, it may be determined that an interactive action triggering the target image has been detected, and the interactive special effect animation is displayed at this time. The interactive special effect animation may continue to be displayed after being triggered, or may disappear after being displayed for a certain time, which is not particularly limited in this example.
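A minimal sketch of associating an interactive special effect animation with the target image and displaying it on a trigger is shown below; the trigger names, effect database entries and file names are assumptions for the example.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TargetImage:
    path: str
    # Special effect animations keyed by the interactive action that triggers them.
    effects: Dict[str, str] = field(default_factory=dict)

def attach_effect(image: TargetImage, effect_db: Dict[str, str], element: str, trigger: str) -> None:
    """Match an animation for an image element in the preset effect database and associate it."""
    if element in effect_db:
        image.effects[trigger] = effect_db[element]

def on_interaction(image: TargetImage, trigger: str) -> Optional[str]:
    """Return the animation to display when the target image is triggered, if any."""
    return image.effects.get(trigger)

effect_db = {"goldfish": "bubble_spitting.webp"}
img = TargetImage("goldfish_in_fish_tank.png")
attach_effect(img, effect_db, element="goldfish", trigger="spit bubbles")
print(on_interaction(img, "spit bubbles"))  # -> 'bubble_spitting.webp'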
Fig. 10 schematically illustrates a flowchart for implementing target image rendering in an exemplary embodiment of the present disclosure.
Referring to fig. 10, in step S1001, a voice command input by a user is obtained, and the voice command is subjected to voice recognition to obtain text data corresponding to the voice command;
step S1002, inputting text data into a pre-trained intention recall model and outputting an intention type;
step S1003, determining whether the intention type of the voice command is an image drawing intention, if it is detected that the intention type of the voice command is an image drawing intention, executing step S1004, otherwise executing step S1005;
step S1004, extracting key feature information contained in the text data through a feature extraction model;
step S1005, executing the voice command to realize corresponding operation;
step S1006, matching target image information from a preset text image pair database based on the key feature information;
step S1007, the target image information is processed based on image fusion such as relative positioning, object stacking and attribute control, a plurality of candidate images are obtained and displayed on a visual platform for the user to select;
step S1008, prompting the user whether to modify or draw the candidate image, if it is determined that the user is to modify or draw the candidate image, executing step S1009, otherwise executing step S1010;
step S1009, acquiring at least one round of modification voice instruction, and extracting modification key characteristic information in the modification voice instruction;
step S1010, prompting a user to select one of the displayed candidate images as a target image;
step S1011, matching modified image information from a preset text image database based on the modified key feature information, and modifying a plurality of candidate images according to the modified image information to obtain new candidate images;
step S1012, determining a target image from the plurality of candidate images in response to the selection operation of the user, and providing a background image list to allow the user to select the background image and add it to the target image;
step S1013, matching the interactive special effect animation and associating the interactive special effect animation with the target image so as to display the interactive special effect animation when the target image is triggered by the user.
In summary, in the exemplary embodiment, first, the key feature information included in the voice command input by the user is determined, then, the image drawing processing is performed according to the key feature information to generate the candidate image list, and finally, according to the selection operation on the displayed candidate image list, the selected candidate image is taken as the target image corresponding to the voice command, and the drawing of the target image is completed. On one hand, the key characteristic information contained in the voice command input by the user is determined, and the candidate image is generated according to the key characteristic information, so that the image can be drawn through the voice command, the interestingness of voice interaction is improved, and the application range of the voice interaction is enlarged; on the other hand, different candidate images are generated for the user to select, so that the user can select the desired candidate image independently, and the flexibility and diversity of the voice drawing image are improved.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 11, the image rendering apparatus 1100 according to the embodiment of the present disclosure may include a feature information determining module 1110, a candidate image list generating module 1120, and a target image determining module 1130. Wherein:
the feature information determining module 1110 is configured to obtain an input voice instruction and determine key feature information in the voice instruction;
the candidate image list generating module 1120 is configured to perform image drawing processing according to the key feature information to generate a candidate image list;
the target image determination module 1130 is configured to, in response to a selection operation on the presented candidate image list, take the selected candidate image as a target image corresponding to the voice instruction to complete drawing of the target image.
In an exemplary embodiment, the characteristic information determination module 1110 may be configured to:
acquiring an input voice instruction, and determining the intention type of the voice instruction;
and if the intention type is an image drawing intention, determining key characteristic information in the voice instruction.
In an exemplary embodiment, the characteristic information determination module 1110 may be further configured to:
performing voice recognition on the voice instruction to obtain text data corresponding to the voice instruction;
inputting the text data into a pre-trained intention classification model, and outputting a plurality of intention types and confidence data of the intention types;
and sequencing the confidence coefficient data, and taking the intention type with the maximum confidence coefficient data as the intention type of the voice instruction.
In an exemplary embodiment, the candidate image list generation module 1120 may be configured to:
matching target image information from a preset text image database according to the key feature information;
and performing image fusion on the target image information to generate a plurality of candidate images, and constructing a candidate image list from the candidate images.
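A minimal sketch of this "match materials from a text image database, then fuse them into candidates" step is given below, assuming a keyword-indexed dictionary of transparent element images and Pillow compositing; the database layout and file paths are illustrative assumptions.

```python
# Hypothetical text-image database lookup and fusion; paths and keys are made up.
from typing import Dict, List
from PIL import Image

# Preset text image database: key feature word -> path of a transparent PNG element.
TEXT_IMAGE_DATABASE: Dict[str, str] = {
    "cat": "elements/cat.png",
    "hat": "elements/hat.png",
    "red": "elements/red_tint.png",
}

def match_target_image_info(key_features: List[str]) -> List[str]:
    """Match element image paths for each recognized key feature."""
    return [TEXT_IMAGE_DATABASE[f] for f in key_features if f in TEXT_IMAGE_DATABASE]

def fuse_candidates(element_paths: List[str], n_candidates: int = 3) -> List[Image.Image]:
    """Fuse the matched elements into several differing candidate images."""
    candidates = []
    for i in range(n_candidates):
        canvas = Image.new("RGBA", (512, 512), (255, 255, 255, 255))
        for j, path in enumerate(element_paths):
            element = Image.open(path).convert("RGBA")
            # Vary the placement per candidate so the user gets distinct options.
            offset = (20 * i + 40 * j, 20 * i)
            canvas.paste(element, offset, element)  # alpha channel used as mask
        candidates.append(canvas)
    return candidates
```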
In an exemplary embodiment, the image rendering device 1100 may further comprise an image modification unit, which may be configured to:
acquiring an input voice modification instruction, and determining modification key feature information in the voice modification instruction;
and modifying each candidate image in the candidate image list according to the modification key feature information to generate a modified candidate image list.
In an exemplary embodiment, the image modification unit may be further configured to:
acquiring an input voice modification instruction, and determining modification key feature information in the voice modification instruction;
in response to a selection operation on the presented candidate image list, determining a selected candidate image;
and modifying the selected candidate image according to the modification key feature information to generate a modified candidate image list.
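Both variants of the image modification unit (modifying every candidate, or only the selected one) amount to re-running the drawing step with the updated features; the hedged sketch below passes the drawing step in as a callable, which is a hypothetical hook rather than an API of the disclosure.

```python
# Hypothetical modification flow; `regenerate` stands for a drawing routine such
# as the fuse_candidates sketch above.
from typing import Callable, List, Optional

def modify_candidates(current_features: List[str],
                      modification_features: List[str],
                      regenerate: Callable[[List[str], int], list],
                      selected_index: Optional[int] = None) -> list:
    """Apply modification key feature information to the candidate list."""
    updated_features = current_features + modification_features
    if selected_index is None:
        # Variant 1: every candidate in the list is redrawn with the new features.
        return regenerate(updated_features, 3)
    # Variant 2: only the candidate the user selected is redrawn.
    return regenerate(updated_features, 1)
```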
In an exemplary embodiment, the image rendering device 1100 may further include a scene adding unit, and the scene adding unit may be configured to:
providing a background image list associated with the target image;
and fusing the target background image selected from the background image list with the target image to obtain a target image with a scene.
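Fusing the chosen background with the target image can again be plain compositing; the sketch below assumes Pillow and an illustrative background file path.

```python
# Hypothetical scene-adding step: composite the drawn target onto a chosen background.
from PIL import Image

def add_scene(target_image: Image.Image, background_path: str) -> Image.Image:
    """Fuse the selected background image with the target image."""
    background = Image.open(background_path).convert("RGBA")
    background = background.resize(target_image.size)
    # The target's alpha channel keeps the background visible around the subject.
    return Image.alpha_composite(background, target_image.convert("RGBA"))
```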
In an exemplary embodiment, the image drawing apparatus 1100 may further include a special effect animation adding unit, and the special effect animation adding unit may be configured to:
matching the interactive special effect animation corresponding to the target image from a preset special effect database, and associating the interactive special effect animation with the target image so as to display the interactive special effect animation when the target image is triggered.
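Associating a matched animation with the target image can be as simple as a lookup table plus an on-trigger callback; everything named in the following sketch is hypothetical, including the preset special effect database and its file paths.

```python
# Hypothetical special-effect association; the preset special effect database is
# represented as a dictionary from subject keyword to animation file.
from typing import Optional

SPECIAL_EFFECT_DATABASE = {
    "cat": "effects/cat_wag_tail.webp",
    "dog": "effects/dog_bark.webp",
}

class TargetImageView:
    """Minimal stand-in for the UI element that hosts the drawn target image."""

    def __init__(self, subject: str):
        self.subject = subject
        # Match the interactive special effect animation for this target image
        # and keep the association so it can be played later.
        self.effect_path: Optional[str] = SPECIAL_EFFECT_DATABASE.get(subject)

    def on_trigger(self) -> Optional[str]:
        # Called when the user taps or clicks the target image; the UI layer
        # would display the returned animation if one was associated.
        return self.effect_path
```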
The specific details of each module of the above apparatus have been described in detail in the method section; details not disclosed here may be found there and are therefore not repeated.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, for example, any one or more of the steps in fig. 3 to 8.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (11)

1. An image drawing method, comprising:
acquiring an input voice instruction, and determining key feature information in the voice instruction;
performing image drawing processing according to the key feature information to generate a candidate image list;
and in response to the selection operation of the displayed candidate image list, taking the selected candidate image as a target image corresponding to the voice instruction to finish drawing the target image.
2. The method of claim 1, wherein acquiring an input voice instruction and determining key feature information in the voice instruction comprises:
acquiring an input voice instruction, and determining the intention type of the voice instruction;
and if the intention type is an image drawing intention, determining key feature information in the voice instruction.
3. The method of claim 2, wherein the determining the intent type of the voice instruction comprises:
performing voice recognition on the voice instruction to obtain text data corresponding to the voice instruction;
inputting the text data into a pre-trained intention classification model, and outputting a plurality of intention types and confidence data of the intention types;
and sorting the intention types by their confidence data, and taking the intention type with the highest confidence as the intention type of the voice instruction.
4. The method according to claim 1, wherein performing image drawing processing according to the key feature information to generate a candidate image list comprises:
matching target image information from a preset text image database according to the key feature information;
and carrying out image fusion on the target image information to generate a plurality of candidate images, and constructing a candidate image list through the candidate images.
5. The method according to claim 1, wherein performing image drawing processing according to the key feature information to generate a candidate image list further comprises:
acquiring an input voice modification instruction, and determining modification key feature information in the voice modification instruction;
and modifying each candidate image in the candidate image list according to the modification key feature information to generate a modified candidate image list.
6. The method according to claim 1, wherein performing image drawing processing according to the key feature information to generate a candidate image list further comprises:
acquiring an input voice modification instruction, and determining modification key feature information in the voice modification instruction;
in response to a selection operation on the presented candidate image list, determining a selected candidate image;
and modifying the selected candidate image according to the modification key feature information to generate a modified candidate image list.
7. The method of claim 1, further comprising:
providing a background image list associated with the target image;
and fusing the target background image selected from the background image list with the target image to obtain a target image with a scene.
8. The method of claim 1, further comprising:
matching the interactive special effect animation corresponding to the target image from a preset special effect database, and associating the interactive special effect animation with the target image so as to display the interactive special effect animation when the target image is triggered.
9. An image drawing apparatus characterized by comprising:
the feature information determination module is used for acquiring an input voice instruction and determining key feature information in the voice instruction;
the candidate image list generation module is used for performing image drawing processing according to the key feature information to generate a candidate image list;
and the target image determination module is used for, in response to a selection operation on the displayed candidate image list, taking the selected candidate image as the target image corresponding to the voice instruction to complete the drawing of the target image.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 8 via execution of the executable instructions.
CN202110448969.5A 2021-04-25 2021-04-25 Image drawing method and device, computer readable medium and electronic device Pending CN113205569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110448969.5A CN113205569A (en) 2021-04-25 2021-04-25 Image drawing method and device, computer readable medium and electronic device

Publications (1)

Publication Number Publication Date
CN113205569A true CN113205569A (en) 2021-08-03

Family

ID=77028588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110448969.5A Pending CN113205569A (en) 2021-04-25 2021-04-25 Image drawing method and device, computer readable medium and electronic device

Country Status (1)

Country Link
CN (1) CN113205569A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113485405A (en) * 2021-08-05 2021-10-08 Oppo广东移动通信有限公司 Attitude acquisition method, robot and readable storage medium
CN117472228B (en) * 2023-12-25 2024-03-29 深圳市正通仁禾科技有限公司 Digital panel self-adaptive control method, device, equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159987A (en) * 2019-12-27 2020-05-15 深圳壹账通智能科技有限公司 Data chart drawing method, device, equipment and computer readable storage medium
CN111597808A (en) * 2020-04-24 2020-08-28 北京百度网讯科技有限公司 Instrument panel drawing processing method and device, electronic equipment and storage medium
CN112579031A (en) * 2019-09-27 2021-03-30 北京安云世纪科技有限公司 Voice interaction method and system and electronic equipment
CN112634407A (en) * 2020-12-31 2021-04-09 北京捷通华声科技股份有限公司 Method and device for drawing image




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination