CN117729421A - Image processing method, electronic device, and computer-readable storage medium


Info

Publication number
CN117729421A
Authority
CN
China
Prior art keywords
image
style
candidate
electronic device
user
Prior art date
Legal status
Pending
Application number
CN202311044671.3A
Other languages
Chinese (zh)
Inventor
蒋雪涵
李宇
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202311044671.3A
Publication of CN117729421A
Legal status: Pending


Abstract

The application relates to the technical field of image processing and provides an image processing method, an electronic device, and a computer-readable storage medium. After the electronic device displays a first image of a first image style on a user interface, it receives a first operation of a user on the user interface, where the first operation is used to trigger the electronic device to convert the image style of the image in the user interface. Then, in response to the first operation, the electronic device extracts the image content of the first image and takes the image style corresponding to the candidate image content with the highest similarity to that image content as a candidate image style. Finally, the electronic device selects a second image style according to the user's preference degree for the candidate image styles, and displays a second image of the second image style on the user interface. In this way, style processing is performed on the image based on the user's personal aesthetic preference, which ensures that the image processing effect conforms to the user's personal aesthetics and improves the accuracy of image processing.

Description

Image processing method, electronic device, and computer-readable storage medium
Technical Field
Embodiments of the present disclosure relate to the field of image processing technologies, and in particular, to an image processing method, an electronic device, and a computer readable storage medium.
Background
With the development of electronic device technology, more and more electronic devices are configured with a camera module for users to capture images. Based on their shooting needs and aesthetics, users have different preferences for the images they shoot. For example, different users have different demands on the hue, lighting, composition, etc. of an image.
Currently, in order to meet users' image preferences, the camera function in an electronic device may automatically adjust brightness or select a corresponding shooting mode according to the type of the photographed object, for example, modes such as portrait, macro, and night scene. However, the existing method mainly considers the type of the photographed object and does not fully consider the user's personal aesthetic preference, which causes the image processing effect to deviate and reduces the image processing accuracy.
Disclosure of Invention
The embodiments of the present application provide an image processing method, an electronic device, and a computer-readable storage medium, which are used to solve the problem that existing image processing does not consider the user's personal aesthetic preference, causing the image processing effect to deviate and the image processing accuracy to be reduced.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, an image processing method is provided, applied to an electronic device in which a plurality of candidate image contents are stored, each candidate image content corresponding to at least one image style. The method includes: after the electronic device displays a first image of a first image style on a user interface, receiving a first operation of the user on the user interface, where the first operation is used to trigger the electronic device to convert the image style of the image in the user interface. Then, in response to the first operation, the electronic device extracts the image content of the first image, and takes the at least one image style corresponding to the candidate image content with the highest similarity to the image content of the first image as the candidate image styles. Finally, the electronic device obtains the user's preference degree for the candidate image styles, selects a second image style from the candidate image styles according to the preference degree, and displays a second image of the second image style on the user interface. The user interface may be a large-image display page of a gallery in the electronic device, in which case the first image is an image in the gallery and the second image is the image obtained by converting the first image from the first image style to the second image style. Alternatively, the user interface may be a preview interface for image capture on the electronic device, in which case the first image is obtained by processing an image captured in real time by the camera of the electronic device with the first image style, and the second image is obtained by processing an image captured in real time by the camera with the second image style.
Because the preference degree can generally reflect the user's personal preferences, the electronic device can match, according to the user's personal preference degree for the candidate image styles, a second image style that conforms to the user's personal aesthetic preference, thereby performing style processing on the image based on the user's personal aesthetic preference. This ensures that the image processing effect conforms to the user's personal aesthetics and improves the accuracy of image processing. Meanwhile, because the candidate image styles are matched based on the image content of the first image displayed on the current user interface, an image style that fits the image content can be obtained, which further improves the accuracy of the image style conversion while satisfying the user's personal aesthetic preference.
In a possible implementation manner of the first aspect, in general, the more frequently the user uses an image style, the higher the user's preference for that image style. Therefore, the user's preference degree for a candidate image style may be the personal use frequency of that candidate image style. Accordingly, the personal use frequency of each image style corresponding to each candidate image content is also stored in the electronic device. In addition, since the electronic device is continuously used, the stored personal use frequency of each image style may change with the user's selection behavior. To ensure that the personal use frequency reflects the user's latest real preference, the electronic device may update the personal use frequency according to the user's operations.
Based on this, the above image processing method further includes: receiving a second operation of the user on the user interface, where the second operation is used to trigger the electronic device to save the second image of the second image style; and, in response to the second operation, updating the stored personal use frequency of the second image style. In this way, for a second image of the second image style that the user has chosen to save, the electronic device updates the personal use frequency of the second image style in time, ensuring that the personal use frequencies corresponding to the candidate image styles remain accurate.
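For illustration only, the following minimal Python sketch shows one way the stored usage statistics could be updated when the user saves a second image; the table structure, the content and style identifiers, and the function name are assumptions and not part of the patent.

```python
# Minimal sketch (not the patent's actual implementation): the electronic
# device keeps a per-content-class usage-count table and renormalizes it
# into personal use frequencies after the user saves an image in a style.
from collections import defaultdict

# usage_counts[content_id][style_id] -> number of times the user kept that style
usage_counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

def on_second_image_saved(content_id: str, style_id: str) -> dict[str, float]:
    """Update the count for the saved style and return the refreshed
    personal use frequencies for that content class."""
    usage_counts[content_id][style_id] += 1
    counts = usage_counts[content_id]
    total = sum(counts.values())
    return {style: n / total for style, n in counts.items()}

# Example: the user keeps a "portrait" image rendered in a "grayscale" style.
print(on_second_image_saved("portrait", "grayscale"))
```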
In a possible implementation manner of the first aspect, extracting the image content of the first image includes: inputting the first image into a first multi-modal large language model, and taking the encoding output after processing by the image encoder and the image-text alignment layer in the first multi-modal large language model as the image content of the first image. The electronic device includes the first multi-modal large language model, which is obtained by training a second multi-modal large language model.
Extracting the image content of the first image through the first multi-modal large language model can improve the accuracy of the similarity calculation between the candidate image contents and the image content of the first image.
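As an illustration of the data flow described above (image encoder followed by an image-text alignment layer, whose output encoding serves as the image content), the following sketch uses small stand-in PyTorch modules; the layer sizes and module definitions are assumptions, not the patent's actual model.

```python
# Structural sketch only: the first image passes through an image encoder,
# then an image-text alignment layer, and the resulting encoding is used
# as the image content of the first image. Module sizes are illustrative.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

class ImageTextAlignmentLayer(nn.Module):
    """Projects image features into the text embedding space of the LLM."""
    def __init__(self, dim_in: int = 256, dim_out: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

encoder, align = ImageEncoder(), ImageTextAlignmentLayer()
first_image = torch.rand(1, 3, 224, 224)       # placeholder input image
content_code = align(encoder(first_image))     # "image content" encoding
print(content_code.shape)                      # torch.Size([1, 512])
```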
In a possible implementation manner of the first aspect, if data such as the candidate image contents and image styles were stored in the electronic device directly as images, they would occupy a large amount of storage space. Therefore, the image processing method further includes: extracting content codes of candidate images by using the image encoder and the image-text alignment layer in the first multi-modal large language model, clustering the content codes, and labeling each cluster with a content label to obtain the candidate image contents; and extracting text features of candidate style description texts by using the large language model in the first multi-modal large language model, performing image-text alignment on the text features through a linear projection layer to obtain style codes, clustering the style codes, and labeling each cluster with a style label to obtain the image styles corresponding to the candidate image contents. The first multi-modal large language model is obtained by training the second multi-modal large language model and includes the image encoder, the image-text alignment layer, and the large language model. In this way, the candidate image contents and image styles are obtained in encoded form through the first multi-modal large language model, and all contents or styles in the same cluster are treated as one candidate image content or image style, which reduces the amount of stored data.
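The following hedged sketch illustrates the clustering step for content codes; the patent does not name a clustering algorithm, so the use of k-means, the number of clusters, and the cluster labels are assumptions. Style codes could be clustered in the same way.

```python
# Hedged sketch: the patent describes clustering the content codes and
# labeling each cluster; k-means and the labels below are illustrative
# assumptions, not specified by the patent.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
content_codes = rng.normal(size=(500, 512))   # codes from the alignment layer

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(content_codes)

# One stored "candidate image content" per cluster: its centroid code plus a
# human-assigned content label for that cluster.
candidate_image_contents = [
    {"label": f"content_class_{i}", "code": center}
    for i, center in enumerate(kmeans.cluster_centers_)
]
print(len(candidate_image_contents), candidate_image_contents[0]["code"].shape)
```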
In a possible implementation manner of the first aspect, the image processing method further includes: performing classification training on style decision factors for the second multi-modal large language model, and then training its image content extraction capability, to obtain the first multi-modal large language model. The style decision factors include one or more of environment, light, and theme. The second multi-modal large language model includes an image encoder, an image-text alignment layer, and a large language model; the output of the image encoder is the input of the image-text alignment layer, and the output of the image-text alignment layer is the input of the large language model.
In this implementation, the second multi-modal large language model is trained in two stages to obtain the first multi-modal large language model, so that the trained first multi-modal large language model has both the ability to extract fine-grained image content and the ability to extract style decision factors.
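The two-stage training described above could be organized roughly as in the skeleton below; the stand-in backbone, classification heads, synthetic data, and hyperparameters are all placeholders, not the patent's training procedure.

```python
# Skeleton only: stage 1 trains classification of style decision factors
# (environment / light / theme), stage 2 then trains finer-grained content
# extraction, here approximated as classification over content classes.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_stage(backbone: nn.Module, head: nn.Module, loader: DataLoader,
                epochs: int = 1) -> nn.Module:
    """Generic supervised stage: backbone features -> classification head."""
    opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(head(backbone(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone

def fake_loader(num_classes: int) -> DataLoader:
    x = torch.rand(64, 3, 32, 32)
    y = torch.randint(0, num_classes, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=16)

# Stand-in for the visual path of the second multi-modal model.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
factor_head = nn.Linear(128, 3)     # stage 1: environment / light / theme
content_head = nn.Linear(128, 16)   # stage 2: fine-grained content classes

backbone = train_stage(backbone, factor_head, fake_loader(3))       # stage 1
first_model = train_stage(backbone, content_head, fake_loader(16))  # stage 2
```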
In a possible implementation manner of the first aspect, in general, the greater the number of uses, the higher the use frequency; conversely, the fewer the uses, the lower the use frequency. Accordingly, the electronic device can determine the personal use frequency by counting the user's personal use counts of the various image styles. Based on this, obtaining the user's personal use frequency for the candidate image styles includes: counting the user's personal use count of each candidate image style corresponding to the candidate image content; and normalizing the personal use counts of the candidate image styles to obtain the personal use frequency of each candidate image style. Determining the personal use frequency by counting the user's personal use counts of the image styles ensures the accuracy of the personal use frequency.
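A minimal sketch of the normalization described above, with illustrative counts:

```python
# Per-style personal use counts for one candidate image content are
# normalized into personal use frequencies. Values are illustrative only.
import numpy as np

personal_use_counts = np.array([12, 3, 5], dtype=float)  # e.g. styles A, B, C
personal_use_freq = personal_use_counts / personal_use_counts.sum()
print(personal_use_freq)  # [0.6  0.15 0.25]
```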
In a possible implementation manner of the first aspect, the electronic device further stores a public use frequency of each image style corresponding to each candidate image content; the public use frequency is obtained by counting the public use count of each candidate image style corresponding to each candidate image content and then normalizing. Obtaining the user's personal use frequency for a candidate image style then includes: performing a weighted calculation on the public use frequency and the personal use frequency corresponding to the same candidate image style, and taking the obtained weighted frequency as the final personal use frequency of the candidate image style.
In this implementation, the electronic device combines the public use frequency to obtain the user's final personal use frequency, which ensures the accuracy of the personal use frequency when the user wishes to refer to public aesthetics or when the user's personal usage data is sparse.
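A minimal sketch of the weighted combination, where the weight value is an assumption not given by the patent:

```python
# The public and personal use frequencies of the same candidate image styles
# are combined into final personal use frequencies. Values are illustrative.
import numpy as np

personal_freq = np.array([0.60, 0.15, 0.25])
public_freq = np.array([0.20, 0.50, 0.30])
w = 0.7  # assumed weight toward the user's own history
final_personal_freq = w * personal_freq + (1 - w) * public_freq
print(final_personal_freq)
```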
In a possible implementation manner of the first aspect, the electronic device includes a first image generation model, which is obtained by training a second image generation model. Displaying the second image of the second image style on the user interface includes: inputting an image in the gallery, or an image captured in real time by the camera of the electronic device, together with the second image style into the first image generation model to obtain the second image, and displaying it on the user interface. In this implementation, the trained neural network model is used to generate the second image of the second image style quickly and accurately.
In a possible implementation manner of the first aspect, the image processing method further includes: performing first-stage training on the second image generation model by using noise images and style description texts to obtain a first-stage image generation model; performing second-stage training on the first-stage image generation model by using first training images and style description texts to obtain a second-stage image generation model; and performing third-stage training on the second-stage image generation model by using second training images and style description texts to obtain the first image generation model, where the image style of the second training images is opposite to the image style described by the style description texts.
In a possible implementation manner of the first aspect, the electronic device may select at least two second image styles from the candidate image styles according to the preference degree, in which case the second image includes at least two second sub-images, each corresponding to a different second image style. Further, displaying the second image of the second image style on the user interface may include: displaying one second sub-image on the user interface; receiving a third operation of the user on the user interface, where the third operation is used to trigger the electronic device to switch the displayed second sub-image; and, in response to the third operation, switching to display another second sub-image. By matching at least two second image styles according to the user's personal preference degree for the candidate image styles, this implementation provides the user with more style choices, better satisfying the user's personal aesthetic preference and improving the accuracy of image processing.
In a possible implementation manner of the first aspect, to meet the user's possible need to turn off the personalized processing, after the second image of the second image style is displayed on the user interface, the image processing method further includes: receiving a fourth operation of the user on the user interface, where the fourth operation is used to trigger the electronic device to restore the image style of the image in the user interface; and, in response to the fourth operation, displaying the first image of the first image style on the user interface.
In a second aspect, the present application provides an electronic device, including: a memory, a display screen, and one or more processors, where the memory and the display screen are coupled to the processors; the display screen is used to display a user interface, and the memory stores one or more pieces of computer program code including computer instructions. The computer instructions, when executed by the processor, cause the electronic device to perform the steps of: displaying a first image of a first image style on the user interface; receiving a first operation of a user on the user interface, the first operation being used to trigger the electronic device to convert the image style of the image in the user interface; in response to the first operation, extracting the image content of the first image, and taking the at least one image style corresponding to the candidate image content with the highest similarity to the image content of the first image as the candidate image styles; and finally, obtaining the user's preference degree for the candidate image styles, selecting a second image style from the candidate image styles according to the preference degree, and displaying a second image of the second image style on the user interface. The user interface may be a large-image display page of a gallery in the electronic device, in which case the first image is an image in the gallery and the second image is the image obtained by converting the first image from the first image style to the second image style. Alternatively, the user interface may be a preview interface for image capture on the electronic device, in which case the first image is obtained by processing an image captured in real time by the camera of the electronic device with the first image style, and the second image is obtained by processing an image captured in real time by the camera with the second image style.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: receiving a second operation of the user on the user interface, where the second operation is used to trigger the electronic device to save the second image of the second image style; and, in response to the second operation, updating the stored personal use frequency of the second image style.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: inputting the first image into a first multi-modal large language model, and taking the encoding output after processing by the image encoder and the image-text alignment layer in the first multi-modal large language model as the image content of the first image.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: extracting content codes of candidate images by using the image encoder and the image-text alignment layer in the first multi-modal large language model, clustering the content codes, and labeling each cluster with a content label to obtain the candidate image contents; and extracting text features of candidate style description texts by using the large language model in the first multi-modal large language model, performing image-text alignment on the text features through a linear projection layer to obtain style codes, clustering the style codes, and labeling each cluster with a style label to obtain the image styles corresponding to the candidate image contents.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: performing classification training on style decision factors for the second multi-modal large language model, and then training its image content extraction capability, to obtain the first multi-modal large language model; where the style decision factors include one or more of environment, light, and theme; the second multi-modal large language model includes an image encoder, an image-text alignment layer, and a large language model; the output of the image encoder is the input of the image-text alignment layer, and the output of the image-text alignment layer is the input of the large language model.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: counting the user's personal use count of each candidate image style corresponding to the candidate image content; and normalizing the personal use counts of the candidate image styles to obtain the personal use frequency of each candidate image style.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: performing a weighted calculation on the public use frequency and the personal use frequency corresponding to the same candidate image style, and taking the obtained weighted frequency as the final personal use frequency of the candidate image style.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: inputting an image in the gallery, or an image captured in real time by the camera of the electronic device, together with the second image style into the first image generation model to obtain the second image, and displaying it on the user interface.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: performing first-stage training on the second image generation model by using noise images and style description texts to obtain a first-stage image generation model; performing second-stage training on the first-stage image generation model by using first training images and style description texts to obtain a second-stage image generation model; and performing third-stage training on the second-stage image generation model by using second training images and style description texts to obtain the first image generation model, where the image style of the second training images is opposite to the image style described by the style description texts.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: displaying a second sub-image on the user interface; receiving a third operation of a user on the user interface, wherein the third operation is used for triggering the electronic equipment to switch and display the second sub-image; in response to the third operation, another second sub-image is displayed in a switching manner. The second sub-images correspond to different second image styles, and at least two second image styles are selected from the candidate image styles according to the preference degree.
In a possible implementation manner of the second aspect, the computer instructions, when executed by the processor, cause the electronic device to further perform the steps of: receiving a fourth operation of a user on the user interface, wherein the fourth operation is used for triggering the electronic equipment to restore the image style of the image in the user interface; in response to the fourth operation, a first image of the first image style is displayed at the user interface.
In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor in an electronic device, causes the electronic device to perform the image processing method as in the first aspect and any one of its possible designs.
In a fourth aspect, the present application provides a computer program product for, when run on a computer, causing the computer to perform the method as in the first aspect and any one of its possible designs. The computer may be the electronic device described above.
It will be appreciated that, for the advantageous effects achieved by the electronic device according to any possible design of the second aspect, the computer-readable storage medium according to the third aspect, and the computer program product according to the fourth aspect, reference may be made to the advantageous effects of the first aspect and any of its possible designs, which are not repeated here.
Drawings
Fig. 1 is a schematic diagram of the image style preferences of different users in the same scene according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an operation performed on a user interface according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 4 is a software architecture block diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of an image processing method according to an embodiment of the present application;
Fig. 6 is a training schematic diagram of a multi-modal large language model according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the construction flow of candidate image contents and image styles according to an embodiment of the present application;
Fig. 8 is a training schematic diagram of a linear projection layer according to an embodiment of the present application;
Fig. 9 is a schematic diagram of image content extraction for a first image according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a personal frequency matrix according to an embodiment of the present application;
Fig. 11 is a schematic diagram of a personal count matrix according to an embodiment of the present application;
Fig. 12 is a schematic diagram of the training process of an image generation model according to an embodiment of the present application;
Fig. 13 is a schematic diagram of a switching operation on a second image according to an embodiment of the present application;
Fig. 14 is a flowchart of another image processing method according to an embodiment of the present application;
Fig. 15 is a schematic diagram of a fourth operation performed on a user interface according to an embodiment of the present application;
Fig. 16 is a flowchart of another image processing method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the present application. In addition, in order to describe the technical solutions of the embodiments clearly, the words "first", "second", and the like are used to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or execution order, and that items modified by "first" and "second" are not necessarily different. Moreover, in the description of the embodiments of the present application, unless otherwise indicated, "plurality" means two or more.
With the development of electronic device technology, more and more electronic devices are configured with a camera module for users to capture images. Based on their shooting needs and aesthetics, users have different image style preferences for images shot in the same scene. For example, different users have different demands on the hue, lighting, composition, etc. of an image. Fig. 1 is a schematic diagram of the image style preferences of different users in the same scene. Part (1a) of Fig. 1 shows the image style that user A prefers in a portrait scene, and part (1b) of Fig. 1 shows the image style that user B prefers in the same portrait scene. As can be seen from the figure, compared with user A, user B prefers a gray-scale image style in a portrait scene.
Currently, in order to accommodate users' different image style preferences, the camera function in an electronic device may automatically adjust brightness or select a corresponding shooting mode according to the type of the photographed object. For example, when the photographed object is a person, the electronic device may select the portrait mode; when the photographed object is food, the electronic device may select the food mode. The electronic device may also select shooting modes such as macro and night scene according to other actual shooting conditions.
However, the existing method mainly considers the type of the photographed object and does not fully consider the user's personal aesthetic preference, so the image processing effect deviates and the image processing accuracy is reduced.
To solve the above problem, an embodiment of the present application provides an image processing method. The method is applied to an electronic device, and the image processing method provided by the embodiment of the present application is described below by taking the electronic device as the execution subject.
First, the electronic device displays a first image of a first image style on a user interface in response to a user operation. The user interface may be a preview interface for image capture on the electronic device; in that case, the first image may be the image displayed in the camera viewfinder after the electronic device turns on the camera function in response to a user operation, that is, an image captured in real time by the camera of the electronic device. Alternatively, the user interface may be a large-image display page of a gallery (album) in the electronic device; in that case, the first image may be an image that has already been saved after shooting, for example, an image stored in the gallery (album). The first image style is the image style of the image in the user interface before the electronic device performs the image style conversion, and can be understood as the original image style.
Then, the electronic device waits to receive a first operation of the user on the user interface, where the first operation is used to trigger the electronic device to convert the image style of the image displayed in the user interface. In the embodiments of the present application, converting the image style can be understood as performing personalized style processing on the image. Personalized style processing refers to performing corresponding style processing on the image based on the user's personal style and aesthetic preference, including tone processing, lighting processing, and the like.
Second, because the preference degree can generally reflect the user's personal preferences, in the embodiment of the present application, after receiving the first operation on the user interface, the electronic device responds to the first operation and matches a second image style that conforms to the user's personal aesthetic preference according to the user's personal preference degree for the candidate image styles. The candidate image styles are obtained by matching the image content of the first image. A plurality of candidate image contents are stored in the electronic device in advance, and each candidate image content corresponds to at least one image style. Further, when the electronic device receives the first operation and determines that the image style needs to be converted, it first extracts the image content of the first image, and then matches the extracted image content of the first image against the stored candidate image contents. The image styles corresponding to the candidate image content that matches the image content of the first image are the image styles corresponding to the image content of the first image, which are referred to as the candidate image styles in the embodiments of the present application. Whether the image contents match can be determined by calculating similarity, and the image styles corresponding to the candidate image content with the highest similarity to the image content of the first image are taken as the candidate image styles.
After the electronic device obtains the candidate image styles, it selects an image style from the candidate image styles as the second image style according to the user's preference degree for each candidate image style. For example, the image style with the highest user preference degree among the candidate image styles may be selected as the second image style that conforms to the user's personal aesthetic preference.
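Putting the matching and selection steps together, the following hedged sketch matches the first image's content code against stored candidate image contents by cosine similarity and then picks the candidate style with the highest preference degree (here represented by a personal use frequency); all data values and identifiers are illustrative assumptions.

```python
# Illustrative sketch of the selection step: content matching by cosine
# similarity, then style selection by the highest preference degree.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

first_image_code = np.array([0.9, 0.1, 0.3])
candidates = {
    "portrait":  {"code": np.array([0.8, 0.2, 0.3]),
                  "styles": {"grayscale": 0.6, "warm_tone": 0.4}},
    "landscape": {"code": np.array([0.1, 0.9, 0.5]),
                  "styles": {"vivid": 0.7, "night": 0.3}},
}

# Step 1: candidate image content with the highest content similarity.
best_content = max(candidates,
                   key=lambda c: cosine(first_image_code, candidates[c]["code"]))
# Step 2: among its styles, the one the user prefers most is the second image style.
second_image_style = max(candidates[best_content]["styles"],
                         key=candidates[best_content]["styles"].get)
print(best_content, second_image_style)  # portrait grayscale
```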
In the embodiments of the present application, the second image style is selected based on both the image content of the first image currently displayed on the user interface and the user's preference degree, so an image style that better matches the currently displayed image content can be obtained, which further improves the accuracy of the image style conversion while satisfying the user's personal aesthetic preference.
Finally, the electronic device converts the image style of the image in the user interface into the selected second image style. After the image style conversion is completed, the electronic device displays a second image of the second image style on the user interface.
When the user interface is a large-image display page of a gallery (album) and the first image is an image stored in the gallery (album), the content of the first image generally does not change, because the first image is an already-saved image. Therefore, the second image is the image obtained by converting the first image from the first image style to the second image style. That is, in this case, the image styles of the first image and the second image are different, but their image contents are identical.
When the user interface is a preview interface of the camera function in the electronic device, the image in the viewfinder of the preview interface is an image captured in real time by the camera of the electronic device. Therefore, when the user holds the electronic device, the image content in the viewfinder is liable to change due to the user's movements. Likewise, when the object captured by the camera of the electronic device changes dynamically (such as sunrise or sunset), the image content in the viewfinder also changes slightly. In this case, the first image and the second image are images captured in real time by the camera at different times, and their image contents are not necessarily identical. Thus, the second image of the second image style displayed on the user interface is not necessarily the image obtained by converting the first image from the first image style to the second image style. Rather, the second image is the image captured in real time by the camera and processed with the second image style. Similarly, the first image of the first image style displayed on the user interface is the image currently captured in real time by the camera and processed with the first image style.
In summary, when the user interface is a preview interface, the image contents of the first image and the second image are both based on images captured in real time by the camera of the electronic device. In this case, after the electronic device performs the image style conversion, the first image and the second image may have identical image contents apart from their different image styles, or their contents may differ due to the user's hand movements or dynamic changes in the object captured by the camera. The second image can be understood as the image obtained by converting the first image from the first image style to the second image style only if the images captured in real time by the camera are completely unchanged before and after the electronic device completes the image style conversion.
By way of example, the first operation described above may be the user's click operation on an "image style processing" button/control on the electronic device. The "image style processing" button/control is displayed on the user interface, and it may be any icon or text, depending on the design requirements of the actual functional interface of the user interface.
Taking a mobile phone as an example of the electronic device, refer to the user interface shown in Fig. 2, which is a preview interface of the camera in the mobile phone; the "image style processing" button/control is the "AI" button/control in the preview interface shown in Fig. 2. Thus, when the electronic device receives the user's click operation (the first operation) to turn on the "AI" button/control on the preview interface, the electronic device responds to the first operation and performs style conversion on the portrait image in the preview interface shown in Fig. 2.
As an example, Fig. 3 is a schematic structural diagram of an electronic device 100.
The electronic device 100 in the embodiments of the present application may include at least one of a mobile phone, a camera, a video camera, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, a cellular phone, a personal digital assistant (personal digital assistant, PDA), an augmented reality (augmented reality, AR) device, a Virtual Reality (VR) device, an artificial intelligence (artificial intelligence, AI) device, a wearable device, a vehicle-mounted device, a smart home device, or a smart city device.
The embodiment of the application does not particularly limit the specific type of the electronic device.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) connector 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera module 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include, among other things, a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The processor 110 may generate operation control signals according to the instruction operation code and timing signals to complete the control of instruction fetching and instruction execution. In particular, for the embodiments of the present application, the conversion of the image style may be implemented by the processor 110 of the electronic device 100 executing the relevant instructions. That is, in response to the first operation, the processor 110 of the electronic device extracts the image content of the first image, and takes the at least one image style corresponding to the candidate image content with the highest similarity to the image content of the first image among the plurality of candidate image contents as the candidate image styles. The processor 110 then obtains the user's preference degree for the candidate image styles, selects a second image style from the candidate image styles according to the preference degree, and finally displays a second image of the second image style on the user interface.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 may be a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system. In particular, for the embodiments of the present application, the memory may store a plurality of candidate image contents and at least one image style corresponding to each candidate image content.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others. The processor 110 may be connected to the touch sensor, the audio module, the wireless communication module, the display screen, the camera module, and the like through at least one of the above interfaces.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card. Or transfer files such as music, video, etc. from the electronic device to an external memory card.
The internal memory 121 may be used to store computer executable program code that includes instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as an image capturing function, a sound playing function, an image playing function, etc.) required for at least one function of the operating system, and the like. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional methods or data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The charge management module 140 is configured to receive a charge input from a charger and charge the battery 142. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the display 194, the camera module 193, the wireless communication module 160, etc. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
Wherein the antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), bluetooth low energy (bluetooth low energy, BLE), ultra Wide Band (UWB), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied on the electronic device 100.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with networks and other electronic devices through wireless communication techniques.
The electronic device 100 may implement display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. In particular, for the embodiments of the present application, the display 194 may display the user interface in the embodiments of the present application and display the first image of the first image style and the second image of the second image style in the user interface. The display 194 includes a display panel. In the embodiments of the present application, the display panel may employ an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include one or more display screens 194.
The electronic device 100 may implement camera functions through a camera module 193, an isp, a video codec, a GPU, a display screen 194, and an application processor AP, a neural network processor NPU, etc.
The camera module 193 may be used to acquire color image data and depth data of a photographed object. The ISP may be used to process the color image data acquired by the camera module 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, so as to convert it into a visible image (such as the first image and the second image in the embodiments of the present application). The ISP can also optimize the noise, brightness, and skin color of the image, as well as parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be disposed in the camera module 193.
In some embodiments, the camera module 193 may be composed of a color camera module and a 3D sensing module.
In some embodiments, the photosensitive element of the camera of the color camera module may be a charge coupled device (charge coupled device, CCD) or a complementary metal oxide semiconductor (complementary metal-oxide-semiconductor, CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format.
In some embodiments, the 3D sensing module may be a time-of-flight (TOF) 3D sensing module or a structured-light 3D sensing module. Structured-light 3D sensing is an active depth sensing technology, and the basic components of the structured-light 3D sensing module may include an infrared (Infrared) emitter, an IR camera module, and the like. The working principle of the structured-light 3D sensing module is to project a light spot pattern with a specific shape onto the photographed object, receive the light coding of the spot pattern on the surface of the object, compare it with the originally projected spot pattern, and calculate the three-dimensional coordinates of the object by using the triangulation principle. The three-dimensional coordinates include the distance from the electronic device 100 to the photographed object. TOF 3D sensing is also an active depth sensing technology, and the basic components of the TOF 3D sensing module may include an infrared (Infrared) emitter, an IR camera module, and the like. The working principle of the TOF 3D sensing module is to calculate the distance (that is, the depth) between the TOF 3D sensing module and the photographed object from the round-trip time of the infrared light, so as to obtain a 3D depth map.
The structured light 3D sensing module can also be applied to the fields of face recognition, somatosensory game machines, industrial machine vision detection and the like. The TOF 3D sensing module can also be applied to the fields of game machines, augmented reality (augmented reality, AR)/Virtual Reality (VR), and the like.
In other embodiments, camera module 193 may also be comprised of two or more cameras. The two or more cameras may include a color camera that may be used to capture color image data of the object being photographed. The two or more cameras may employ stereoscopic vision (stereo) technology to acquire depth data of the photographed object. The stereoscopic vision technology is based on the principle of parallax of human eyes, and obtains distance information, i.e., depth information, between the electronic device 100 and the object to be photographed by shooting images of the same object from different angles through two or more cameras under a natural light source and performing operations such as triangulation.
In some embodiments, electronic device 100 may include 1 or more camera modules 193. Specifically, the electronic device 100 may include 1 front camera module 193 and 1 rear camera module 193. The front camera module 193 can be used to collect color image data and depth data of a photographer facing the display screen 194, and the rear camera module can be used to collect color image data and depth data of a photographed object (such as a person, a landscape, etc.) facing the photographer.
In some embodiments, a CPU, GPU, or NPU in the processor 110 may process the color image data and depth data acquired by the camera module 193. In some embodiments, the NPU may identify the color image data acquired by the camera module 193 (specifically, the color camera module) through a neural network algorithm, such as a convolutional neural network (CNN) algorithm on which the skeletal point identification technique is based, to determine the image content of the photographed object. The CPU or GPU may also run the neural network algorithm to determine the image content of the photographed object from the color image data.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Referring to fig. 4, a software architecture block diagram of an electronic device 100 according to an embodiment of the present application is shown.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into five layers, from top to bottom: an application layer (applications), an application framework layer (application framework), the Android Runtime (ART) and native C/C++ libraries, a hardware abstraction layer (Hardware Abstraction Layer, HAL), and a kernel layer (kernel).
The application layer may include a series of applications. As shown in fig. 4, the application package may include applications for cameras, gallery, calendar, phone calls, maps, navigation, WLAN, bluetooth, music, video, short messages, etc.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 4, the application framework layer may include a window manager, a content provider, a view system, a resource manager, a notification manager, an activity manager, an input manager, and so forth.
The window manager provides window management services (Window Manager Service, WMS) that may be used for window management, window animation management, surface management, and as a transfer station to the input system.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without requiring user interaction. For example, the notification manager is used to notify that a download has completed, to provide message alerts, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the screen, such as notifications of applications running in the background, or notifications in the form of a dialog window on the screen. For example, text information is prompted in the status bar, a prompt tone sounds, the electronic device vibrates, an indicator light blinks, and the like.
The activity manager may provide activity management services (Activity Manager Service, AMS) that may be used for system component (e.g., activity, service, content provider, broadcast receiver) start-up, handoff, scheduling, and application process management and scheduling tasks.
The input manager may provide input management services (Input Manager Service, IMS), which may be used to manage inputs to the system, such as touch screen inputs, key inputs, sensor inputs, and the like. The IMS retrieves events from the input device node and distributes the events to the appropriate windows through interactions with the WMS.
The Android runtime layer includes the core library and the Android runtime (ART). The Android runtime is responsible for converting bytecode into machine code, mainly through ahead-of-time (AOT) compilation and just-in-time (JIT) compilation.
The core library is mainly used for providing the functions of basic Java class libraries, such as basic data structures, mathematics, IO, tools, databases, networks and the like. The core library provides an API for the user to develop the android application.
The native C/c++ library may include a plurality of functional modules. For example: surface manager (surface manager), media Framework (Media Framework), libc, openGL ES, SQLite, webkit, etc.
The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications. The media framework supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, and the like. OpenGL ES provides drawing and manipulation of 2D and 3D graphics in applications. SQLite provides a lightweight relational database for applications of the electronic device 100.
The hardware abstraction layer runs in a user space (user space), encapsulates the kernel layer driver, and provides a call interface to the upper layer. The hardware abstraction layer at least comprises a display module, an audio module, a camera module and a Bluetooth module.
The kernel layer is a layer between hardware and software, and at least comprises a display driver, a camera driver, an audio driver and a Bluetooth driver.
The workflow of the electronic device 100 software and hardware is illustrated below in connection with capturing a photo scene.
When the touch sensor receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including information such as the touch coordinates and the timestamp of the touch operation). The raw input event is stored in the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a click operation and the control corresponding to the click operation being the control of the camera application icon as an example, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera module 193. Specifically, in the embodiment of the present application, the image or video captured by the camera module 193 can be regarded as the first image or the second image in the embodiment of the present application.
As shown in fig. 5, a flow chart of an image processing method is provided. The image processing method provided in the embodiment of the present application is described in detail below with reference to fig. 5. Referring to fig. 5, the image processing method includes steps S501 to S505.
S501, the electronic device displays a first image of a first image style on a user interface.
S502, the electronic equipment receives a first operation of a user on a user interface.
The user interface may be a large graphic display page of a gallery in the electronic device, and the first image style is an image style of the electronic device prior to style conversion, which may be understood as an original image style. The first image is an image displayed in a large-drawing display page, that is, an image stored in a gallery. The image style of the first image is the first image style.
The user interface may also be a preview interface of a shot image in the electronic device, and then the first image is an image currently acquired in real time by the camera of the electronic device, and the image is processed by adopting a first image style and then displayed on the user interface, that is, the image acquired in real time by the camera is displayed in the user interface in the first image style.
The first operation is for triggering the electronic device to convert an image style of an image in the user interface. For example, an image displayed in the user interface is converted from a first image style to a second image style.
S503, the electronic device, in response to the first operation, extracts the image content of the first image and takes at least one image style corresponding to the candidate image content most similar to the image content of the first image as a candidate image style.
S504, obtaining the preference degree of the user on the candidate image styles, and selecting a second image style from the candidate image styles according to the preference degree.
The electronic equipment stores a plurality of candidate image contents, and each candidate image content is correspondingly stored with at least one image style. The candidate image content and image style may be understood as being pre-stored candidate data for the electronic device to match the second image style. The electronic device determines a candidate image style corresponding to image content of the first image from the stored image styles.
The preference degree measures how much the user personally prefers a candidate image style. The higher the preference degree of a candidate image style, the better that candidate image style matches the user's shooting needs and aesthetic, and therefore the better it meets the user's personal aesthetic preference.
Specifically, the electronic device first extracts the image content of the first image, matches the image content with the locally stored candidate image content, and uses all image styles corresponding to the matched candidate image content as candidate image styles in the embodiment of the application. The matching process can be determined by calculating the similarity of the two image contents, and the electronic device selects the image style corresponding to the candidate image content with the highest similarity of the image contents of the first image as the candidate image style.
By way of example, assume that the candidate image contents stored in the electronic device include image content 1, image content 2, ..., image content n. Then, when the similarity between the image content of the first image and image content 2 is the highest, image style 1, image style 2, ..., image style n corresponding to image content 2 are determined as the candidate image styles.
Then, the electronic device selects an image style from the candidate image styles image style 1, image style 2, ..., image style n as the second image style according to the user's preference degree. For example, the electronic device may select, from the candidate image styles, the image style with the highest personal preference of the user as the second image style matched in the embodiment of the present application.
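As an illustrative sketch (not part of the original embodiment), the matching and selection steps above can be expressed in Python as follows; the data structures, similarity measure, and names are assumptions made for clarity only.

```python
import numpy as np

def select_second_style(first_image_code, candidate_codes, style_table, preferences):
    """Hypothetical sketch of the matching step described above.

    first_image_code: content code (embedding) of the first image.
    candidate_codes:  dict {candidate_content_id: embedding} from the local style library.
    style_table:      dict {candidate_content_id: [style_id, ...]}.
    preferences:      dict {style_id: preference degree, e.g. personal use frequency}.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # 1. Find the stored candidate image content most similar to the first image.
    best_content = max(candidate_codes,
                       key=lambda cid: cosine(first_image_code, candidate_codes[cid]))

    # 2. All image styles stored for that content become the candidate image styles.
    candidate_styles = style_table[best_content]

    # 3. Take the candidate style the user prefers most as the second image style.
    return max(candidate_styles, key=lambda sid: preferences.get(sid, 0.0))
```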
In some embodiments, a storage space may be divided locally at the electronic device specifically for storing candidate image content and image styles corresponding to the candidate image content. It can be understood that, equivalently, a style library is established locally in the electronic device, the determined candidate image content and the image style are stored in the style library, and the subsequent electronic device directly matches the candidate image style corresponding to the image content of the first image in the style library.
Based on this, if data such as the candidate image contents and image styles were stored directly in the style library in the form of images, a large amount of storage space of the electronic device would inevitably be occupied. Therefore, in order to reduce the amount of data stored in the local style library and keep the storage space it occupies as small as possible, the candidate image contents and image styles can be stored in the style library in coded form. A large number of candidate image contents and image styles can also be grouped in advance by clustering, and all candidate image contents and image styles within the same cluster are treated as one candidate image content and one image style, thereby reducing the amount of data.
Thus, the construction of a plurality of candidate image contents and at least one image style corresponding to each candidate image content in the local style library of the electronic device may include: extracting content codes of candidate images by using an image encoder and an image-text alignment layer in the first multi-mode large-scale language model, clustering the content codes, and labeling the content of each class after clustering to obtain candidate image contents; and extracting text features of the candidate style description texts by using a large language model in the first multi-mode large language model, performing image-text alignment processing on the text features by using a linear projection layer to obtain style codes, clustering the style codes, and performing style labeling on each class after clustering to obtain the image style corresponding to the candidate image content.
The electronic equipment comprises a first multi-mode large language model, wherein the first multi-mode large language model is a second multi-mode large language model after training. It can be seen that the first multi-modal large language model and the second multi-modal large language model have the same structure. The difference is that the first multimodal large language model is a trained model and the second multimodal large language model is a model prior to training. Since the model weight of each structural module is updated after the model is trained, the first multi-modal large language model differs from the second multi-modal large language model only in the model weight.
The model structures of the first multi-modal large-scale language model and the second multi-modal large-scale language model can refer to any multi-modal large-scale language model, such as a miniGPT-4 model. The multi-modal large language model is a multi-modal neural network model with images as input and texts as output. Briefly, the second multimodal large language model may be understood as an open source multimodal large language model, which has a certain image recognition question-answering capability. The first multi-modal large language model in the embodiment of the application is a model obtained by performing targeted training based on the second multi-modal large language model further based on the actual application field (the image style processing field).
In an embodiment of the present application, the multimodal large language model includes an image encoder (image encoder), an image-text alignment layer (image & text alignment), and a large language model (large language model).

The image encoder is a module for extracting visual features from an image, and may also be referred to as a visual encoder. The image-text alignment layer is used for aligning images with text. Since images and text are data of different modalities, the image features extracted by the image encoder usually need to be aligned with the text through an alignment process so that both can be analyzed and understood together. In short, image-text alignment in a multimodal large language model can be understood as mapping the output of the image encoder to the same dimension as the large language model so that the large language model can take the output of the image encoder as input. The large language model, or LLM for short, is a module used to extract semantic features from text and generate text.
In the multimodal large language model, the input of an image encoder is an image, and the image encoder processes the input image first. Then the output of the image encoder is the input of the image-text alignment layer, and the output of the image-text alignment layer is the input of the large language model. The output of the large language model is used as the final output of the multi-modal large language model.
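A minimal structural sketch of this data flow is shown below; the module internals are placeholders and the class and argument names are assumptions, not part of the embodiment.

```python
import torch.nn as nn

class MultiModalLLM(nn.Module):
    """Schematic of the forward data flow only: image encoder -> alignment layer -> LLM."""
    def __init__(self, image_encoder, alignment_layer, language_model):
        super().__init__()
        self.image_encoder = image_encoder      # visual encoder
        self.alignment_layer = alignment_layer  # maps visual features to the LLM dimension
        self.language_model = language_model    # large language model

    def forward(self, image, question_tokens):
        visual_features = self.image_encoder(image)             # image -> visual features
        aligned_tokens = self.alignment_layer(visual_features)  # align to the text space
        # The LLM consumes the aligned image tokens together with the input question
        # and produces the final text output of the multimodal model.
        return self.language_model(aligned_tokens, question_tokens)
```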
In order to facilitate solution understanding, a training process of training the second multimodal large language model to obtain the first multimodal large language model in the embodiment of the present application is described below.
In the embodiment of the present application, in order to improve the capability of the second multimodal large language model in the field of image style processing, in addition to training the model's ability to extract fine-grained image content, its ability to extract style decision factors is also trained. Thus, when the trained first multimodal large language model is used for image recognition question answering, the model output not only includes image content features rich in detail, but also includes style decision factors that assist in constructing the style library. In summary, the training process of the second multimodal large language model in the embodiment of the present application may include: first performing classification training of the style decision factors on the second multimodal large language model, and then training its image content extraction capability.
It follows that the training of the second multimodal large language model includes two training phases. The first training stage is used to train the second multi-modal large language model to extract the ability of the style decision factor features. The second training stage trains the ability of the second multimodal large language model to extract fine-grained image content. Wherein, style decision factors include at least one or more of environment, light, theme.
It should be appreciated that since the multimodal large language model is an open source chat robot model with image understanding capabilities that can implement image recognition questions and answers, it can describe images or answer questions about the content of the images. Therefore, whether the multi-modal large language model is trained or applied, besides inputting images, the multi-modal large language model needs to input questions which need to be answered, so that the multi-modal large language model can output corresponding image description text. The input problem needs to be determined based on the actual application situation, which is not limited in the embodiment of the present application. For example, for the embodiments of the present application, the question entered may be "please describe the environment/light/theme in the image", "please describe the image", "how the image should be modified", "what style the image is fit for", and so on.
Specifically, as shown in fig. 6, a training schematic diagram of a multimodal large language model is provided.
Referring to fig. 6, in the first training phase, since the extraction capability of the style decision factor features of the multimodal large language model is mainly trained, the question input to the large language model is mainly "please describe the environment/light/subject in this image". Correspondingly, the output of the large language model mainly carries text descriptions related to factors such as environment/light/subject.
In the second training phase, because of the main training image content extraction capability, the question input to the large language model other than the image is changed accordingly to "please describe the image", "how the image should be modified", "what style the image is suitable for", and so on. The corresponding output of the large language model is a text description carrying the image content. Meanwhile, since the multi-mode large language model has completed the training of the first stage, the output text description also comprises descriptions of some style decision factors.
Additionally, it should be appreciated that in an open-source multimodal large language model, both the image encoder and the large language model are typically pre-trained, so existing training of multimodal large language models primarily trains the image-text alignment layer. Therefore, in the model training process, the model weights of the image encoder and the large language model can be frozen, so that only the model weights of the image-text alignment layer are updated. Of course, the parameters of the image encoder and the large language model can also be selected for training based on the actual situation, so that the model weights of all modules in the multimodal large language model are updated to the expected values through training. Fig. 6 of the embodiment of the present application illustrates updating only the model weights of the image-text alignment layer.
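A minimal sketch of this freezing strategy, assuming a PyTorch-style model object with the attributes used in the earlier sketch (the optimizer choice and learning rate are illustrative assumptions):

```python
import torch

# Freeze the pre-trained image encoder and LLM; only the alignment layer is updated.
for p in model.image_encoder.parameters():
    p.requires_grad = False          # image encoder weights stay fixed
for p in model.language_model.parameters():
    p.requires_grad = False          # LLM weights stay fixed

optimizer = torch.optim.AdamW(model.alignment_layer.parameters(), lr=1e-4)
```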
After the training of the multi-mode large-scale language model is completed, the parameters of the first multi-mode large-scale language model obtained through training are fixed. The first multimodal large language model is then invoked to build a style gallery.
Specifically, referring to fig. 7, when the candidate image content and the image style are acquired by constructing the style gallery, the electronic device first acquires the collected candidate image and candidate style description text. Wherein the candidate image is an image collected for constructing the content of the candidate image, and the candidate style description text is a text collected for constructing the style of the image.
The candidate images are processed by the image encoder and the image-text alignment layer in the first multimodal large language model to obtain content codes. Meanwhile, the candidate style description text is input into the large language model in the multimodal large language model to extract text features, which are then input into an additional linear projection layer (projector) for image-text alignment processing to obtain style codes. The linear projection layer and the image-text alignment layer in the first multimodal large language model have the same function; both are modules for realizing image-text alignment. Thus, in embodiments of the present application, the specific network structures of the image-text alignment layer and the linear projection layer may be the same.
Then, the obtained content codes and style codes are clustered respectively, and any existing clustering method can be adopted for clustering, such as K-means clustering, mean shift clustering, density-based clustering method and the like. After clustering, the content codes and the style codes are divided into a plurality of clusters, namely, each cluster is briefly marked after being divided into a plurality of categories.
Content labeling is performed on the content-code clusters to obtain the candidate image contents. For example, a candidate image content may be "nature-animal-close-up", "portrait-window-overexposure", or the like. Style labeling is likewise performed on the style-code clusters to obtain the image styles, such as brightness improvement, clarity, vividness, and the like. The content labeling and the style labeling can be manual labeling, and the candidate image contents and image styles are stored in the electronic device after the manual labeling is completed.
Therefore, by pre-constructing candidate image contents and image styles and storing the candidate image contents and the image styles into the local style library of the electronic equipment, the subsequent electronic equipment can be directly matched with the corresponding second image styles from the local style library so as to improve the efficiency of image processing. Meanwhile, the construction of the style library is realized through clustering, so that a certain data volume can be further reduced to save the storage space.
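For illustration only, the clustering step can be sketched as follows; K-means is one of the clustering methods the embodiment mentions, and the cluster counts, array shapes, and function names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_style_library(content_codes, style_codes, n_content=100, n_style=50):
    """content_codes, style_codes: arrays of shape (num_samples, dim)."""
    content_km = KMeans(n_clusters=n_content, n_init=10).fit(content_codes)
    style_km = KMeans(n_clusters=n_style, n_init=10).fit(style_codes)
    # Each cluster centre stands for one candidate image content / one image style;
    # the human-readable labels (e.g. "nature-animal-close-up", "vivid") are attached manually.
    return content_km.cluster_centers_, style_km.cluster_centers_
```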
In addition, the image encoder, the image-text alignment layer, and the large language model used in the embodiments of the present application are modules in the first multimodal large language model. Because the first multimodal large language model is a trained model, these modules can be invoked directly. The linear projection layer, however, is an additional module for obtaining style codes aligned with the content codes, so in order for the linear projection layer to have the corresponding image-text alignment capability, it must be trained together with the modules of the first multimodal large language model before being put into use. Moreover, when the linear projection layer is trained, in order to avoid parameter changes in the trained first multimodal large language model, modules such as the image encoder, the image-text alignment layer, and the large language model need to be frozen during training so that they are not changed by the training of the linear projection layer.
Illustratively, referring to fig. 8, training of the linear projection layer is mainly achieved by constructing true and false samples (positive and negative samples) for contrastive learning between content codes and style codes. A true sample consists of a candidate image for training and a style description text that describes that candidate image; that is, the image content and style description text in a true sample correspond and match, so the label of the true sample is "1", indicating a recommendation. A false sample consists of a candidate image for training and a style description text completely unrelated to that candidate image; that is, the image content and style description text in a false sample differ or are even opposite, meaning that the image and style do not correspond, so the label of the false sample is "0", indicating no recommendation.

The constructed true and false samples are then used to train the linear projection layer. Through the "0/1" label task, the true samples guide the linear projection layer to learn in the forward direction, while the false samples assist in improving its understanding, thereby improving the performance of the linear projection layer so that it learns to generate, for different style text descriptions, style codes that match the content codes of the corresponding image content.
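A minimal sketch of such a 0/1-labeled training step is shown below; the dimensions, the dot-product matching score, and the loss choice are assumptions made to illustrate the idea, and the frozen modules are assumed to supply the content codes and text features.

```python
import torch
import torch.nn as nn

llm_dim, align_dim = 4096, 768              # assumed feature dimensions
projector = nn.Linear(llm_dim, align_dim)   # the additional linear projection layer
criterion = nn.BCEWithLogitsLoss()          # binary "recommend / do not recommend" loss
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def training_step(content_code, text_features, label):
    """content_code: frozen image encoder + alignment layer output, shape (batch, align_dim).
    text_features: frozen LLM output for the style description text, shape (batch, llm_dim).
    label: 1.0 for a true (matching) sample, 0.0 for a false sample, shape (batch,)."""
    style_code = projector(text_features)              # align the text feature with the content code
    logit = (content_code * style_code).sum(dim=-1)    # similarity used as the match score
    loss = criterion(logit, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```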
After the training of the linear projection layer is completed, the linear projection layer can be used together with each module in the first multi-mode large language model, and any collected candidate images and candidate style description texts are processed, so that aligned content codes and style codes are obtained. And then, clustering the content codes and the style codes respectively to obtain candidate image content and image styles so as to construct a style library. The number of clusters in the embodiment of the present application may be set according to actual requirements, for example, determined according to a storage space of the electronic device, and may be 200 types, 100 types, 50 types, and the like, which is not limited in the embodiment of the present application.
In some embodiments, the candidate image contents stored in the local style library of the electronic device are characterized by content codes output after processing by the image encoder and the image-text alignment layer in the first multimodal large language model. Therefore, to improve the accuracy of the similarity calculation between the candidate image contents and the image content of the first image, the image content of the first image can also be extracted by directly calling the first multimodal large language model.
Based on this, extracting the image content of the first image may include: inputting the first image into the first multimodal large language model, and taking the content code output after processing by the image encoder and the image-text alignment layer in the first multimodal large language model as the image content of the first image.
Specifically, when the electronic device extracts the image content of the first image, the trained first multi-mode large language model is called first. After the electronic device calls the first multi-modal large language model, the first image is input to the first multi-modal large language model. Because the embodiment of the application only needs to acquire the image content of the first image, the output of the image-text alignment layer in the first multi-mode large-scale language model is directly acquired. That is, only the content codes output after being processed by the image encoder and the image-text alignment layer in the first multi-mode large language model need to be obtained. As shown in fig. 9, an embodiment of the present application provides a schematic diagram of image content extraction.
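As a small illustrative sketch (assuming the model object from the earlier forward-pass sketch), only the first two modules of the first multimodal large language model are invoked here:

```python
def extract_image_content(first_image, mllm):
    """Returns the content code of the first image: only the image encoder and the
    image-text alignment layer are used; the LLM part of the model is not invoked."""
    visual_features = mllm.image_encoder(first_image)
    content_code = mllm.alignment_layer(visual_features)
    return content_code
```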
In some embodiments, in general, the more frequently a user uses an image style, the higher the user's preference for that image style. Therefore, the user's preference degree for an image style in the embodiment of the present application can be measured by the user's personal frequency of use of the candidate image style. Based on this, the personal use frequency of each image style corresponding to each candidate image content is also stored in the electronic device. The user's personal use frequency may be obtained by the electronic device from statistics of the user's historical behavior of selecting and using image styles.
Specifically, the candidate image content, image style, and personal frequency of use may be stored in a style library local to the electronic device in the form of a matrix, referred to herein in embodiments as a personal frequency matrix.
Illustratively, referring to fig. 10, each row of the personal frequency matrix represents a candidate image content, each column represents an image style, and each matrix element represents the personal use frequency of the corresponding image style for that candidate image content. That is, image content 1, image content 2, image content 3, ... in fig. 10 are the candidate image contents; image style 1, image style 2, ..., image style n are the image styles corresponding to the candidate image contents; and the matrix elements 11, 12, ..., 1n, 21, 22, ..., 2n, 31, 32, ..., 3n, ..., n1, n2, ..., nn represent the personal use frequencies.
After determining candidate image styles corresponding to image contents of the first image, the electronic device selects an image style from the candidate image styles as a second image style based on individual use frequencies of the respective candidate image styles. Generally, the more frequently used image styles tend to be the more favored image styles for users. Thus, the electronic device can compare the individual use frequencies of all candidate image styles corresponding to the image content of the first image, and take the image style with the highest individual use frequency as the second image style. For example, assuming that the candidate image content having the highest similarity to the image content of the first image is image content 1 in fig. 10, the individual use frequency of the corresponding candidate image style includes 11, 12, …, 1n. When the personal use frequency 12 is the maximum value, the electronic device can take the image style 2 corresponding to the personal use frequency 12 as the second image style.
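A toy numerical sketch of this lookup is shown below; the matrix values and style names are invented purely for illustration.

```python
import numpy as np

# Hypothetical personal frequency matrix: rows = candidate image contents, columns = image styles.
personal_freq = np.array([
    [0.10, 0.60, 0.30],   # image content 1
    [0.50, 0.25, 0.25],   # image content 2
])
style_names = ["image style 1", "image style 2", "image style 3"]

matched_content_row = 0   # candidate content most similar to the first image (image content 1)
second_style = style_names[int(np.argmax(personal_freq[matched_content_row]))]
print(second_style)       # -> "image style 2" in this toy example
```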
In some embodiments, the greater the number of uses, the higher the frequency of usage is characterized. Conversely, the fewer the number of uses, the lower the frequency of use. Accordingly, the electronic device can determine the individual use frequency by counting the number of individual uses of the users of various image styles.
Based on this, the acquisition of the individual use frequency of the candidate image style may include: counting the personal use times of the users of each candidate image style corresponding to the candidate image content; and carrying out normalization calculation on the personal use times of the users of the candidate image styles to obtain the personal use frequency of the candidate image styles.
Specifically, the electronic device counts the number of personal uses of the user of each image style corresponding to the candidate image content according to the image style selected by the user for the candidate image content history.
Taking the candidate image contents and image styles shown in fig. 10 as an example, assume that the user has historically chosen image style 1 three times for image content 1; then under image content 1, image style 1 has been personally used by the user three times. Similarly, assume that the user has historically chosen image style 1 twice for image content 2; then under image content 2, image style 1 has been personally used twice. The electronic device then normalizes the personal use counts of all image styles corresponding to each candidate image content, thereby obtaining the personal use frequency of each image style corresponding to each candidate image content.
In the embodiment of the present application, the user's personal use counts can also be represented by a matrix, referred to in the embodiment of the present application as the personal number matrix. As shown in fig. 11, a schematic diagram of the personal number matrix is provided.
Referring to fig. 11, the personal number matrix differs from the personal frequency matrix in that the matrix elements of the personal number matrix represent the user's personal use counts. For example, the first row of matrix elements in the personal number matrix represents the number of times the user has historically used each image style under image content 1. After the electronic device establishes the number matrix, each element of the matrix initially takes the value 0. The electronic device then updates the matrix elements in the number matrix according to the image styles the user selects for different candidate image contents: whenever the electronic device detects that the user selects an image style for a candidate image content, the value of the matrix element corresponding to that candidate image content and image style is incremented by 1.
For example, referring to fig. 11, when the user selects image style 1 to be used once for image content 1, the number of uses 11 is incremented by 1. If the number of uses 11 is currently the initial value of 0, the number of uses 11 is updated to 1 after being increased. If the number of uses 11 is currently 2, the number of uses 11 is updated to 3 after incrementing, and so on.
Then, the electronic device normalizes each row of matrix elements in the personal number matrix to obtain the personal frequency matrix. To ensure the accuracy of the personal frequency matrix, whenever the personal number matrix is updated, the electronic device needs to recalculate the personal use frequencies of the image styles and thus update the personal frequency matrix. It should be understood that representing image contents by matrix rows and image styles by matrix columns in the personal number matrix and the personal frequency matrix is merely an example of the embodiment of the present application, which is not limited in this regard. Depending on usage requirements, the matrix rows may also represent image styles and the matrix columns the image contents, which does not affect the implementation of the embodiments of the present application.
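A minimal sketch of the count-then-normalize bookkeeping described above; the library size, guard against empty rows, and function names are assumptions.

```python
import numpy as np

n_contents, n_styles = 100, 50                      # assumed style library size
personal_counts = np.zeros((n_contents, n_styles))  # personal number matrix, all zeros initially

def record_use(content_idx, style_idx):
    """Called whenever the user selects an image style for a given candidate image content."""
    personal_counts[content_idx, style_idx] += 1

def personal_frequency_matrix():
    """Row-wise normalization of the personal number matrix into the personal frequency matrix."""
    row_sums = personal_counts.sum(axis=1, keepdims=True)
    return personal_counts / np.where(row_sums == 0, 1, row_sums)
```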
In some embodiments, for the case where the user wants to reference the public aesthetic or the user's personal usage data is limited, the electronic device may consider the public aesthetic in addition to the user's personal aesthetic when selecting an image style from the local style library as the second image style. Therefore, the public use frequency of each image style corresponding to each candidate image content can also be stored in the electronic device in advance.
The public use frequency characterizes the public's preference for an image style; the image styles commonly used by most users for certain image content can be determined from the public use frequency. In the embodiment of the present application, the public use frequency is computed in a manner similar to the personal use frequency: the number of times the public has chosen an image style for a candidate image content (i.e., the public use count) is counted, and the public use counts of the image styles corresponding to each candidate image content are then normalized.
Specifically, after the candidate image contents and image styles are determined, a number matrix may likewise be established as a public number matrix based on the candidate image contents and image styles. Then, based on collected public image data, the number of co-occurrences of each candidate image content and image style is counted, and these co-occurrence counts are filled, as the public use counts, into the corresponding positions in the public number matrix. Normalization is then performed on the public number matrix to obtain the public frequency matrix, which is stored locally in the electronic device.

That is, the personal number matrix and the personal frequency matrix are maintained and updated in real time by the electronic device based on the user's behavior of using image styles for candidate image contents, whereas the public frequency matrix is obtained in advance: after the candidate image contents and image styles are obtained, a public number matrix is built by counting the co-occurrences of candidate image contents and image styles in collected public image data, and the rows of the public number matrix are then normalized. Since the public frequency matrix is constructed from collected, publicly available image data, the frequencies in the public frequency matrix can characterize the public's preference for using image styles with different image contents.
Based on this, the obtaining of the personal use frequency of the candidate image style may further include: and carrying out weighting calculation on the public use frequency and the user personal use frequency corresponding to the same candidate image style, and taking the obtained weighting frequency as the final personal use frequency of the candidate image style.
Specifically, when determining the preference degree of each candidate image style corresponding to the image content of the first image, the electronic device needs to acquire the public use frequency of the corresponding candidate image style from the locally stored public frequency matrix in addition to acquiring the personal use frequency of the corresponding candidate image style from the stored personal frequency matrix. Then, the mass use frequency and the individual use frequency belonging to the same candidate image style are subjected to weighting calculation. The electronic equipment takes the weighted frequency obtained after the weighted calculation as the final personal use frequency of the candidate image style, and the final personal use frequency of the candidate image style is used for measuring the preference degree of the user on the candidate image style.
That is, the electronic device may select the second image style directly based on the individual frequency of use of the candidate image styles in the individual frequency matrix as an indicator without regard to the aesthetic preferences of the public. In consideration of the aesthetic preference of the public, the electronic device needs to acquire the public frequency of use of the candidate image style in the public frequency matrix while acquiring the private frequency of use of the candidate image style in the private frequency matrix. And then, carrying out weighted calculation on the personal use frequency and the public use frequency corresponding to the same candidate image style, so as to obtain the personal use frequency of the candidate image style, which is finally used for measuring the preference degree of the user. I.e. the electronic device selects the second image style with the weighted personal frequency of use as an indicator.
In the weighted calculation, the weights of the public use frequency and the personal use frequency may be set to initial values, for example, 0.5 each. The electronic device may subsequently always use the set initial weights, or it may adjust and update the weight ratio based on how the user uses image styles. For example, when the electronic device observes over several consecutive statistics that the user consistently uses a certain image style for a certain candidate image content, the weight of the personal use frequency of that image style can be increased accordingly for that candidate image content.
For example, suppose that the public use frequency corresponding to image style 1 among the candidate image styles is A and the personal use frequency is B. Taking the initial weights of 0.5 as an example, the final personal use frequency of image style 1 is 0.5A + 0.5B.
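The weighted blend can be sketched in one line of Python; the example values of A and B are invented for illustration.

```python
def final_personal_frequency(public_freq, personal_freq, w_public=0.5, w_personal=0.5):
    """Weighted blend of public and personal use frequency for one candidate image style."""
    return w_public * public_freq + w_personal * personal_freq

A, B = 0.4, 0.7                            # assumed public / personal use frequency of image style 1
print(final_personal_frequency(A, B))      # 0.5 * A + 0.5 * B = 0.55
```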
S505, the electronic device displays a second image of a second image style on the user interface.
Specifically, after the electronic device matches and selects the second image style from the local style library, the image style of the first image displayed in the user interface may be converted, so as to generate and display the second image of the second image style.
When the user interface is a large map display page of a gallery (album), the first image is an image stored in the gallery (album), the second image is an image that converts the first image from a first image style to a second image style. When the user interface is a preview interface of a camera function in the electronic device, the first image is an image acquired by a camera of the electronic device in real time, and the second image is an image acquired by the camera processed by adopting the second image style in real time.
In some embodiments, to quickly and accurately generate the second image of the second image style, the generation of the second image may be accomplished using a trained neural network model.
Based on this, S505 may include: and inputting the images in the gallery or the images acquired by the camera of the electronic equipment in real time and the second image style into a first image generation model to obtain and display the second image on a user interface.
The electronic device includes a first image generation model, where the first image generation model is a trained second image generation model. The second image generation model may be understood as an open-source image generation model that has not been trained by the embodiment of the present application. Although existing open-source image generation models, such as diffusion models and generative adversarial networks (GAN), already provide a certain image generation capability and can be used directly to generate images, the image generation capability of an open-source image generation model does not necessarily satisfy the image style conversion expectations required by the embodiments of the present application.
Therefore, in order to enable the image generation model to have enough image style conversion capability, the embodiment of the application needs to perform model training on the second image generation model to obtain the first image generation model.
In order to facilitate solution understanding, the following embodiments of the present application first introduce a training process for training a second image generation model to obtain a first image generation model.
As shown in fig. 12, the training process of the image generation model mainly includes three training phases.
First, the second image generation model is fine-tuned based on image styles. In this training step, the second image generation model undergoes first-stage training with noise images and style description texts without any image content as training data. The main purpose of the first-stage training is to enable the second image generation model to recognize text descriptions of various styles. The style description texts of the embodiment of the present application can be obtained by collecting and crawling various images (referred to as first training images in the embodiment of the present application) and composing style description texts based on the actual image styles of the first training images, thereby obtaining text descriptions of various image styles. Some style description texts may also be constructed manually as training data.
Illustratively, referring to fig. 12, a noise image and a style text description "this is a map of a warm tone" are input to an image generation model, and a warm tone image is output by the image generation model. The image generation model after the first stage training is referred to as a first stage image generation model in the embodiment of the present application.
The image style conversion capability of the first stage image generation model is then pre-trained. In this training step, the first-stage image generation model is trained in a second stage using an image with image content, such as a first training image and style description text collected in the first stage, as training data. The second stage training is mainly aimed at training the image style conversion capability of the image generation model, which requires the image to be style-converted while guaranteeing that the image content is unchanged. Wherein the first training image may be subjected to some preprocessing to increase the data volume of the first training image data set in order to ensure that the training is capable of supporting a large amount of training data. For example, the first training image may be subjected to operations such as noise addition, RGB shift (RGB shift), and the like. For example, referring to fig. 12, a first training image and a style text description "this is a map of a warm tone" are input to an image generation model, and a warm tone image is output by the image generation model. The image generation model after the second stage training is referred to as a second stage image generation model in the embodiment of the present application.
Finally, fine tuning the image generation capability of the second stage image generation model. In this training step, an image with intense style colors (this image is referred to as a second training image in the embodiments of the present application) is collected. And training the second-stage image generation model in a third stage by taking the second training image and the text description of the image style which is completely opposite to the image style corresponding to the second training image as training data, thereby obtaining a first image generation model after training.
The purpose of the third stage training is mainly to enhance the style conversion capability of the image generation model. Illustratively, referring to fig. 12, a cool tone image and a style text description "this is a map of warm tones" are input to an image generation model, and a warm tone image is output by the image generation model.
It can be seen that, since the first image generation model is the second image generation model after three stages of training, the model structures of the first image generation model and the second image generation model are essentially identical and differ only in model weights. The model structure of the first and second image generation models may be any existing generative model capable of synthesizing images, for example, a diffusion model or a generative adversarial network (GAN).
After model training is completed, deploying a first image generation model obtained through training into the electronic equipment. The electronic device can then directly invoke the first image generation model to generate a second image with a second image style.
Specifically, the electronic device inputs the original image and the image style to the invoked first image generation model. The original image is a first image displayed on the user interface in the embodiment of the present application, where the first image may be an image in a gallery or an image acquired by a camera in real time according to different user interfaces. The image style input to the first image generation model is the second image style matched and selected by the electronic equipment from the style library.
That is, the electronic device invokes the first image generation model, inputting the first image displayed by the user interface and the matched selected second image style into the first image generation model. A second image with a second image style is generated from the first image generation model.
When the first image is an image in the gallery, the second image output by the first image generation model is the first image with the second image style. When the first image is an image acquired by the camera in real time in the preview interface, the second image output by the first image generation model is an image acquired by the camera with the second image style in real time.
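As an illustrative sketch only (the model's calling convention is an assumption), the style conversion step of S505 amounts to one call to the deployed first image generation model:

```python
def generate_second_image(first_image, second_style, image_generation_model):
    """Sketch of S505: the trained first image generation model converts the first image
    (a gallery image or a live camera frame) into the matched second image style while
    keeping the image content unchanged."""
    return image_generation_model(first_image, style=second_style)
```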
In some embodiments, as the user keeps using the electronic device, the personal use counts of the image styles stored in the electronic device may be updated over time. After the matrix elements of the personal number matrix are updated, the electronic device needs to perform the normalization calculation again to complete the update of the personal frequency matrix.
Based on this, when the electronic device receives a second operation of the user on the user interface, the electronic device responds to the second operation and updates the stored personal use frequency of the second image style. The second operation is used to trigger the electronic device to save the second image of the second image style.
For example, referring to fig. 10 and fig. 11, when the electronic device detects that the second image style corresponding to the second image stored in correspondence with the second operation is the image style 1 under the image content 1, the number of use 11 is incremented by 1 in the personal number matrix shown in fig. 11, so that updating of the number of personal use of the user in the personal number matrix is completed based on the current storage operation of the user.
Then, since the personal number matrix shown in fig. 11 has been updated, the electronic device needs to perform the normalization calculation again based on the updated personal number matrix to complete the update of the personal frequencies, that is, to complete the update of the personal frequency matrix shown in fig. 10. Since the use count of image style 1 under image content 1 is updated, the personal use frequencies that change are those of all image styles under image content 1, that is, personal use frequency 11, personal use frequency 12, ..., personal use frequency 1n may all be updated. In the embodiment of the present application, the user's personal use frequencies are updated in time according to the user's usage behavior, so that the next style matching can be based on the latest personal use frequencies, improving the matching accuracy.
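Continuing the earlier count/frequency sketch (all names assumed), the save operation updates one count and re-normalizes only the affected row:

```python
def on_save_second_image(content_idx, style_idx, personal_counts, personal_freq):
    """Second operation: the user saves the second image, so increment the use count
    and re-normalize the affected row of the personal frequency matrix."""
    personal_counts[content_idx, style_idx] += 1
    row = personal_counts[content_idx]
    personal_freq[content_idx] = row / row.sum()   # frequencies of every style in this row change
    return personal_freq
```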
In some embodiments, to give the user more style choices, at least two second image styles may be matched according to the user's personal preference for candidate image styles. Then, the corresponding second image includes at least two sheets. For convenience of description, in the case of at least two second images, the embodiments herein refer to them as second sub-images, and different second sub-images correspond to different second image styles.
Illustratively, when the electronic device matches the two second image styles of image style 1 and image style 2 from the candidate image styles according to the preference degree (personal frequency of use), the generated second image includes a second sub-image a and a second sub-image B. The second sub-image a corresponds to the image style 1, that is to say the second sub-image a is an image with the image style 1. The second sub-image B corresponds to image style 2, and is an image with image style 2.
Based on this, the electronic device displaying the second image with the second image style at the user interface may include: displaying a second sub-image on the user interface, and receiving a third operation of the user on the user interface; in response to the third operation, another second sub-image is displayed in a switching manner. Wherein the third operation is an operation for triggering switching of the display of the second sub-image.
Specifically, after the electronic device generates the corresponding second sub-images based on the at least two matched second image styles, the electronic device may display any one of the second sub-images on the display screen first, or may display the second sub-images corresponding to the high personal use frequency preferentially from high to low according to the personal use frequency corresponding to the second image style.
Then, the electronic device waits for receiving a third operation of the user on the user interface, and switches to display another second sub-image in response to the third operation.
For example, referring to fig. 13, the third operation may be a slide operation, and the electronic device switches to display a different second sub-image in response to the user's slide operation. That is, the electronic device can receive a sliding operation of the user within the viewfinder 20. When the user slides to the left, the electronic device displays the next second sub-image; when the user slides to the right, the electronic device displays the previous second sub-image.
In some embodiments, referring to fig. 14, in order to meet the requirement that the user can turn off the personalization process, after S505, the image processing method further includes:
s506, the electronic equipment receives a fourth operation of the user on the user interface.
S507, the electronic device displays a first image of the first image style on the user interface in response to the fourth operation.
Specifically, the fourth operation is a click operation for triggering the electronic device to close the personalized style process, that is, for triggering the electronic device to restore the image style of the image in the user interface. The first operation is a click operation for triggering the electronic device to perform personalized style processing on the first image. It is understood that the event triggered by the fourth operation is opposite to the event triggered by the first operation. The electronic device, in response to a fourth operation by the user, turns off the personalized style process and resumes displaying the original image (i.e., resumes displaying the first image with the first image style on the user interface).
The fourth operation may also be, for example, a user click operation on an "image style processing" button/control on the electronic device. Referring to fig. 15, the "image style processing" button/control may also be an "AI" button/control, and the "AI" button/control is currently in an open state. Therefore, when the electronic device receives the closing click operation (fourth operation) of the "AI" button/control by the user in the "AI" button/control on state, the electronic device closes the personalized style processing of the person image in the preview interface viewfinder 20 shown in fig. 15, and redisplays the original first image with the first image style (refer to fig. 2).
In some embodiments, as shown in fig. 16, a flow diagram of an image processing method is provided. Hereinafter, an image processing method according to an embodiment of the present application will be described by taking fig. 16 as an example.
Specifically, after displaying the first image of the first image style on the user interface, the electronic device waits to receive the first operation of the user. After the electronic device receives the first operation of the user on the user interface, it responds to the first operation by extracting the image content of the first image. The extraction of the image content can be realized by calling the first multimodal large language model: after the first image is input into the first multimodal large language model, the content code output by the image encoder and the image-text alignment layer in the first multimodal large language model is obtained as the image content of the first image. Referring to fig. 16, the extracted image content is "outdoor + person".
The electronic device then uses the image content to select a second image style from the local style gallery. The electronic device first matches the candidate image content with the highest similarity to the image content, and takes the image styles corresponding to that candidate image content as the candidate image styles. Several second image styles are then selected based on the personal use frequency of the candidate image styles. Referring to fig. 16, the second image styles matched for the image content "outdoor+person" include "vivid" and "clear".
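Continuing the sketch, the candidate image styles can be ranked by the user's normalized personal use frequency, optionally blended with a public use frequency as described elsewhere in this application; the weighting factor, the "retro" style, and the example counts below are illustrative assumptions:

```python
def select_second_styles(candidate_styles, personal_counts, public_freq=None,
                         personal_weight=0.7, top_k=2):
    """Rank candidate image styles by normalized personal use frequency,
    optionally blended with a public use frequency, and keep the top_k."""
    total = sum(personal_counts.get(s, 0) for s in candidate_styles) or 1
    scores = {}
    for style in candidate_styles:
        personal = personal_counts.get(style, 0) / total  # normalized personal frequency
        if public_freq is not None:
            # Weighted blend of personal and public use frequency (weights are assumed).
            scores[style] = personal_weight * personal + (1 - personal_weight) * public_freq.get(style, 0.0)
        else:
            scores[style] = personal
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example for the image content "outdoor+person":
print(select_second_styles(["vivid", "clear", "retro"],
                           personal_counts={"vivid": 12, "clear": 7, "retro": 1}))
# -> ['vivid', 'clear']
```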
Finally, the electronic device inputs the first image and the matched second image style into the trained first image generation model; the first image generation model generates and outputs the second image of the second image style, and the second image is displayed on the user interface. Referring to fig. 16, the embodiment of the present application is illustrated for the case where the image acquired by the camera is unchanged before and after the image style conversion, that is, the second image and the first image have the same image content, namely the person shown in fig. 16.
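As a final piece of the sketch (the `generation_model.generate` interface and the prompt wording are assumptions; the actual first image generation model is the trained model described in this application), the selected style can condition the restyling of the first image:

```python
def render_second_image(first_image, second_style, generation_model):
    """Restyle the first image into the selected second image style.
    generation_model stands in for the trained first image generation model."""
    # The style label is mapped to a conditioning description; the wording is assumed.
    style_prompt = f"a photo in the '{second_style}' style"
    return generation_model.generate(image=first_image, prompt=style_prompt)

# Illustrative usage: second_image = render_second_image(first_image, "vivid", model)
```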
In addition, after the electronic device displays the second image of the second image style on the user interface, the electronic device may also respond to the user sliding left or right within the viewfinder, that is, the electronic device switches to display the next second image in response to the third operation. For example, the electronic device first displays the second image of the "vivid" image style; after receiving the third operation of the user, the electronic device displays the second image of the "clear" image style in response to the third operation.
Meanwhile, if the user is not satisfied with any of the recommended second image styles, the electronic device may also, in response to the fourth operation of the user, turn off the personalized style processing and redisplay the first image of the first image style.
Another embodiment of the present application provides an electronic device, comprising a memory, a display screen, and one or more processors, wherein the memory and the display screen are coupled to the processors; the display screen is used for displaying a user interface; the memory stores one or more computer program codes, and the computer program codes comprise computer instructions; the computer instructions, when executed by the processors, cause the electronic device to implement the image processing method described in any of the embodiments above.
Another embodiment of the present application provides a computer readable storage medium storing a computer program, which when executed by a processor in an electronic device, causes the electronic device to implement the image processing method according to any one of the above embodiments.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the functions or steps of the method embodiments described above.
It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. An image processing method, characterized by being applied to an electronic device, in which a plurality of candidate image contents are stored, and each of the candidate image contents corresponds to at least one image style, the method comprising:
displaying a first image of a first image style at a user interface;
receiving a first operation of a user on the user interface, wherein the first operation is used for triggering the electronic equipment to convert the image style of the image in the user interface;
extracting image content of the first image in response to the first operation, and taking at least one image style corresponding to the candidate image content with the highest similarity to the image content of the first image as a candidate image style;
obtaining the preference degree of the user for the candidate image styles, and selecting a second image style from the candidate image styles according to the preference degree;
displaying a second image of the second image style at the user interface;
the user interface is a large-image display page of a gallery in the electronic device, the first image is an image in the gallery, and the second image is an image obtained by converting the first image from the first image style to the second image style; or the user interface is a preview interface of an image shot by the electronic device, the first image is obtained by processing, in the first image style, an image acquired in real time by a camera of the electronic device, and the second image is obtained by processing, in the second image style, an image acquired in real time by the camera of the electronic device.
2. The method of claim 1, wherein the user's preference for the candidate image style is the personal use frequency of the candidate image style; the electronic device also stores a personal use frequency of each image style corresponding to each candidate image content;
after the user interface displays the second image of the second image style, the method further comprises:
receiving a second operation of the user on the user interface, wherein the second operation is used for triggering the electronic device to store a second image of the second image style;
and updating the saved personal use frequency of the second image style in response to the second operation.
3. The method of claim 1, wherein a first multimodal large language model is included in the electronic device, the first multimodal large language model being a trained second multimodal large language model;
the extracting the image content of the first image includes:
inputting the first image into the first multimodal large language model, and taking the content code output after processing by the image encoder and the image-text alignment layer in the first multimodal large language model as the image content of the first image.
4. The method of any of claims 1-3, wherein a first multimodal large language model is included in the electronic device, the first multimodal large language model being a trained second multimodal large language model; the first multimodal large language model comprises an image encoder, an image-text alignment layer, and a large language model;
the method further comprises the steps of:
extracting content codes of candidate images by using the image encoder and the image-text alignment layer in the first multimodal large language model, clustering the content codes, and performing content labeling on each clustered class to obtain the candidate image contents;
and extracting text features of candidate style description texts by using the large language model in the first multimodal large language model, performing image-text alignment processing on the text features by using a linear projection layer to obtain style codes, clustering the style codes, and performing style labeling on each clustered class to obtain the image styles corresponding to the candidate image contents.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
after carrying out classification training of style decision factors on the second multimodal large language model, training the image content extraction capability of the second multimodal large language model to obtain the first multimodal large language model;
wherein the style decision factors comprise one or more of environment, light, and theme; the second multimodal large language model comprises an image encoder, an image-text alignment layer, and a large language model; the output of the image encoder is the input of the image-text alignment layer, and the output of the image-text alignment layer is the input of the large language model.
6. The method of claim 2, wherein obtaining the personal use frequency of the candidate image style by the user comprises:
counting the user's personal use count of each candidate image style corresponding to the candidate image content;
and performing normalization calculation on the personal use counts of the candidate image styles to obtain the personal use frequency of each candidate image style.
7. The method of claim 2 or 6, wherein the electronic device further stores a public use frequency of each image style corresponding to each candidate image content; the public use frequency is obtained by performing normalization calculation after counting the public use count of each candidate image style corresponding to the candidate image content;
the obtaining the personal use frequency of the candidate image style by the user comprises:
and performing weighted calculation on the public use frequency and the personal use frequency corresponding to the same candidate image style, and taking the obtained weighted frequency as the final personal use frequency of the candidate image style.
8. The method of any of claims 1-7, wherein a first image generation model is included in the electronic device, the first image generation model being a trained second image generation model;
the displaying the second image of the second image style on the user interface includes:
and inputting the image in the gallery, or the image acquired in real time by the camera of the electronic device, together with the second image style into the first image generation model to obtain the second image and display the second image on the user interface.
9. The method of claim 8, wherein the method further comprises:
training the second image generation model in a first stage by using the noise image and the style description text to obtain a first stage image generation model;
training the first-stage image generation model in a second stage by using the first training image and the style description text to obtain a second-stage image generation model;
training the second-stage image generation model in a third stage by using a second training image and the style description text to obtain the first image generation model; wherein the image style of the second training image is opposite to the image style described by the style description text.
10. The method according to any one of claims 1-9, wherein at least two second image styles are selected from the candidate image styles according to the preference degree, the second image comprising at least two second sub-images, the second sub-images corresponding to different ones of the second image styles;
the displaying the second image of the second image style on the user interface includes:
displaying one of the second sub-images on the user interface;
receiving a third operation of the user on the user interface, wherein the third operation is used for triggering the electronic device to switch the displayed second sub-image;
and in response to the third operation, switching to display another second sub-image.
11. The method of any of claims 1-10, wherein after the user interface displays the second image of the second image style, the method further comprises:
receiving a fourth operation of the user on the user interface, wherein the fourth operation is used for triggering the electronic device to restore the image style of the image in the user interface;
and in response to the fourth operation, displaying a first image of the first image style on the user interface.
12. An electronic device, comprising: a memory, a display screen, and one or more processors, the memory and the display screen being coupled to the processors; the display screen is used for displaying a user interface, the memory stores one or more computer program codes, and the computer program codes comprise computer instructions; the computer instructions, when executed by the processors, cause the electronic device to perform the image processing method of any of claims 1-11.
13. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor of an electronic device, causes the electronic device to perform the image processing method according to any one of claims 1-11.
CN202311044671.3A 2023-08-17 2023-08-17 Image processing method, electronic device, and computer-readable storage medium Pending CN117729421A (en)

Priority Applications (1)

Application Number: CN202311044671.3A
Priority Date: 2023-08-17
Filing Date: 2023-08-17
Title: Image processing method, electronic device, and computer-readable storage medium


Publications (1)

Publication Number: CN117729421A
Publication Date: 2024-03-19

Family

ID=90202178

Family Applications (1)

Application Number: CN202311044671.3A
Title: Image processing method, electronic device, and computer-readable storage medium
Priority Date: 2023-08-17
Filing Date: 2023-08-17
Status: Pending

Country Status (1)

Country Link
CN (1) CN117729421A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination