CN116304146B

CN116304146B - Image processing method and related device

Info

Publication number: CN116304146B
Application number: CN202310572984.XA
Authority: CN
Inventors: 李宇; 蒋雪涵
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2023-10-20
Anticipated expiration: 2043-05-22
Also published as: CN116304146A; CN117633269A

Abstract

The embodiment of the application provides an image processing method and a related device, and relates to the technical field of terminals. The method comprises the following steps: the electronic equipment receives text input by a user at a first interface; the electronic equipment displays a first image file and a second image file in a distinguishing mode, wherein the first image file is an image file matched with keywords in a text in the image file to be processed, and the second image file is an image file not matched with the keywords in the text in the image file to be processed; the first image file is determined based on a text and an image file to be processed for a first model in the electronic equipment, and the first model is obtained by training a sample set according to graphics context; the image-text pair sample set comprises: the sample image includes text corresponding to the sample image, one or more objects in the sample image, and text corresponding to each of the one or more objects. Therefore, the text input by the user can be better matched with the text description corresponding to the image file, and the association between the text and the image file is realized.

Description

Image processing method and related device

Technical Field

The present application relates to the field of terminal technologies, and in particular, to an image processing method and a related device.

Background

With the development of multimedia technology, some electronic devices may provide a one-key-pad function, where a one-key-pad may generate different styles, colors, or modification effects for an image file selected by a user.

However, one-touch sheeting does not support user input of custom documents and associates image files with documents.

Disclosure of Invention

The image processing method and the related device provided by the embodiment of the application can be used for carrying out the model training of image-text matching in advance, supporting the receiving of the text input by the user in the one-key-sheet forming interface and associating the image file with the text.

In a first aspect, an image processing method provided by an embodiment of the present application includes:

the electronic equipment receives text input by a user at a first interface; the electronic equipment displays a first image file and a second image file in a distinguishing mode, wherein the first image file is an image file matched with keywords in a text in the image file to be processed, and the second image file is an image file not matched with the keywords in the text in the image file to be processed; the first image file is determined based on a text and an image file to be processed for a first model in the electronic equipment, and the first model is obtained by training a sample set according to graphics context; the image-text pair sample set comprises: the sample image includes text corresponding to the sample image, one or more objects in the sample image, and text corresponding to each of the one or more objects. In this way, the electronic device can support receiving text input by a user in the interface, and can match the text input by the user with the image file to be processed, so that the association between the text and the image file is realized.

In one possible implementation, the text corresponding to the sample image is obtained by: identifying one or more objects in the sample image; obtaining text labels of one or more objects; and obtaining a text corresponding to the sample image based on the text labels of the one or more objects and the source text of the sample image, wherein the source text of the sample image is a text which is obtained in advance and used for describing the sample image, and keywords in the source text of the sample image are less than keywords in the text corresponding to the sample image. In this way, the input text can be better matched with the text corresponding to the sample image.

In a possible implementation, the method further includes: updating a first model according to an image-text pair formed by a target object and a target text label of the target object, wherein the target object and the target text label are obtained in advance in a target application, and the updated first model has the capability of matching the target object with the target text label. In this way, the updated first model can be provided with the capability of matching the target object with the target text label, so that the image-text pair marked by the user can be identified.

In a possible implementation, the first model is obtained based on a second model, where the second model includes a model obtained by training a sample set based on graphics context, and the method further includes: acquiring a third image file in the target application, wherein the third image file comprises a target object; uploading relevant data of a third image file to the second model, the relevant data of the third image file comprising: the image of the target object, the target text label and the image file obtained after the target object is removed from the third image file; training the second model based on the related data of the third image file to obtain a text corresponding to the third image file; the first model is updated based on the text corresponding to the third image file, and the updated first model has the capability of matching the third image file with the text corresponding to the third image file. In this way, the updated first model may be enabled with the ability to match the third image file with text corresponding to the third image file.

In one possible implementation, the target object comprises an image of a person and the target text label comprises a title of the person. In this way, pairs of text associated with the user-marked person can be identified, so that the first model can describe the image file in more detail and accurately.

In a possible implementation, the method further includes: the electronic equipment responds to the operation of triggering the second image file by the user, and displays a second interface, wherein the second interface comprises: and the information is used for prompting the second image file to be an image file which is not matched with the text in the image file to be processed. Therefore, the user can know that a certain image is not matched with the text in time by displaying the second interface, and the user can process the unmatched image, so that user experience is improved.

In a possible implementation, the second interface further includes a first button for canceling the display of the second image file; the method further comprises the steps of: and the electronic equipment responds to the operation of triggering the first button by the user, and displays a third interface which does not display the second image file. Therefore, the user can determine whether the unmatched image files are deleted or not, whether the second image files are unmatched image files or not can be further determined, and user experience is improved.

In a possible implementation, the second image file includes a second target image file, and the second interface further includes a second button for the user to hold the second target image file; the method further comprises the steps of: and the electronic equipment responds to the operation of triggering the second button by the user, and displays a fourth interface which displays the second target image file. Therefore, the user can determine whether the unmatched image files are deleted or not, whether the second image files are unmatched image files or not can be further determined, and user experience is improved.

In a possible implementation, the first image file includes a first target image file, the first target image file is adjacent to the second target image file in a first direction, the text includes a first keyword and a second keyword, the first keyword and the second keyword are adjacent in the first direction, the first target image file and the first keyword are matched pairs of graphics and texts, and the method further includes: and updating the first model according to the second target image file and the second keyword, wherein the updated first model learns the capability of matching the second target image file with the second keyword. In this way, the updated first model may be learned with the ability of the second target image file to match the second keyword.

In a possible implementation, before updating the first model according to the second target image file and the second keyword, the method further includes: generating a random number by a first model; updating the first model according to the second target image file and the second keyword, including: if the random number is larger than or equal to the preset value, the first model is updated according to the second target image file and the second keyword. In this way, the first model can learn the second keyword in iteration with a certain probability, so that matching between the second target image file and the second keyword is achieved.

In a possible implementation, before the first interface receives the text input by the user, the electronic device further includes: the electronic equipment displays a fifth interface, wherein the fifth interface comprises an image file and a third button, and the image file is in a state that the image file cannot be selected; the electronic equipment responds to the operation of triggering the third button by a user, and a sixth interface is displayed, wherein the sixth interface comprises an image file and a fourth button, and the image file is in a selectable state; the electronic equipment responds to the user to select an image file to be processed from the image files of the sixth interface, and triggers the operation of a fourth button, and the first interface is displayed, wherein the first interface comprises an area for displaying the image file to be processed, a text display area, a text input area and a fifth button; the electronic device receives text input by a user at a first interface, comprising: the electronic equipment receives text input by a user in a text input area of a first interface; after the electronic device receives the text input by the user at the first interface, the electronic device further comprises: the electronic equipment displays the text in a text display area of the first interface; the electronic device displays the first image file and the second image file in a distinguishing way, and the electronic device comprises: the electronic device responds to the operation of triggering the fifth button by the user, and the electronic device displays the first image file and the second image file in a distinguishing mode in the seventh interface. Thus, through the interfaces, the implementation of the application can support receiving the text input by the user in the one-key-pad interface and realize the association of the image file and the text.

In one possible implementation, the first model is a model obtained by compressing a second model, and the second model includes a model obtained by training a sample set through graphics and texts by a multi-modal contrast learning method. In this way, the compression method can reduce the size of the model while maintaining higher learning accuracy, and save the memory space, and in addition, the output result of the second model can be close to the correct value through the multi-mode comparison learning method, so that the image-text pairs can be matched relatively accurately.

In a second aspect, a method for generating video provided by an embodiment of the present application includes:

the electronic equipment responds to the operation of the user for indicating to generate the video file, and the target video is generated by adopting the first image file obtained by the image processing method provided by the embodiment of the application. In this way, the generated video can comprise the matched image and text, and a better video effect is displayed for the user.

In a third aspect, a video display method provided by an embodiment of the present application includes:

the electronic equipment displays an eighth interface, wherein the eighth interface comprises an image file and a sixth button, and the image file is in a state that the image file cannot be selected; the electronic equipment responds to the operation of triggering the sixth button by a user, and displays a ninth interface, wherein the ninth interface comprises an image file and a seventh button, and the image file is in a selectable state; the electronic equipment responds to the user to select an image file to be processed from the image files of the ninth interface, and triggers the operation of a seventh button, and a tenth interface is displayed, wherein the tenth interface comprises an area for displaying the image file to be processed, a text display area, a text input area and an eighth button; the electronic equipment receives text input by a user in a text input area of a tenth interface; the electronic equipment displays the text in a text display area of the tenth interface; the electronic equipment responds to the operation of triggering the eighth button by a user to display an eleventh interface, the eleventh interface comprises a ninth button, the electronic equipment displays a fourth image file and a fifth image file in the eleventh interface in a distinguishing mode, wherein the fourth image file is an image file matched with keywords in a text in the image file to be processed, and the fifth image file is an image file not matched with the keywords in the text in the image file to be processed; the electronic device displays a video file including a fourth image file in response to a user triggering operation of the ninth button. Thus, the video is generated according to the matched image-text pairs, and the text input by the user can be better matched with the image file, so that the association between the text and the image in the video is realized.

In one possible implementation, the fourth image file is determined based on the text and the image file to be processed for a first model in the electronic device, and the first model is obtained by training a sample set according to the graphics context; the image-text pair sample set comprises: the sample image includes text corresponding to the sample image, one or more objects in the sample image, and text corresponding to each of the one or more objects. The text corresponding to the sample image is obtained by the following method: identifying one or more objects in the sample image; obtaining text labels of one or more objects; and obtaining a text corresponding to the sample image based on the text labels of the one or more objects and the source text of the sample image, wherein the source text of the sample image is a text which is obtained in advance and used for describing the sample image, and keywords in the source text of the sample image are less than keywords in the text corresponding to the sample image. In this way, the input text can be better matched with the text corresponding to the sample image.

In one possible implementation, the first model is updated according to a graphic-text pair formed by a target object and a target text label of the target object, wherein the target object and the target text label are obtained in advance in a target application, and the updated first model has the capability of matching the target object with the target text label. In this way, the updated first model can be provided with the capability of matching the target object with the target text label, so that the image-text pair marked by the user can be identified.

In a possible implementation, the method further includes: the electronic device displays a twelfth interface in response to the user triggering the operation of the fifth image file, the twelfth interface comprising: and the information is used for prompting the fifth image file to be an image file which is not matched with the text in the image files to be processed. Therefore, the user can know that a certain image is not matched with the text in time by displaying the second interface, and the user can process the unmatched image, so that user experience is improved.

In a possible implementation, the twelfth interface further includes a tenth button for canceling the display of the fifth image file; the method further comprises the steps of: the electronic device displays a thirteenth interface in response to the user triggering operation of the tenth button, the thirteenth interface not displaying the fifth image file. Therefore, the user can determine whether the unmatched image files are deleted or not, whether the second image files are unmatched image files or not can be further determined, and user experience is improved.

In a possible implementation, the fifth image file includes a fourth target image file, and the twelfth interface further includes an eleventh button for the user to hold the fourth target image file; the method further comprises the steps of: the electronic device displays a fourteenth interface in response to the user triggering operation of the eleventh button, the fourteenth interface displaying the fourth target image file. Therefore, the user can determine whether the unmatched image files are deleted or not, whether the second image files are unmatched image files or not can be further determined, and user experience is improved.

In a possible implementation, the fourth image file includes a third target image file, the third target image file is adjacent to the fourth target image file in the second direction, the text includes a third keyword and a fourth keyword, the third keyword and the fourth keyword are adjacent in the second direction, the third target image file and the third keyword are matched image-text pairs, and the method further includes: and updating the first model according to the fourth target image file and the fourth keyword, wherein the updated first model learns the capability of matching the fourth target image file with the fourth keyword. In this way, the updated first model may be learned with the ability of the second target image file to match the second keyword.

In a possible implementation, before updating the first model according to the fourth target image file and the fourth keyword, the method further includes: generating a random number by a first model; updating the first model according to the fourth target image file and the fourth keyword, including: and if the random number is greater than or equal to the preset value, updating the first model according to the fourth target image file and the fourth keyword. In this way, the first model can learn the second keyword in iteration with a certain probability, so that matching between the second target image file and the second keyword is achieved.

It should be noted that, in the twelfth interface, the thirteenth interface, the fourteenth interface, the tenth button, the eleventh button, the third target image file, the fourth target image file, the third keyword, the fourth keyword, and the second direction are merely named distinction, in a specific embodiment, the twelfth interface may refer to the second interface, the tenth button may refer to the first button, the thirteenth interface may refer to the third interface, the fourth target image file may refer to the second target image file, the eleventh button may refer to the second button, the fourteenth interface may refer to the fourth interface, the third target image file may refer to the first target image file, the second direction may refer to the first direction, the third keyword may refer to the first keyword, and the fourth keyword may refer to the second keyword.

In a fourth aspect, an embodiment of the present application provides an apparatus for image processing, where the apparatus may be an electronic device, or may be a chip or a chip system in an electronic device. The apparatus may include a processing unit and a display unit. The processing unit is configured to implement any method related to processing performed by the electronic device in the first aspect or any of the possible implementations of the first aspect. The display unit is configured to implement any method performed by the electronic device in the first aspect or any possible implementation manner of the first aspect in relation to the display. When the apparatus is an electronic device, the processing unit may be a processor. The apparatus may further comprise a storage unit, which may be a memory. The storage unit is configured to store instructions, and the processing unit executes the instructions stored in the storage unit, so that the electronic device implements the method described in the first aspect or any one of the possible implementation manners of the first aspect. When the apparatus is a chip or a system of chips within an electronic device, the processing unit may be a processor. The processing unit executes instructions stored by the storage unit to cause the electronic device to implement the method described in the first aspect or any one of the possible implementations of the first aspect. The memory unit may be a memory unit (e.g., a register, a cache, etc.) within the chip, or a memory unit (e.g., a read-only memory, a random access memory, etc.) within the electronic device that is external to the chip.

Illustratively, a processing unit is configured to receive text input by a user; the display unit is used for displaying the first interface and distinguishing and displaying the first image file and the second image file.

In a possible implementation, the processing unit is configured to identify one or more objects in the sample image; and also for obtaining text labels for one or more objects; the method is particularly used for obtaining the text corresponding to the sample image based on the text labels of one or more objects and the source text of the sample image.

In a possible implementation manner, the processing unit is configured to update the first model according to the target object and a graphic pair formed by a target text label of the target object.

In a possible implementation manner, the processing unit is configured to obtain a third image file in the target application, and further is configured to upload relevant data of the third image file to the second model, specifically, further is configured to perform training based on the relevant data of the third image file, obtain a text corresponding to the third image file, and specifically, further is configured to update the first model based on the text corresponding to the third image file.

In one possible implementation, the target object comprises an image of a person and the target text label comprises a title of the person.

In a possible implementation, the display unit is configured to display the second interface.

In a possible implementation, the display unit is configured to display the third interface.

In a possible implementation, the display unit is configured to display the fourth interface.

In a possible implementation manner, the processing unit is configured to update the first model according to the second target image file and the second keyword.

In a possible implementation, the processing unit is configured to generate a random number.

In a possible implementation manner, the display unit is configured to display the fifth interface, display the sixth interface, and display the seventh interface.

In a possible implementation manner, the processing unit is configured to compress the second model to obtain a first model, and further is configured to train the sample set to obtain the second model by using a multi-mode contrast learning method.

In a fifth aspect, an embodiment of the present application provides an apparatus for generating video, where the apparatus may be an electronic device, or may be a chip or a chip system in the electronic device. The apparatus may include a processing unit and a display unit. The processing unit is configured to implement any method related to processing performed by the electronic device in the second aspect or any possible implementation of the second aspect. The display unit is configured to implement any method related to display performed by the electronic device in the second aspect or any possible implementation of the second aspect. When the apparatus is an electronic device, the processing unit may be a processor. The apparatus may further comprise a storage unit, which may be a memory. The storage unit is configured to store instructions, and the processing unit executes the instructions stored in the storage unit, so that the electronic device implements the method described in the second aspect or any one of possible implementation manners of the second aspect. When the apparatus is a chip or a system of chips within an electronic device, the processing unit may be a processor. The processing unit executes instructions stored by the storage unit to cause the electronic device to implement the method described in the second aspect or any one of the possible implementations of the second aspect. The memory unit may be a memory unit (e.g., a register, a cache, etc.) within the chip, or a memory unit (e.g., a read-only memory, a random access memory, etc.) within the electronic device that is external to the chip.

Illustratively, the processing unit is configured to respond to an operation of a user and further configured to generate a target video; and the display unit is used for displaying the target video.

In a sixth aspect, an embodiment of the present application provides a video display apparatus, where the apparatus may be an electronic device, or may be a chip or a chip system in the electronic device. The apparatus may include a processing unit and a display unit. The processing unit is configured to implement any method related to processing performed by the electronic device in the third aspect or any possible implementation of the third aspect. The display unit is configured to implement any method related to display performed by the electronic device in the third aspect or any possible implementation of the third aspect. When the apparatus is an electronic device, the processing unit may be a processor. The apparatus may further comprise a storage unit, which may be a memory. The storage unit is configured to store instructions, and the processing unit executes the instructions stored in the storage unit, so that the electronic device implements the method described in the third aspect or any one of the possible implementations of the third aspect. When the apparatus is a chip or a system of chips within an electronic device, the processing unit may be a processor. The processing unit executes instructions stored by the storage unit to cause the electronic device to implement the method described in the third aspect or any one of the possible implementations of the third aspect. The memory unit may be a memory unit (e.g., a register, a cache, etc.) within the chip, or a memory unit (e.g., a read-only memory, a random access memory, etc.) within the electronic device that is external to the chip.

Illustratively, a processing unit is configured to receive text input by a user; the display unit is used for displaying an eighth interface, displaying a ninth interface, displaying a tenth interface and displaying an eleventh interface, and is also used for distinguishing and displaying a fourth image file and a fifth image file, and particularly is also used for displaying a video file.

In a possible implementation, the display unit is configured to display a twelfth interface.

In a possible implementation, the display unit is configured to display a thirteenth interface.

In a possible implementation manner, the display unit is configured to display a fourteenth interface.

In a possible implementation manner, the processing unit is configured to update the first model according to the fourth target image file and the fourth keyword.

In a seventh aspect, embodiments of the present application provide an electronic device, comprising a processor and a memory, the memory being configured to store code instructions, the processor being configured to execute the code instructions to perform the method of the first aspect or any one of the possible implementations of the first aspect, and/or the method of the second aspect or any one of the possible implementations of the second aspect, and/or the method of the third aspect or any one of the possible implementations of the third aspect.

In an eighth aspect, embodiments of the present application provide a computer readable storage medium having stored therein a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect, and/or the method of the second aspect or any one of the possible implementations of the second aspect, and/or the method of the third aspect or any one of the possible implementations of the third aspect.

In a ninth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect, and/or the method of the second aspect or any one of the possible implementations of the third aspect, and/or the method of any one of the possible implementations of the third aspect.

In a tenth aspect, the present application provides a chip or chip system comprising at least one processor and a communication interface, the communication interface and the at least one processor being interconnected by wires, the at least one processor being adapted to execute a computer program or instructions to perform the method of the first aspect or any one of the possible implementations of the first aspect, and/or the method of the second aspect or any one of the possible implementations of the second aspect, and/or the method of the third aspect or any one of the possible implementations of the third aspect. The communication interface in the chip can be an input/output interface, a pin, a circuit or the like.

In one possible implementation, the chip or chip system described above further includes at least one memory, where the at least one memory has instructions stored therein. The memory may be a memory unit within the chip, such as a register, a cache, etc., or may be a memory unit of the chip (e.g., a read-only memory, a random access memory, etc.).

It should be understood that the technical solutions in the embodiments of the present application may correspond to each other, and the beneficial effects obtained by each aspect and the corresponding possible implementation manner are similar, and are not repeated.

Drawings

FIG. 1 is a schematic diagram of an interface for displaying one-touch pad buttons according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an interface for selecting an image file according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an interface for receiving input text according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an interface for displaying input text and intelligent clipping buttons according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an intelligent clipping interface and a one-touch interface according to an embodiment of the present application;

FIG. 6 is a schematic diagram of an interface for a frame prompt according to an embodiment of the present application;

FIG. 7 is a schematic diagram of another intelligent clipping interface and one-touch-and-tablet interface according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a one-touch tablet function according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a graphic data set construction according to an embodiment of the present application;

FIG. 10 is a training schematic diagram of a multi-modal pre-training model according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a learning process of person titles according to an embodiment of the present application;

FIG. 12 is a schematic diagram of a learning process of a special vocabulary according to an embodiment of the present application;

FIG. 13 is a schematic diagram of an image processing method according to an embodiment of the present application;

fig. 14 is a schematic structural diagram of a chip according to an embodiment of the present application.

Detailed Description

In order to facilitate the clear description of the technical solutions of the embodiments of the present application, the following simply describes some terms and techniques involved in the embodiments of the present application:

1. terminology

In embodiments of the present application, the words "first," "second," and the like are used to distinguish between identical or similar items that have substantially the same function and effect. For example, the first chip and the second chip are merely for distinguishing different chips, and the order of the different chips is not limited. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ.

It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.

2. Electronic equipment

The electronic device in the embodiment of the application can also be any form of terminal device, for example, the electronic device can include a handheld device, a vehicle-mounted device and the like. For example, some electronic devices are: a mobile phone, tablet, palm, notebook, mobile internet device (mobile internet device, MID), wearable device, virtual Reality (VR) device, augmented reality (augmented reality, AR) device, wireless terminal in industrial control (industrial control), wireless terminal in unmanned (self driving), wireless terminal in teleoperation (remote medical surgery), wireless terminal in smart grid (smart grid), wireless terminal in transportation security (transportation safety), wireless terminal in smart city (smart city), wireless terminal in smart home (smart home), cellular phone, cordless phone, session initiation protocol (session initiation protocol, SIP) phone, wireless local loop (wireless local loop, WLL) station, personal digital assistant (personal digital assistant, PDA), handheld device with wireless communication function, public computing device or other processing device connected to wireless modem, vehicle-mounted device, wearable device, electronic device in the 5G network or evolving land mobile network (public land mobile network), and the like, without limiting the application.

By way of example, and not limitation, in embodiments of the application, the electronic device may also be a wearable device. The wearable device can also be called as a wearable intelligent device, and is a generic name for intelligently designing daily wear by applying wearable technology and developing wearable devices, such as glasses, gloves, watches, clothes, shoes and the like. The wearable device is a portable device that is worn directly on the body or integrated into the clothing or accessories of the user. The wearable device is not only a hardware device, but also can realize a powerful function through software support, data interaction and cloud interaction. The generalized wearable intelligent device includes full functionality, large size, and may not rely on the smart phone to implement complete or partial functionality, such as: smart watches or smart glasses, etc., and focus on only certain types of application functions, and need to be used in combination with other devices, such as smart phones, for example, various smart bracelets, smart jewelry, etc. for physical sign monitoring.

In addition, in the embodiment of the application, the electronic equipment can also be electronic equipment in an internet of things (internet of things, ioT) system, and the IoT is an important component of the development of future information technology, and the main technical characteristics of the IoT are that the article is connected with a network through a communication technology, so that the man-machine interconnection and the intelligent network of the internet of things are realized.

The electronic device in the embodiment of the application may also be referred to as: a User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user equipment, etc.

In an embodiment of the present application, the electronic device or each network device includes a hardware layer, an operating system layer running on top of the hardware layer, and an application layer running on top of the operating system layer. The hardware layer includes hardware such as a central processing unit (central processing unit, CPU), a memory management unit (memory management unit, MMU), and a memory (also referred to as a main memory). The operating system may be any one or more computer operating systems that implement business processes through processes (processes), such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a windows operating system. The application layer comprises applications such as a browser, an address book, word processing software, instant messaging software and the like.

With the development of multimedia technology, some electronic devices may provide a one-key-pad function, where the one-key-pad may include different graphic styles, such as ancient style, girl style, chinese style, etc. After selecting the image file to be edited, the user can select different styles in the one-touch function to generate corresponding different styles, colors or modification effects. Wherein the image file may comprise pictures and/or video. However, one-touch sheeting does not support user input of custom documents and associates image files with documents.

In view of this, the image processing method provided by the embodiment of the application can perform the model training of image-text matching in advance, support receiving the text input by the user in the one-key-pad interface, and correlate the image file with the text. In addition, the embodiment of the application can also perform finer-granularity object recognition on the image file and generate more detailed text description corresponding to the image file, so that the text input by a user can be better matched with the text description corresponding to the image file, and the association between the text and the image file is realized.

Taking the one-touch-enabled functions in the album application as an example, fig. 1 to 6 exemplarily illustrate interface displays and usage scenarios of the one-touch-enabled functions in the electronic device.

As shown in fig. 1, in an interface 101 of an album application, a function of a key sheet 102 may be displayed. It will be appreciated that the one-key sheet 102 may be displayed below the album interface 101, above or beside the album interface 101, in more functions 103 of the album interface 101, or in any position of the album interface in the form of a bubble, and the display position of the one-key sheet 102 is not limited in the embodiment of the present application. In addition, the icon style of the one-touch tablet 102 is not limited in the embodiment of the present application.

After receiving the operation of clicking the one-button-pad 102 by the user, the electronic device may enter the selection picture and video interface 201 in response to the clicking operation, as shown in fig. 2. The select picture and video interface 201 may display image files in the electronic device, and the user may select one or more image files. It will be appreciated that the selecting picture and video interface 201 may further include other related functions for manipulating the image file, for example, a preview function of the image file, where the preview function may facilitate the user's zooming in and viewing, and enhance the user experience, and the specific picture and video interface 201 includes other related functions, which are not limited by the embodiments of the present application.

After the user has selected the image file, the submit button 202 may be clicked, and the electronic device may enter the smart clip interface 301 as shown in fig. 3 in response to the click operation. The smart cut interface 301 may include an image display area 302, a document display area 303, an input document area 304, and a smart cut button 305.

Wherein the image display area 302 may display one or more image files, and a corresponding image file name may be displayed under each image file. The document display area 303 may display the contents of the document input by the user, and when the user does not input the document, the document display area 303 may display input prompt information, for example, the input prompt information may include "please input text: ". The input manuscript region 304 may include voice input and text input, and may also input expressions and the like.

It is understood that the smart cut button 305 may be in an impossible state before the user inputs the manuscript, wherein the impossible state and the clickable state of the smart cut button 305 may be distinguished by a difference in color, size, gray scale, etc. of the smart cut button 305. The click status display of the smart clip button 305 is specifically not limited in the embodiments of the present application.

After the user inputs the manuscript, as shown in fig. 4, in the smart clip interface 401, the manuscript content input by the user may be displayed in the manuscript display area 402, and the smart clip button 403 may be in a clickable state. When the user completes the document input, the smart cut button 403 may be clicked.

After the electronic device receives the operation of clicking the smart clip button 403 by the user, the electronic device may search for an image matching the text in response to the clicking operation. As shown in a of fig. 5, in the smart cut interface 501, after finding an image matching a text, when a user clicks on a sentence in the manuscript display area 503, the electronic device can identify the corresponding image in the image display area 502. For example, the electronic device may highlight the name of the corresponding image, which may include color, size, gray scale, thickening, and the like. The image identification mode is not limited in the embodiment of the application.

Illustratively, when the user clicks on the text of "play cell phone on bus today" in the manuscript display area 503, the electronic device may recognize that the text is aligned with image 2, and the name "image 2" of the corresponding image in the image display area 502 may be bolded.

Alternatively, when the user clicks on an image in the image display area 502, the electronic device may identify the corresponding document in the document display area 503. For example, the electronic device may perform a process such as highlighting a corresponding document, and the highlighting process may include performing a process such as color, size, gradation, thickening, and the like on the document. The embodiment of the application is not limited by the specific identification mode of the manuscript.

In addition, the smart clip interface 501 shown in fig. 5 a may further include a one-click button 504, and the user may click on the one-click button 504 to generate a video.

Upon receiving the operation of clicking the one-touch button 504 by the user, the electronic device may display the generated video 506 in the one-touch interface 505 as shown in b of fig. 5 in response to the clicking operation. The user may view video 506. The user may also edit the video 506, and in particular, the video 506 may be edited, which is not limited by the embodiment of the present application.

In addition, the electronic device may identify an extraneous image that may also be understood as an image that is not aligned with the document. As shown in fig. 6, in the image display area 602 of the smart clip interface 601, assuming that the image 1 is an irrelevant image, the electronic device may mark the name of the image 1 with a frame, or may mark the image 1 in other display modes, and the embodiment of the present application is not limited to the specific identification mode of the irrelevant image.

When the electronic device receives an operation that the user triggers the irrelevant image, the electronic device can play a frame to prompt the user whether to delete the irrelevant image. A popup window 603 may be displayed on the interface of the electronic device, and prompt information, a delete button, and a hold button for prompting the user whether to delete the irrelevant image may be displayed in the popup window 603. Wherein, the prompt information may include "does image XX not match, is it deleted? "; the hold button is used to cancel the display of the pop-up window 603 and hold the irrelevant image; the delete button is used to cancel the display of the pop-up window 603 and may delete or not display an irrelevant image.

For example, when the user activates the hold button, the electronic device may not display the pop-up window 603 and hold the unrelated image. When the user triggers the delete button, as shown in a of fig. 7, the electronic device may not display a popup window, and in the image display area 702 of the smart clip interface 701, image 1 may be deleted or not displayed as an irrelevant image.

In addition, the smart clip interface 701 shown in fig. 7 a may further include a one-click button 703, and the user may click on the one-click button 703 to generate a video.

Upon receiving an operation of clicking the one-touch pad button 703 by the user, the electronic device may display the generated video 705 in the one-touch pad interface 704 as shown in b of fig. 7 in response to the clicking operation. The user may view video 705. The user may also edit the video 705, and in particular, the video 705 may be operated, which is not limited by the embodiments of the present application.

It can be understood that, in the above-mentioned intelligent clipping interface 401 in fig. 4, after the user clicks the intelligent clipping button 403, when the electronic device performs matching between the text and the image, the electronic device may match the corresponding image for the text according to the corresponding relationship between the image and the text, and the corresponding relationship may be a pre-trained model.

The implementation of text-to-image matching in the one-touch tablet function described above will be described in detail below with reference to fig. 8. Specific implementations may include: s801, constructing an image-text data set; s802, training a multi-mode pre-training model; s803, compressing and distilling the multi-mode pre-training model; s804, end-side model personalized learning and updating.

It should be noted that, S801, S802, and S803 may be executed by a first electronic device, where the first electronic device may be an electronic device with a larger memory and a stronger computing capability, and may process large-scale data, for example, the first electronic device may be a computer, a cloud server, or the like. S804 may be performed by a second electronic device, which may be a small memory, limited computing power, and portable electronic device, for example, a mobile phone, a tablet, a wearable device, etc. Wherein the second electronic device may also be referred to as end side.

S801, constructing an image-text data set.

In one possible implementation, the embodiment of the application can acquire a plurality of groups of image-text pairs from the network by adopting a web crawler (web crawler) technology as an image-text data set, and the image-text pairs acquired from the network can simply describe objects in the images. For example, the text corresponding to the picture in fig. 9 may be "girl and big tree". It will be appreciated that the manner of acquiring the teletext data set is not limited to the manner of acquiring the teletext data set from the internet, nor is it limited to the technical means employed for acquiring the teletext data set, in particular the manner of acquiring the teletext data pairs, and embodiments of the application are not limited.

In another possible implementation, because the image-text data pair obtained from the internet is simpler in description of the image and cannot describe the object in the image in detail, the embodiment of the application can use the plurality of groups of image-text pairs obtained from the internet as the original image-text data pair, and the object detector can identify the object in the image of the original image-text data pair in detail to obtain the image-text data set with fine granularity. Wherein the fine-grained teletext data set may also be referred to as a fine-grained multi-modal teletext pair.

As shown in fig. 9, to detect a detailed object in a picture of an original image-text data pair, the first electronic device may identify the object in the picture by using an object detector, so as to obtain a fine-grained object. Wherein the object detector can identify an object that may be present in the picture and provide positional information of the object in the picture.

Further, the first electronic device may input the fine-grained object into a multi-mode zero sample picture classifier, and obtain a class name of the fine-grained object as a fine-grained language text by using a zero sample classification capability of the multi-mode zero sample picture classifier. The multi-modal zero-sample picture classifier may also be referred to as a multi-modal picture classifier or a pre-trained multi-modal picture classifier. As shown in fig. 9, the fine-grained language text obtained from the multimodal image classifier may include: girls, big trees, antennas, sun, etc.

After the fine-grained language text is obtained, the first electronic equipment can combine the language generating capability of the large-scale language model to strengthen and rewrite the text in the original image-text data pair, so as to obtain the image-text pair with enhanced text. Wherein the large language model may be a deep learning model trained using a large amount of text data, which may be used to generate language text. It will be appreciated that the text enhanced graphic pair may possess a richer language description than the original graphic pair. For example, the text corresponding to the original pair of graphics may be "girls and big trees", while the text corresponding to the text-enhanced pair of graphics may be "girls standing in front of big trees and antennas under the sun".

After obtaining the text enhanced graphics context pairs, the first electronic device may group the original picture, the enhanced text, the fine-grained objects obtained with the object detector, and the class names of the fine-grained objects obtained with the multi-modal picture classifier into a set of text enhanced fine-grained multi-modal graphics context pairs as part of a fine-grained graphics context data set.

In the embodiment of the application, after the processing of a plurality of groups of original image-text data pairs, a plurality of groups of text enhanced fine-granularity multi-mode image-text pairs can be obtained, so that a fine-granularity image-text data set can be constructed.

S802, training of the multi-mode pre-training model.

After the fine-grained graphic data set is constructed, the first electronic device can train the multi-mode pre-training model according to the multi-mode contrast learning framework.

Fig. 10 shows a training schematic of a multimodal pre-training model.

The first electronic device may input an image file to be trained and text into the multimodal pre-training model, where the image file may include pictures and/or videos, and the text may be replaced with voice, and embodiments of the present application are not limited.

The multimodal pre-training model can train the input image files and texts according to the multimodal contrast learning framework. In the multimodal pre-training model, the frame pictures corresponding to the input image files may be represented digitally. For example, the first electronic device may represent a frame picture with 128 numbers, and may also understand that the frame picture is represented with 128-dimensional vectors, where the 128-dimensional vectors may be referred to as a representation of a high-dimensional space of the frame picture. It will be appreciated that the high-dimensional spatial representation of text is similar and will not be described in detail.

The principle of the contrast learning framework can be that similar graphic representations are similar, and dissimilar graphic pair representations are far away. In the learning process, the first electronic device may calculate, according to the vector representations of the respective image-text pairs, a similarity between the frame picture and the text, where the similarity has a threshold value from 0 to 1. Wherein, 1 may represent that the image-text pairs are similar, and may be understood that the similarity of the image-text pairs is higher, and 0 may represent that the image-text pairs are far away, and may be understood that the similarity of the image-text pairs is lower.

For example, if the frame picture a and the text b belong to one image-text pair, the similarity of the image-text pair may meet a preset similarity threshold, which indicates that the similarity of the image-text pair is higher, and therefore, the similarity of the frame picture a and the text b may be 1. If the frame picture a and the text b do not belong to one image-text pair, the similarity of the image-text pair does not meet the preset similarity threshold, which indicates that the similarity of the image-text pair is lower, so that the similarity of the frame picture a and the text b can be 0. It may be understood that the preset similarity threshold may take different similarity thresholds, for example, the preset similarity threshold may take a value of 0.6, which is not limited in the embodiment of the present application.

After multi-mode contrast learning, the multi-mode pre-training model can gradually converge, and at this time, the error between the output result and the correct value of the multi-mode pre-training model is smaller, so that relatively accurate graph-text pair representation can be performed.

S803, compressing and distilling the multi-mode pre-training model.

After the multi-modal pre-training model converges, learning and updating of the model can be performed in the second electronic device. However, since the memory and the computing power of the second electronic device are limited, and the multi-mode pre-training model contains more parameters and occupies more memory, the multi-mode pre-training model can be compressed and distilled into a small multi-mode pre-training model, so that the number of parameters is reduced, the complexity of the model is reduced, and the pre-training model can be operated in the second electronic device.

In view of this, embodiments of the present application may employ a distillation-style model compression method to simulate the behavior of a larger model that has been trained by training a smaller model, which may be referred to as a small multi-modal pre-training model. Therefore, the model size can be reduced while higher learning accuracy is maintained, and the memory occupied space is saved.

For ease of description, the multimodal pre-training model before compression and distillation is subsequently referred to as the large model, and the small multimodal pre-training model after compression and distillation is referred to as the small model.

It will be appreciated that the small model may possess the ability of a large model to convert image files and text into vectors for high dimensional spatial representation, and that the small model may include an image small model and a text small model. The image small model can be a small model for processing pictures or videos, and the text small model can be a small model for processing texts or voices. Illustratively, a frame of picture is input to the image small model, and the image small model can output a vector representation corresponding to the frame of picture; a phrase is input to the text gadget, which can output a vector representation corresponding to the phrase.

S804, end-side model personalized learning and updating.

In the selection picture and video interface 201 of fig. 2 described above, when the user selects one or more image files, the second electronic device may input the one or more image files into an image gadget, which may output a vector representation of a frame picture corresponding to the image file. In the smart clip interface 401 of fig. 4, when the user completes the document input and clicks the smart clip button 403, the second electronic device may input the text edited by the user into a small text model, which may output a vector representation corresponding to the input text.

The second electronic equipment can learn and update multi-mode image-text matching through the small model, calculate the similarity degree of the image and the manuscript, and obtain the optimal matching of the image and the manuscript through a Hungary algorithm (hungarian algorithm). Further, after the user arbitrarily clicks the document sentence, the second electronic device may display an image matching the document sentence, or after the user arbitrarily clicks the image, the second electronic device may display a document matching the image. The related interface display may refer to the related description of the corresponding embodiment of fig. 5, and will not be repeated.

For an image that is not aligned with the manuscript, the electronic device may identify the image, and determine whether to delete the image by the user, and the related interface display may refer to the related description of the corresponding embodiment of fig. 6 and fig. 7, which is not repeated.

The embodiment of the application can also support the learning of the personalized expression, wherein the learning of the personalized expression can comprise the following steps: (1) Learning of expression of a person name, and (2) learning of expression of a special vocabulary. For example, the persona scale expression may include: the baby, sister and the like refer to and express the relationship of the characters; the special vocabulary expressions may include user-defined object names, etc.

(1) The person refers to the study of the expression.

Learning of the person scale expression may be understood as providing the small model of the second electronic device with learning capabilities for understanding the person scale mentioned in the user input manuscript. For convenience of description, the person to be learned is referred to as a target person name, and the person picture corresponding to the target person name is referred to as a target person picture.

By way of example, FIG. 11 shows a schematic learning diagram of a person-call expression in a personalized expression.

Learning of the persona scale expression may include: (a) obtaining a picture of the target person and a picture-text pair of the target person, (b) updating the target person, and (c) generating a text description with the target person, and (d) updating the end-side model.

(a) And acquiring the target character picture and the image-text pair called by the target character.

The second electronic device may obtain a graphic pair of a person and a name according to a correspondence between a target person picture and a target person name in the album application, for example, the target person picture 1101 of "sister" and the target person name 1102 of "sister" in fig. 11 may be used as one graphic pair. In a possible implementation, the second electronic device may identify at an idle time whether there are unknown target person titles and target person pictures in the album application. For example, the idle time may include a time when the user does not use the second electronic device, a time when the second electronic device is being charged, or a time when the second electronic device performs version update, and the specific idle time, which is not limited by the embodiment of the present application. If the target person name and the target person picture which are not learned exist, the second electronic device can call the related interface to acquire the image-text pair of the target person name and the target person picture from the album application. If there is no unknown target person name and target person picture, the image-text pair may not be acquired.

(b) Updating the target person name.

The second electronic device may extract frames from an image including the target person picture in the album application to obtain a person target person picture in the image. The frame extraction may include steps of performing interval clipping, stitching, or secondary synthesis on the frame pictures of the image by using the clipped frame pictures by the second electronic device.

In a possible implementation, the second electronic device may extract the pictures marked with the "sister" label in the album application to construct the image-text pair, input the image-text pair into the small model, obtain the vector of the "sister" target person picture and the vector of the "sister" target person call, set the similarity between the "sister" target person call and the "sister" target person picture to 1, and further update the person call about the "sister" in the small model of the second electronic device.

(c) A textual description is generated with the title of the target person.

After the target person picture is obtained, the second electronic device cannot generate the corresponding text description for the target person picture, and the first electronic device can generate the text description corresponding to the target person picture. Thus, the second electronic device may upload relevant data related to the target person picture to the first electronic device for generation of the text description.

Wherein the related data may include: the target person calls the corresponding vector, the vector corresponding to the picture of the target person region, the vector corresponding to the environment picture obtained by masking the target person region in the target person picture, and the like. The target person region may be understood as a person-related picture region in the target person picture. Masking may be understood as masking a target person region to generate an environment-related picture. The second electronic device can upload each vector data to the first electronic device without uploading the image file in the album application and the target person name, so that the privacy information of the user can be protected in the process of transmitting the data.

In a possible implementation, the second electronic device may upload the related data of the plurality of image files in the album application in batch. In order not to affect the user experience, the data may be uploaded when the user does not use the second electronic device, for example, the second electronic device may upload data at night or midnight, or upload data in a user-defined time period.

After the first electronic device obtains the data uploaded by the second electronic device, the first electronic device can learn the fine-grained target character picture and the image-text pair called by the target character to generate a text description corresponding to the target character picture and provided with the target character name.

For example, before the second electronic device performs person name learning, as shown in fig. 9 described above, the text is described as "standing long-haired girls in front of big trees and antennas under the sun"; after the second electronic device performs person name learning, the text is described as "standing in front of the big tree and antenna in the sun sister" as shown in fig. 11 described above.

(d) And updating the end-side model.

After the large model of the first electronic device obtains the text description about the fine granularity called by the target person, the first electronic device can generate a new small model by processing the large model through distillation and the like, and update the new small model to the first electronic device, so that the small model is updated. In this way, the description of the image file by the small model can be more detailed and accurate.

(2) Learning of the expression of the specific vocabulary.

A special vocabulary is understood to be a user-defined vocabulary, i.e. different users may have different names for one object. For example, in fig. 12, "big tree house" in the manuscript display area 1203 can be understood as a special vocabulary.

For the user-defined special vocabulary, the second electronic device can combine the random sampling method with user feedback to update the model. For the images of the unmatched texts which are not deleted by the user, the small model of the second electronic equipment can take the matched image-text pairs as anchor points, and the unmatched image-text pairs before or after the anchor points are matched in a random sampling mode.

For example, as shown in fig. 12, in the smart clip interface 1201, the text of the image 2 of the image display area 1202 and the text of the document display area 1203 "when playing a mobile phone on a bus" may be a matching graphic pair; image 1 of image display area 1202 and the text "going to scroll home today" of document display area 1203 may be an unmatched pair of text. For convenience of description, the text "to go to a large tree house today" will be referred to as text 1 later.

If the user does not delete image 1, the small model of the second electronic device may match image 1 in a random sampling manner with image 2 as the anchor point.

In a possible implementation, the small model of the second electronic device may accept the matching of the image 1 and the text 1 with a preset probability as true, that is, the small model may accept the image 1 and the text 1 with a preset probability as a matched image-text pair, and learn and train the image 1 and the text 1 set up image-text pair. The preset probability can be set by the small model, and the small model can take different preset probabilities, for example, the preset probability can take 50%, and the specific value of the preset probability is not limited in the embodiment of the application.

Taking the preset probability as 50% as an example, the small model can generate a random number a, and if a is greater than or equal to 50%, the image 1 and the text 1 can be considered as a matched image-text pair; if a is less than 50%, the image 1 and the text 1 are considered as unmatched graphic pairs.

It will be appreciated that if the small model learns the personalized representation of a particular word in an iteration with a high probability when the user uses that word multiple times, matching of the image to the document can be achieved.

For example, if a small model parameter update occurs with image 1 and text 1 as matching pairs, then the similarity calculation for image 1 and text 1 will be higher when image 1 and text 1 reappear, making it more likely that matching will be completed, without the need to determine whether image 1 and text 1 are a pair by random matching of anchor points. If the model is updated by incorrectly regarding image 1 and text 1 as matching pairs, i.e. image 1 and text 1 are essentially irrelevant, then the probability of the next occurrence of image 1 together with text 1 is relatively small.

It will be appreciated that, if the matching pair of graphics is essentially a matching pair of graphics, it is irrelevant that the pair of graphics is not considered as a pair of updating models at a time, and because the pair of graphics is a pair, the number of times the pair of graphics is encountered is more, so that there is always a chance that the pair of graphics is considered as a pair of updating models; however, if the pair of images and texts which are not matched in nature is mistakenly regarded as a pair of data, the pair of images and texts is updated into a model, and therefore the probability that the pair of images and texts appear together is small, and even if the model parameters are updated, the pair of images and texts are difficult to meet at the same time.

The method according to the embodiment of the present application will be described in detail by way of specific examples. The following embodiments may be combined with each other or implemented independently, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 13 shows an image processing method of an embodiment of the present application. The method comprises the following steps:

s1301, the electronic equipment receives text input by a user on a first interface.

In the embodiment of the present application, the electronic device may be understood as the first electronic device in the above embodiment.

The first interface may be understood as an interface that may accept user input text, for example, the first interface may be an interface as described above with respect to fig. 3, and the electronic device may receive user input text in the input document region 304 of fig. 3. The user-entered text may include user-entered text and user-entered speech, where the electronic device may convert the user-entered speech to text.

S1302, the electronic equipment displays a first image file and a second image file in a distinguishing mode, wherein the first image file is an image file matched with keywords in a text in the image file to be processed, and the second image file is an image file not matched with the keywords in the text in the image file to be processed; the first image file is determined based on a text and an image file to be processed for a first model in the electronic equipment, and the first model is obtained by training a sample set according to graphics context; the image-text pair sample set comprises: the sample image includes text corresponding to the sample image, one or more objects in the sample image, and text corresponding to each of the one or more objects.

In the embodiment of the present application, the image file to be processed may be understood as one or more image files selected by a user, for example, the image file to be processed may be an image file displayed in the image display area 502 of fig. 5, and the image file to be processed may include an image 1, an image 2, and an image 3.

The first image file may include one or more image files, and the display of the first image file may refer to the related description in the embodiment corresponding to fig. 5, which is not repeated, for example, the first image file may be the image 2 in fig. 5.

The second image file may include one or more image files, and the display of the second image file may refer to the related description in the embodiment corresponding to fig. 6, which is not repeated, for example, the second image file may be the image 1 in fig. 6. It is understood that the second image file may be determined by the first model in the electronic device based on the text and the image file to be processed, or may be an image file obtained by removing the first image file from the image file to be processed. The specific manner of determining the second image file is not limited in the embodiment of the present application.

The image-text pair sample set may be constructed by the electronic device, and the construction of the image-text pair sample set may refer to the related description in step S801 of the embodiment corresponding to fig. 8, which is not repeated. The image-text pair sample set can be understood as an image-text data set in step S801, and the sample image can be understood as a picture of the original image-text data pair; the text corresponding to the sample image can be understood as the language text with fine granularity corresponding to the sample image, for example, the text corresponding to the sample image can be "the girl stands in front of the big tree and the antenna under the sun" in fig. 9; one or more objects in the sample image may be understood as images of fine-grained objects in the sample image; the text corresponding to each of the one or more objects may be understood as text corresponding to an image of a fine-grained object, e.g., the object may be the girl, the tree, the antenna, the sun, etc. as described above in fig. 9.

In the embodiment of the application, the electronic equipment can support receiving the text input by the user in the interface and can match the text input by the user with the image file to be processed, thereby realizing the association between the text and the image file.

Optionally, on the basis of the embodiment corresponding to fig. 13, the text corresponding to the sample image is obtained by: identifying one or more objects in the sample image; obtaining text labels of one or more objects; and obtaining a text corresponding to the sample image based on the text labels of the one or more objects and the source text of the sample image, wherein the source text of the sample image is a text which is obtained in advance and used for describing the sample image, and keywords in the source text of the sample image are less than keywords in the text corresponding to the sample image.

In the embodiment of the present application, the obtaining manner of the text corresponding to the sample image may refer to the related description in the embodiment corresponding to fig. 8 and fig. 9, which is not repeated. For example, one or more objects in the sample image may be identified by an object detector; obtaining text labels of one or more objects through a multi-mode zero sample picture classifier; the text corresponding to the sample image can be obtained through a large language model.

The source text of the sample image may be understood as the text in the original teletext data pair in the embodiment corresponding to fig. 9 described above, for example, the source text of the sample image may be the "girl and treelet" in fig. 9 described above.

The sample image can obtain texts corresponding to the sample image with fine granularity through the object detector, the multi-mode zero sample image classifier and the large language model, so that the input texts can be better matched with the texts corresponding to the sample image, and the correlation between the texts and the image files is realized.

Optionally, on the basis of the embodiment corresponding to fig. 13, the method may further include: updating a first model according to an image-text pair formed by a target object and a target text label of the target object, wherein the target object and the target text label are obtained in advance in a target application, and the updated first model has the capability of matching the target object with the target text label.

In the embodiment of the application, the target application may be an application capable of performing text marking on the image in the electronic device, for example, the target application may be an album application, where the album application may also be referred to as a gallery application.

The target object may be understood as an image of a target application that is text-tagged by a user, and may include an image of a person or other object. For example, the target object may be the target person picture 1101 in fig. 11 described above.

A target text label may be understood as text marked by a user on a target object in a target application, for example, the target text label may be the target person name 1102 in fig. 11 described above.

The first model is updated through the target object and the target text label, so that the updated first model has the capability of matching the target object with the target text label, and the image-text pair marked by the user can be identified.

Optionally, on the basis of the embodiment corresponding to fig. 13, the first model is obtained based on a second model, where the second model includes a model obtained by training a sample set based on graphics context, and the method may further include: acquiring a third image file in the target application, wherein the third image file comprises a target object; uploading relevant data of a third image file to the second model, the relevant data of the third image file comprising: the image of the target object, the target text label and the image file obtained after the target object is removed from the third image file; training the second model based on the related data of the third image file to obtain a text corresponding to the third image file; the first model is updated based on the text corresponding to the third image file, and the updated first model has the capability of matching the third image file with the text corresponding to the third image file.

In the embodiment of the present application, the first model may be obtained after compressing the second model, may not be obtained after compressing the second model, or may be obtained after performing other processing on the second model.

The second model may be implemented on a device with a relatively high computing capability, and in a possible implementation manner, the second model may be implemented in the first electronic device, or may be implemented in the second electronic device, or may be deployed in the cloud device.

The text corresponding to the third image file obtained by the second model may refer to the related description of the text description with the name of the target person generated in the embodiment (c) corresponding to fig. 11, which is not described herein. Updating the first model based on the text corresponding to the third image file may refer to the related description of updating the end-side model in the embodiment (d) corresponding to fig. 11, which is not repeated.

According to the text corresponding to the third image file, the first model is updated, so that the updated first model has the capability of matching the third image file with the text corresponding to the third image file, and the description of the third image file can be more detailed and accurate by the first model, so that the text input by a user can be better matched with the text corresponding to the third image file.

Alternatively, in the embodiment corresponding to fig. 13, the target object includes a character image, and the target text label includes a text label corresponding to the character image.

In the embodiment of the present application, the character image may be understood as the target character picture 1101 in fig. 11 described above. The target text label may be understood as the target person designation 1102 in fig. 11 described above. The description of the embodiment corresponding to fig. 11 may be referred to specifically, and will not be repeated.

The first model, by learning the person's title, may have the ability to match the person's image to the person's title, thereby identifying the graphic pairs associated with the person marked by the user, so that the first model may describe the image file in more detail and accuracy.

Optionally, on the basis of the embodiment corresponding to fig. 13, the method may further include: the electronic equipment responds to the operation of triggering the second image file by the user, and displays a second interface, wherein the second interface comprises: and the information is used for prompting the second image file to be an image file which is not matched with the text in the image file to be processed.

In the embodiment of the application, the second interface can be understood as the interface of fig. 6, and the user can know that a certain image is not matched with the text in time by displaying the second interface, so that the user can process the unmatched image, and the user experience is improved.

Optionally, on the basis of the embodiment corresponding to fig. 13, the second interface further includes a first button for canceling the display of the second image file; the method may further comprise: and the electronic equipment responds to the operation of triggering the first button by the user, and displays a third interface which does not display the second image file.

In the embodiment of the present application, the first button may be understood as the corresponding delete button in the interface of fig. 6.

The third interface may be understood as an interface that the electronic device displays after the unmatched image is deleted, wherein deleting the unmatched image includes the electronic device not displaying the unmatched image at the third interface. For example, the third interface may be understood as the interface of fig. 7 described above.

The user determines whether the unmatched image file is deleted, so that whether the second image file is the unmatched image file can be further determined, and user experience is improved.

Optionally, based on the embodiment corresponding to fig. 13, the second image file includes a second target image file, and the second interface further includes a second button for the user to hold the second target image file; the method may further comprise: and the electronic equipment responds to the operation of triggering the second button by the user, and displays a fourth interface which displays the second target image file.

In an embodiment of the present application, the second target image file may include one or more image files.

The second button may be understood as the corresponding hold button in the interface of fig. 6 described above.

The fourth interface may be understood as an interface that retains the unmatched images, and the fourth interface may be displayed the same as or different from the first interface, and in particular, the display of the fourth interface is not limited by the embodiment of the present application.

Optionally, on the basis of the embodiment corresponding to fig. 13, the first image file includes a first target image file, the first target image file is adjacent to the second target image file in a first direction, the text includes a first keyword and a second keyword, the first keyword and the second keyword are adjacent in the first direction, the first target image file and the first keyword are matched graphics-text pairs, and the method may further include: and updating the first model according to the second target image file and the second keyword, wherein the updated first model learns the capability of matching the second target image file with the second keyword.

In an embodiment of the present application, the first target image file may include one or more image files.

The first direction may include a left direction of the first target image file and a right direction of the second target image file, which are not limited by the embodiment of the present application.

The first model is updated according to the second target image file and the second keyword, so that the updated first model can learn the capability of matching the second target image file with the second keyword.

Optionally, before updating the first model according to the second target image file and the second keyword, on the basis of the embodiment corresponding to fig. 13, the method may further include: generating a random number; updating the first model based on the second target image file and the second keyword may include: if the random number is larger than or equal to the preset value, the first model is updated according to the second target image file and the second keyword.

In the embodiment of the present application, the preset value may be a preset value in the first model, and the preset value may take different values, for example, the preset value may take 50%, which is not limited in the embodiment of the present application. The preset value may be understood as a description of the preset probability in the embodiment corresponding to fig. 12, which is not repeated.

The first model is updated according to the second target image file and the second keyword, so that the first model learns the second keyword in iteration with a certain probability, and the matching of the second target image file and the second keyword is realized.

Optionally, on the basis of the embodiment corresponding to fig. 13, before the electronic device receives the text input by the user at the first interface, the method may further include: the electronic equipment displays a fifth interface, wherein the fifth interface comprises an image file and a third button, and the image file is in a state that the image file cannot be selected; the electronic equipment responds to the operation of triggering the third button by a user, and a sixth interface is displayed, wherein the sixth interface comprises an image file and a fourth button, and the image file is in a selectable state; the electronic equipment responds to the user to select an image file to be processed from the image files of the sixth interface, and triggers the operation of a fourth button, and the first interface is displayed, wherein the first interface comprises an area for displaying the image file to be processed, a text display area, a text input area and a fifth button; the electronic device receiving text entered by a user at a first interface may include: the electronic equipment receives text input by a user in a text input area of a first interface; after the electronic device receives the text input by the user at the first interface, the electronic device may further include: the electronic equipment displays the text in a text display area of the first interface; the electronic device for displaying the first image file and the second image file in a distinguishing way may include: the electronic device responds to the operation of triggering the fifth button by the user, and the electronic device displays the first image file and the second image file in a distinguishing mode in the seventh interface.

In the embodiment of the present application, the display of the one-touch function in the electronic device may refer to the related description of the corresponding embodiment of fig. 1 to 5, and will not be repeated. Wherein, the fifth interface may be the interface corresponding to fig. 1, and the third button may be the one-key pad 102 in fig. 1; the sixth interface may be the interface corresponding to fig. 2, and the fourth button may be the submit button 202 in fig. 2; in the first interface, the area for displaying the image file to be processed may be the image display area 302 in fig. 3, the text display area may be the document display area 303 in fig. 3, the text input area may be the input document area 304 in fig. 3, and the fifth button may be the smart clip button 305 in fig. 3; the seventh interface may be the interface corresponding to fig. 5 described above.

Through the interfaces, the implementation of the application can support receiving the text input by the user in the one-key-pad interface and realize the association of the image file and the text.

Optionally, on the basis of the embodiment corresponding to fig. 13, the first model is a model obtained by compressing a second model, and the second model includes a model obtained by training a sample set through graphics-text by using a multi-mode contrast learning method.

In the embodiment of the present application, the first model may be a model obtained by compressing the second model by a compression algorithm, and the compression algorithm may be a distillation method in the embodiment corresponding to fig. 8. The distillation method may refer to the related description in step S803, and will not be described again. The distillation method can reduce the size of the model while maintaining higher learning accuracy, and save the occupied space of the memory.

The method of multi-mode contrast learning may refer to the description of the embodiment corresponding to fig. 10, and will not be repeated. The output result of the second model can be close to the correct value by the multi-mode contrast learning method, so that the image-text pairs can be matched relatively accurately.

It can be understood that the image processing method provided by the embodiment of the application can be applied to other scenes with image-text matching, for example, the image processing method can be used for the retrieval function of image files. By way of example, the user can input the keyword in the search bar of the application, the application can search the image file matched with the keyword and display the image file in the interface of the electronic device, so that the user can conveniently and quickly search the image file related to the keyword, and the user experience is improved. The interface display of the retrieval function of the specific image file is not limited in the embodiment of the present application.

The embodiment of the application also provides a video generation method. The method comprises the following steps:

the electronic equipment responds to the operation of the user for indicating to generate the video file, and the target video is generated by adopting the first image file obtained by the image processing method provided by the embodiment of the application.

In the embodiment of the present application, the operation of the user for indicating to generate the video file may include an operation of clicking the one-key button 504 in a of fig. 5, an operation of clicking the one-key button 703 in a of fig. 7, and an operation of triggering the electronic device to generate the video file.

The target video may be understood as a video generated by the electronic device by using the first image file obtained by the image processing method provided by the embodiment of the present application. For example, the target video may be understood as video 506 in b of fig. 5, and may also be understood as video 705 in b of fig. 7.

The video generated by the image processing method can comprise the matched images and the characters, and a better video effect is displayed for the user.

The embodiment of the application also provides a video display method. The method comprises the following steps:

the electronic equipment displays an eighth interface, wherein the eighth interface comprises an image file and a sixth button, and the image file is in a state that the image file cannot be selected; the electronic equipment responds to the operation of triggering the sixth button by a user, and displays a ninth interface, wherein the ninth interface comprises an image file and a seventh button, and the image file is in a selectable state; the electronic equipment responds to the user to select an image file to be processed from the image files of the ninth interface, and triggers the operation of a seventh button, and a tenth interface is displayed, wherein the tenth interface comprises an area for displaying the image file to be processed, a text display area, a text input area and an eighth button; the electronic equipment receives text input by a user in a text input area of a tenth interface; the electronic equipment displays the text in a text display area of the tenth interface; the electronic equipment responds to the operation of triggering the eighth button by a user to display an eleventh interface, the eleventh interface comprises a ninth button, the electronic equipment displays a fourth image file and a fifth image file in the eleventh interface in a distinguishing mode, wherein the fourth image file is an image file matched with keywords in a text in the image file to be processed, and the fifth image file is an image file not matched with the keywords in the text in the image file to be processed; the electronic device displays a video file including a fourth image file in response to a user triggering operation of the ninth button.

In the embodiment of the present application, the eighth interface may be understood as the interface 101 of the album application corresponding to fig. 1, and the sixth button may be the one-key-slice button 102 in fig. 1. The specific eighth interface may refer to the related description in the embodiment corresponding to fig. 1, which is not repeated.

The ninth interface may be understood as the corresponding select picture and video interface 201 of fig. 2, and the seventh button may be the submit button 202 of fig. 2. The specific ninth interface may refer to the related description in the embodiment corresponding to fig. 2, which is not described herein.

The tenth interface may be understood as the smart clip interface 301 corresponding to fig. 3, the area for displaying the image file to be processed may be the image display area 302 in fig. 3, the text display area may be the document display area 303 in fig. 3, the text input area may be the input document area 304 in fig. 3, and the eighth button may be the smart clip button 305 in fig. 3. The tenth interface may refer to the related description in the embodiment corresponding to fig. 3, and will not be described again.

The eleventh interface may be understood as the corresponding smart clip interface 401 of fig. 4, and the ninth button may be the smart clip button 403 of fig. 4. The eleventh interface may refer to the related description in the embodiment corresponding to fig. 4, which is not repeated. The fourth image file may be the first image file in the embodiment corresponding to fig. 13, and the specific fourth image file may refer to the description related to the first image file in the embodiment corresponding to fig. 13, which is not repeated. The fifth image file may be the second image file in the embodiment corresponding to fig. 13, and the specific fifth image file may refer to the description related to the second image file in the embodiment corresponding to fig. 13, which is not repeated.

The video file may be understood as video 506 in b of fig. 5, and may also be understood as video 705 in b of fig. 7. For a description of the video file, reference may be made to the above description of the corresponding embodiment of fig. 5 and/or fig. 7, which is not repeated.

It is understood that the video file may or may not include the fifth image file. For example, when the electronic device learns the fifth image file using the image processing method described above, the video file may include the fifth image file. When the user deletes the fifth image file or the electronic device does not learn the fifth image file by using the above-described image processing method, the video file may not include the fifth image file.

The embodiment of the application generates the video according to the matched image-text pairs, and can better match the text input by the user with the image file, thereby realizing the association of the text and the image in the video.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

The foregoing description of the solution provided by the embodiments of the present application has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the present application may be implemented in hardware or a combination of hardware and computer software, as the method steps of the examples described in connection with the embodiments disclosed herein. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application can divide the functional modules of the device for realizing the method according to the method example, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.

Fig. 14 is a schematic structural diagram of a chip according to an embodiment of the present application. Chip 1400 includes one or more (including two) processors 1401, communication lines 1402, communication interfaces 1403, and memory 1404.

In some implementations, the memory 1404 stores the following elements: executable modules or data structures, or a subset thereof, or an extended set thereof.

The methods described in the embodiments of the present application described above may be applied to the processor 1401 or implemented by the processor 1401. The processor 1401 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry of hardware in the processor 1401 or instructions in the form of software. The processor 1401 as described above may be a general purpose processor (e.g., a microprocessor or a conventional processor), a digital signal processor (digital signal processing, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), an off-the-shelf programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, discrete gates, transistor logic, or discrete hardware components, and the processor 1401 may implement or perform the methods, steps, and logic blocks related to the processes disclosed in the embodiments of the present application.

The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a state-of-the-art storage medium such as random access memory, read-only memory, programmable read-only memory, or charged erasable programmable memory (electrically erasable programmable read only memory, EEPROM). The storage medium is located in the memory 1404, and the processor 1401 reads the information in the memory 1404 and performs the steps of the method in combination with its hardware.

The processor 1401, the memory 1404, and the communication interface 1403 can communicate with each other via a communication line 1402.

In the above embodiments, the instructions stored by the memory for execution by the processor may be implemented in the form of a computer program product. The computer program product may be written in the memory in advance, or may be downloaded in the form of software and installed in the memory.

Embodiments of the present application also provide a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL), or wireless (e.g., infrared, wireless, microwave, etc.), or semiconductor medium (e.g., solid state disk, SSD)) or the like.

The embodiment of the application also provides a computer readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. Computer readable media can include computer storage media and communication media and can include any medium that can transfer a computer program from one place to another. The storage media may be any target media that is accessible by a computer.

As one possible design, the computer-readable medium may include compact disk read-only memory (CD-ROM), RAM, ROM, EEPROM, or other optical disk memory; the computer readable medium may include disk storage or other disk storage devices. Moreover, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital versatile disc (digital versatile disc, DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. An image processing method, the method comprising:

the electronic equipment displays a first interface, wherein the first interface comprises an area for displaying an image file to be processed, a text display area, a text input area and a fifth button;

the electronic equipment receives text input by a user in the text input area of the first interface; the text comprises a first keyword and a second keyword, wherein the first keyword and the second keyword are adjacent in a first direction;

The electronic equipment displays the text in the text display area of the first interface;

the electronic equipment responds to the operation of triggering the fifth button by a user, and the electronic equipment displays a first image file and a second image file in a seventh interface in a distinguishing mode so as to generate a video file matched with the keywords in the text based on the first image file; the first image file is an image file matched with the keywords in the text in the image file to be processed, and the second image file is an image file not matched with the keywords in the text in the image file to be processed; the first image file is determined by a first model in the electronic equipment based on the text and the image file to be processed, and the first model is obtained by training a sample set according to graphics context; the image-text pair sample set comprises: a sample image and text corresponding to the sample image, wherein one or more objects in the sample image and text corresponding to each of the one or more objects; the first image file comprises a first target image file, and the first target image file and the first keyword are matched image-text pairs; the second image file comprises a second target image file; the first target image file is adjacent to the second target image file in the first direction;

The electronic equipment responds to the operation of reserving the second target image file triggered by the user, and a fourth interface is displayed, wherein the fourth interface displays the second target image file;

and the electronic equipment updates the first model according to the second target image file and the second keyword, and the updated first model learns the capability of matching the second target image file with the second keyword.

2. The method of claim 1, wherein the text corresponding to the sample image is obtained by: identifying one or more objects in the sample image; obtaining text labels of the one or more objects; obtaining a text corresponding to the sample image based on the text labels of the one or more objects and the source text of the sample image, wherein the source text of the sample image is a text which is obtained in advance and used for describing the sample image, and keywords in the source text of the sample image are less than keywords in the text corresponding to the sample image.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

Updating the first model according to an image-text pair formed by a target object and a target text label of the target object, wherein the target object and the target text label are obtained in advance in a target application, and the updated first model has the capability of matching the target object with the target text label.

4. A method according to claim 3, wherein the first model is derived based on a second model comprising a model derived by training a sample set based on the graph, the method further comprising:

acquiring a third image file in the target application, wherein the third image file comprises the target object;

uploading the related data of the third image file to the second model, the related data of the third image file comprising: the image of the target object, the target text label and the image file obtained by removing the target object from the third image file;

training the second model based on the related data of the third image file to obtain a text corresponding to the third image file;

and updating the first model based on the text corresponding to the third image file, wherein the updated first model has the capability of matching the third image file with the text corresponding to the third image file.

5. The method of claim 3, wherein the target object comprises a character image and the target text label comprises a character designation.

6. The method according to claim 1 or 2, characterized in that the method further comprises:

the electronic equipment responds to the operation that the user triggers the second image file, and a second interface is displayed, wherein the second interface comprises: and the information is used for prompting the second image file to be the image file which is not matched with the text in the image file to be processed.

7. The method of claim 6, wherein the second interface further comprises a first button for canceling the display of the second image file; the method further comprises the steps of:

and the electronic equipment responds to the operation of triggering the first button by a user, and displays a third interface, wherein the third interface does not display the second image file.

8. The method of claim 7, wherein the second interface further comprises a second button for a user to hold the second target image file; the method further comprises the steps of:

and the electronic equipment responds to the operation of triggering the second button by a user and displays the fourth interface.

9. The method of claim 8, wherein before updating the first model based on the second target image file and the second keyword, further comprising:

the first model generates a random number;

the updating the first model according to the second target image file and the second keyword comprises:

and if the random number is larger than or equal to a preset value, updating the first model according to the second target image file and the second keyword.

10. The method of claim 1, wherein prior to the electronic device displaying the first interface, further comprising:

the electronic equipment displays a fifth interface, wherein the fifth interface comprises an image file and a third button, and the image file is in a state that the image file cannot be selected;

the electronic equipment responds to the operation of triggering the third button by a user, and a sixth interface is displayed, wherein the sixth interface comprises the image file and a fourth button, and the image file is in a selectable state;

and the electronic equipment responds to the selection of the image file to be processed in the image files of the sixth interface by a user, and triggers the operation of the fourth button.

11. The method according to claim 1, wherein the first model is a model obtained by compressing a second model, and the second model comprises a model obtained by training the sample set by using multi-modal contrast learning.

12. A method of video generation, the method comprising:

the electronic device generates a target video using the first image file as obtained in any one of claims 1 to 11 in response to a user operation for instructing generation of a video file.

13. A method of video display, the method comprising:

the electronic equipment displays an eighth interface, wherein the eighth interface comprises an image file and a sixth button, and the image file is in a state that cannot be selected;

the electronic equipment responds to the operation of triggering the sixth button by a user, and displays a ninth interface, wherein the ninth interface comprises the image file and a seventh button, and the image file is in a selectable state;

the electronic equipment responds to the fact that a user selects an image file to be processed from the image files in the ninth interface and triggers the operation of the seventh button, a tenth interface is displayed, and the tenth interface comprises an area for displaying the image file to be processed, a text display area, a text input area and an eighth button;

The electronic equipment receives text input by a user in the text input area of the tenth interface; the text comprises a first keyword and a second keyword, wherein the first keyword and the second keyword are adjacent in a first direction;

the electronic equipment displays the text in the text display area of the tenth interface;

the electronic equipment responds to the operation of triggering the eighth button by a user, an eleventh interface is displayed, the eleventh interface comprises a ninth button, and fourth image files and fifth image files are displayed in the eleventh interface in a distinguishing mode, wherein the fourth image files are image files matched with keywords in the text in the image files to be processed, and the fifth image files are image files not matched with the keywords in the text in the image files to be processed; the fourth image file is determined by a first model in the electronic equipment based on the text and the image file to be processed, and the first model is obtained by training a sample set according to graphics context; the image-text pair sample set comprises: a sample image and text corresponding to the sample image, wherein one or more objects in the sample image and text corresponding to each of the one or more objects; the fourth image file comprises a first target image file, and the first target image file and the first keyword are matched image-text pairs; the fifth image file includes a second target image file; the first target image file is adjacent to the second target image file in the first direction;

the electronic equipment updates the first model according to the second target image file and the second keyword, and the updated first model learns the capability of matching the second target image file with the second keyword;

and the electronic equipment responds to the operation of triggering the ninth button by a user, and displays a video file, wherein the video file comprises the fourth image file.

14. An electronic device, comprising: a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the method of any one of claims 1-13.

15. A computer readable storage medium storing instructions that, when executed, cause a computer to perform the method of any one of claims 1-13.

16. A chip comprising a processor for executing a computer program such that the chip performs the method of any of claims 1-13.