CN113486903A - Image processing method, image processing device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113486903A
Authority
CN
China
Prior art keywords
matching
image
network model
template
album
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110730097.1A
Other languages
Chinese (zh)
Inventor
袁欢
谈建超
刘霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110730097.1A priority Critical patent/CN113486903A/en
Publication of CN113486903A publication Critical patent/CN113486903A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to an image processing method, an image processing apparatus, an electronic device, and a computer-readable storage medium, wherein the method comprises: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, and performing feature extraction on the template images and template music in a plurality of album templates through a second deep neural network model and a third deep neural network model to obtain second image features and audio features; matching the first image features with the second image features and audio features of the plurality of album templates through a matching network model to obtain a plurality of matching values between the input image and the plurality of album templates; and determining the album template matched with the input image according to the plurality of matching values. The present disclosure solves the problem in the related art that album templates are difficult to match, which results in a poor user experience, and achieves the effect of doubly matching the input image against both the images and the music in the album templates.

Description

Image processing method, image processing device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to an image processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, "music album" applications have become increasingly popular in recent years. A music album application provides an album template and background music for the image queue input by a user: the album template enriches the visual effect of the input images, while the background music complements the album template and conveys the emotion and mood of the music album.
Generally, such applications offer a large number of album templates and background music tracks of many kinds, making it difficult for users to choose among them, which results in a poor user experience.
Disclosure of Invention
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a computer-readable storage medium, so as to at least solve the problem in the related art that album templates are difficult to match, resulting in a poor user experience. The technical solution of the present disclosure is as follows:
According to a first aspect of the embodiments of the present disclosure, an image processing method is provided, comprising: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on the template images in a plurality of album templates through a second deep neural network model to obtain second image features, and performing feature extraction on the template music in the plurality of album templates through a third deep neural network model to obtain audio features; matching the first image features with the second image features and audio features of the plurality of album templates through a matching network model to obtain a plurality of matching values between the input image and the plurality of album templates; and determining the album template matched with the input image according to the plurality of matching values.
Optionally, matching the first image features with the second image features and audio features of the plurality of album templates through the matching network model to obtain the plurality of matching values comprises: combining the first image features with the second image features of each of the album templates to obtain first combined features; matching the first combined features through a first matching sub-network model to obtain first matching output values between the input image and the template images of the album templates; combining the first image features with the audio features of each of the album templates to obtain second combined features; matching the second combined features through a second matching sub-network model to obtain second matching output values between the input image and the template music of the album templates; and determining the plurality of matching values between the input image and the album templates according to the first matching output values and the second matching output values, wherein the matching network model comprises the first matching sub-network model and the second matching sub-network model.
Optionally, the matching network model is obtained as follows: obtaining a sample set comprising a positive sample set and a negative sample set, wherein the positive sample set comprises a plurality of positive sample pairs, each consisting of a historical input image and an album template that was selected for it, and the negative sample set comprises a plurality of negative sample pairs, each consisting of a historical input image and an album template that was not selected; and training an initial network model with the sample set to obtain the matching network model.
Optionally, training the initial network model with the sample set to obtain the matching network model comprises: generating a loss function for the initial network model; and training the initial network model by minimizing the loss function to obtain the matching network model.
Optionally, generating the loss function of the initial network model comprises: generating a ranking loss function of the initial network model and generating a binary classification loss function of the initial network model, wherein the ranking loss function is determined from the difference between the matching value of a positive sample pair and the matching value of a negative sample pair, and the binary classification loss function is determined from the predicted matching values of the positive and negative sample pairs.
In a second aspect of the embodiments of the present disclosure, there is provided an image processing method, including: receiving an input image through an interactive interface; displaying, on the interactive interface, a plurality of matching values between the input image and a plurality of album templates, wherein the matching values are obtained by matching first image features with second image features and audio features of the plurality of album templates through a matching network model, the first image features are obtained by performing feature extraction on the input image through a first deep neural network model, the second image features are obtained by performing feature extraction on the template images in the plurality of album templates through a second deep neural network model, and the audio features are obtained by performing feature extraction on the template music in the plurality of album templates through a third deep neural network model; and highlighting, on the interactive interface, the album template matched with the input image, wherein the album template matched with the input image is determined from the matching values corresponding to the plurality of album templates.
In a third aspect of the embodiments of the present disclosure, there is provided an image processing method, including: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on each image in a plurality of images through a second deep neural network model to obtain second image features of each image, and performing feature extraction on each piece of music in a plurality of pieces of music through a third deep neural network model to obtain audio features of each piece of music; matching the first image features with the second image features of each image through an image matching network model to obtain a template image matched with the input image; matching the first image features with the audio features of each piece of music through an audio matching network model to obtain template music matched with the input image; and generating an album template from the template image and the template music.
In a fourth aspect of the embodiments of the present disclosure, there is provided an image processing method, including: receiving an input image through an interactive interface; displaying, on the interactive interface, a template image matched with the input image, wherein the template image is obtained by matching first image features with the second image features of each image in a plurality of images through an image matching network model, the first image features are obtained by performing feature extraction on the input image through a first deep neural network model, and the second image features of each image are obtained by performing feature extraction on that image through a second deep neural network model; displaying, on the interactive interface, template music matched with the input image, wherein the template music is obtained by matching the first image features with the audio features of each piece of music in a plurality of pieces of music through an audio matching network model, and the audio features of each piece of music are obtained by performing feature extraction on that piece of music through a third deep neural network model; and displaying, on the interactive interface, an album template generated from the template image and the template music.
In a fifth aspect of the embodiments of the present disclosure, there is provided an image processing apparatus comprising: a first extraction module for performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image; a second extraction module for performing feature extraction on the template images in a plurality of album templates through a second deep neural network model to obtain second image features; a third extraction module for performing feature extraction on the template music in the plurality of album templates through a third deep neural network model to obtain audio features; a first matching module for matching the first image features with the second image features and audio features of the plurality of album templates through a matching network model to obtain a plurality of matching values between the input image and the plurality of album templates; and a first determining module for determining the album template matched with the input image according to the plurality of matching values.
Optionally, the first matching module comprises: a first processing unit for combining the first image features with the second image features of the plurality of album templates to obtain first combined features, and matching the first combined features through a first matching sub-network model to obtain first matching output values between the input image and the template images of the plurality of album templates; a second processing unit for combining the first image features with the audio features of the plurality of album templates to obtain second combined features, and matching the second combined features through a second matching sub-network model to obtain second matching output values between the input image and the template music of the plurality of album templates; and a determining unit for determining the plurality of matching values between the input image and the plurality of album templates according to the first matching output values and the second matching output values, wherein the matching network model comprises the first matching sub-network model and the second matching sub-network model.
Optionally, the apparatus further includes a training module configured to obtain the matching network model as follows: obtaining a sample set comprising a positive sample set and a negative sample set, wherein the positive sample set comprises a plurality of positive sample pairs, each consisting of a historical input image and an album template that was selected for it, and the negative sample set comprises a plurality of negative sample pairs, each consisting of a historical input image and an album template that was not selected; and training an initial network model with the sample set to obtain the matching network model.
Optionally, the training module comprises: a generating unit for generating a loss function of the initial network model; and a training unit for training the initial network model by minimizing the loss function to obtain the matching network model.
Optionally, the generating unit includes a generating subunit for generating a ranking loss function of the initial network model and a binary classification loss function of the initial network model, wherein the ranking loss function is determined from the difference between the matching value of a positive sample pair and the matching value of a negative sample pair, and the binary classification loss function is determined from the predicted matching values of the positive and negative sample pairs.
In a sixth aspect of the embodiments of the present disclosure, there is provided an image processing apparatus comprising: a first receiving module for receiving an input image through an interactive interface; a first display module for displaying, on the interactive interface, a plurality of matching values between the input image and a plurality of album templates, wherein the matching values are obtained by matching first image features with second image features and audio features of the plurality of album templates through a matching network model, the first image features are obtained by performing feature extraction on the input image through a first deep neural network model, the second image features are obtained by performing feature extraction on the template images in the album templates through a second deep neural network model, and the audio features are obtained by performing feature extraction on the template music in the album templates through a third deep neural network model; and a second display module for highlighting, on the interactive interface, the album template matched with the input image, wherein the album template matched with the input image is determined from the matching values corresponding to the plurality of album templates.
In a seventh aspect of the embodiments of the present disclosure, there is provided an image processing apparatus comprising: a second extraction module for performing feature extraction on an input image through the first deep neural network model to obtain first image features of the input image, performing feature extraction on each image in the plurality of images through the second deep neural network model to obtain second image features of each image, and performing feature extraction on each piece of music in the plurality of pieces of music through the third deep neural network model to obtain audio features of each piece of music; a second matching module for matching the first image features with the second image features of each image through an image matching network model to obtain a template image matched with the input image; a third matching module for matching the first image features with the audio features of each piece of music through an audio matching network model to obtain template music matched with the input image; and a generation module for generating the album template from the template image and the template music.
In an eighth aspect of the embodiments of the present disclosure, there is provided an image processing apparatus comprising: a second receiving module for receiving an input image through an interactive interface; a third display module for displaying, on the interactive interface, a template image matched with the input image, wherein the template image is obtained by matching first image features with the second image features of each image in a plurality of images through an image matching network model, the first image features are obtained by performing feature extraction on the input image through a first deep neural network model, and the second image features of each image are obtained by performing feature extraction on that image through a second deep neural network model; a fourth display module for displaying, on the interactive interface, template music matched with the input image, wherein the template music is obtained by matching the first image features with the audio features of each piece of music in a plurality of pieces of music through an audio matching network model, and the audio features of each piece of music are obtained by performing feature extraction on that piece of music through a third deep neural network model; and a fifth display module for displaying, on the interactive interface, an album template generated from the template image and the template music.
In a ninth aspect of the disclosed embodiments, there is provided an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement any of the image processing methods.
In a tenth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform any one of the image processing methods.
In an eleventh aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program that, when executed by a processor, implements any of the image processing methods.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects:
First image features of an input image are extracted, along with second image features of the template images in an album template and audio features of the template music in the album template; the first image features are matched with the second image features and the audio features to determine a matching value between the input image and the album template. After the input image has been matched against a plurality of album templates to obtain a plurality of matching values, the album template matched with the input image is determined from those values. The input image is thus matched jointly against both the images and the music in the album templates, which effectively improves the matching accuracy. This solves the problem in the related art that album templates are difficult to match, which leads to a poor user experience, and achieves the effect of doubly matching the input image against both the images and the music in the album templates.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a block diagram illustrating a hardware configuration of a computer terminal for implementing an image processing method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a first image processing method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a second image processing method according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating a third image processing method according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating a fourth image processing method according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating a fifth image processing method according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating the structure of a matching network according to an exemplary embodiment.
Fig. 8 is a diagram illustrating model training of a matching network according to an exemplary embodiment.
Fig. 9 is a diagram illustrating the matching effect of a matching network according to an exemplary embodiment.
Fig. 10 is a block diagram illustrating a first image processing apparatus according to an exemplary embodiment.
Fig. 11 is a block diagram illustrating a second image processing apparatus according to an exemplary embodiment.
Fig. 12 is a block diagram illustrating a third image processing apparatus according to an exemplary embodiment.
Fig. 13 is a block diagram illustrating a fourth image processing apparatus according to an exemplary embodiment.
Fig. 14 is a block diagram illustrating a terminal according to an exemplary embodiment.
Fig. 15 is a block diagram illustrating a server configuration according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Example 1
According to an embodiment of the present disclosure, a method embodiment of an image processing method is presented. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The method provided in Embodiment 1 of the present disclosure can be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 is a block diagram illustrating a hardware configuration of a computer terminal (or mobile device) for implementing an image processing method according to an exemplary embodiment. As shown in fig. 1, the computer terminal 10 (or mobile device) may include one or more processors 102 (shown as 102a, 102b, …, 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device for communication functions. It may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). In the disclosed embodiments, the data processing circuit acts as a kind of processor control (e.g., the selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the image processing method in the embodiments of the present disclosure. The processor 102 executes various functional applications and data processing, that is, implements the image processing method described above, by running the software programs and modules stored in the memory 104. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. It should be noted that fig. 1 is only one particular example, intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In the above operating environment, the present disclosure provides an image processing method as shown in fig. 2. Fig. 2 is a flowchart illustrating a first image processing method according to an exemplary embodiment; the method is used in the computer terminal described above and, as shown in fig. 2, includes the following steps.
In step S21, feature extraction is performed on the input image through the first deep neural network model to obtain first image features of the input image, feature extraction is performed on template images in the plurality of album templates through the second deep neural network model to obtain second image features, and feature extraction is performed on template music in the plurality of album templates through the third deep neural network model to obtain audio features.
In step S22, the first image feature is matched with the second image feature and the audio feature of the plurality of album templates, respectively, through the matching network model, so as to obtain a plurality of matching values between the input image and the plurality of album templates, respectively.
In step S23, an album template matching the input image is determined based on the plurality of matching values.
With this processing, the first image features of the input image, the second image features of the template images in the album templates, and the audio features of the template music in the album templates are extracted, and the first image features are matched with the second image features and the audio features respectively to determine a matching value between the input image and each album template. After the input image has been matched against a plurality of album templates to obtain a plurality of matching values, the album template matched with the input image can be determined from those values. The input image is thereby matched jointly against both the images and the music in the album templates, which effectively improves the matching accuracy. This not only solves the problem in the related art that album templates are difficult to match, which leads to a poor user experience, but also achieves a double matching of the input image against both the images and the music in the album templates, effectively improving the matching accuracy and the user experience.
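As an illustration, the following is a minimal sketch of steps S21 to S23 in PyTorch-style Python; all names (img_net_1, img_net_2, audio_net, matcher, the template dictionary keys) are hypothetical placeholders for the models and data described above, not the actual implementation of this disclosure:
```python
import torch

def match_album_templates(input_image, templates, img_net_1, img_net_2, audio_net, matcher):
    # Step S21: extract first image features from the input image
    f_input = img_net_1(input_image)
    scores = []
    for tpl in templates:
        # Step S21 (continued): second image features and audio features per template
        f_tpl_img = img_net_2(tpl["template_images"])
        f_tpl_audio = audio_net(tpl["template_music"])
        # Step S22: one matching value per (input image, album template) pair
        scores.append(matcher(f_input, f_tpl_img, f_tpl_audio))
    # Step S23: the template with the highest matching value matches the input image
    return templates[int(torch.stack(scores).argmax())]
```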
In one or more alternative embodiments, the input image may be in various forms, for example, a static picture, a dynamic video image, text, sound, or animation content, etc.
In one or more alternative embodiments, the input image may be an image queue, and the image in the image queue may be only one type of image, for example, a picture queue formed by a plurality of still pictures; the images in the image queue may also be a combination of multiple forms of images, for example, a queue of one or more still pictures and one or more video images, etc.
In one or more alternative embodiments, the first deep neural network model, the second deep neural network model, and the third deep neural network model may be deep neural network models that are already relatively mature. It should be noted that, the first deep neural network model and the second deep neural network model may be the same deep neural network model for image feature extraction.
In the related art, some features may be extracted manually, and clustering or correlation analysis may be performed on the extracted features to achieve sound-picture matching. However, this approach requires considerable manual intervention, and the extracted features cannot fully characterize the data, so the matching effect is poor. Moreover, when features are extracted from only the template images of an album template, or from only its template music, and only that single modality is compared, it cannot be fully verified whether the album template matches the input image, which also degrades the matching effect. In the optional embodiments of the present disclosure, deep neural network models are used to extract features from both the template images and the template music of the album template, and matching is performed on both sets of features simultaneously, so that the matching degree between the album template and the input image is determined more accurately.
In one or more optional embodiments, when matching a suitable album template to an input image, the input image may first be matched against a plurality of album templates to obtain a matching value for each of them, and the most suitable album template may then be selected according to the matching values. For example, the first image features may be matched with the second image features and audio features of the plurality of album templates through a matching network model to obtain a plurality of matching values between the input image and the plurality of album templates, and the album template matched with the input image may be determined according to those matching values. It should be noted that the album templates may be provided by a "music album" application, by a database outside the application, by a remote network device through a network link, or the like.
In one or more optional embodiments, various manners may be adopted when matching the first image features with the second image features and audio features of the plurality of album templates through the matching network model, for example the following: combining the first image features with the second image features of the plurality of album templates to obtain first combined features; matching the first combined features through a first matching sub-network model to obtain first matching output values between the input image and the template images of the plurality of album templates; combining the first image features with the audio features of the plurality of album templates to obtain second combined features; matching the second combined features through a second matching sub-network model to obtain second matching output values between the input image and the template music of the plurality of album templates; and determining the plurality of matching values between the input image and the plurality of album templates from the first matching output values and the second matching output values, wherein the matching network model comprises the first matching sub-network model and the second matching sub-network model. In this manner, the first matching sub-network model matches the input image against the template images of the album templates, while the second matching sub-network model matches the input image against the template music, i.e., the template images and the template music are each matched by a dedicated network model. Compared with matching the template images and template music of an album template together, this targeted matching can improve the model's matching accuracy to some extent.
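A minimal sketch of such a matching network model with two matching sub-networks, assuming feature concatenation as the combination step and small MLPs as the sub-networks; the 2048- and 200-dimensional inputs follow the feature sizes given later in this description, and all layer widths are assumptions:
```python
import torch
import torch.nn as nn

class MatchingNetwork(nn.Module):
    def __init__(self, img_dim=2048, audio_dim=200, hidden=256):
        super().__init__()
        # first matching sub-network: scores (input image, template image) pairs
        self.image_matcher = nn.Sequential(
            nn.Linear(img_dim + img_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # second matching sub-network: scores (input image, template music) pairs
        self.audio_matcher = nn.Sequential(
            nn.Linear(img_dim + audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, f_input, f_tpl_img, f_tpl_audio):
        first_combined = torch.cat([f_input, f_tpl_img], dim=-1)     # first combined features
        second_combined = torch.cat([f_input, f_tpl_audio], dim=-1)  # second combined features
        first_out = self.image_matcher(first_combined)               # first matching output value
        second_out = self.audio_matcher(second_combined)             # second matching output value
        return first_out + second_out                                # overall matching value
```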
In one or more alternative embodiments, the matching network model may be obtained in various manners, for example as follows: obtaining a sample set comprising a positive sample set and a negative sample set, wherein the positive sample set comprises a plurality of positive sample pairs, each consisting of a historical input image and the album template selected for it, and the negative sample set comprises a plurality of negative sample pairs, each consisting of a historical input image and an album template that was not selected; and training an initial network model with the sample set to obtain the matching network model. A historical input image together with the album template the user selected for it forms a positive sample; a historical input image together with an album template the user did not select forms a negative sample. The sample set formed from these positive and negative pairs is real data reflecting user choices, i.e., it contains high-quality data, so the matching network model trained on it can, to some extent, also exhibit good performance.
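A sketch of the sample-set construction, assuming the user history is available as (input images, selected template, candidate templates) records; this record layout is hypothetical:
```python
def build_sample_set(history):
    positives, negatives = [], []
    for images, chosen, candidates in history:
        positives.append((images, chosen))        # pair with the user-selected template
        for tpl in candidates:
            if tpl is not chosen:
                negatives.append((images, tpl))   # pair with a template that was not selected
    return positives, negatives
```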
In one or more optional embodiments, training the initial network model with the sample set to obtain the matching network model includes: generating a loss function for the initial network model; and training the initial network model by minimizing the loss function to obtain the matching network model. By generating a loss function for training and continuously iterating the model parameters during training so as to minimize the loss value, model training proceeds toward an increasingly optimized state, finally yielding the required matching network model. Because the training process can be controlled directly through the loss function, training efficiency can be improved to some extent. It should be noted that the loss function may require the matching values of positive sample pairs to be as large as possible and those of negative sample pairs to be as small as possible. To make training more effective, a high matching-value threshold may be set for positive sample pairs and a low matching-value threshold for negative sample pairs; that is, the matching value of a positive sample pair is required to be large and to exceed the high threshold, while the matching value of a negative sample pair is required to be small and to fall below the low threshold. With this high/low-threshold processing, the performance of the trained matching network model can, to some extent, be guaranteed to meet certain quality requirements: the album templates it matches are more accurate and more readily accepted by users, improving the user experience of the music album.
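One possible form of these threshold constraints, sketched as hinge penalties added to the training loss; the exact formulation and the threshold values are assumptions, since the description above only states that a high threshold is set for positive pairs and a low threshold for negative pairs:
```python
import torch

def threshold_penalty(score_pos, score_neg, t_high=0.8, t_low=0.2):
    # penalize positive pairs whose (sigmoid-squashed) matching value falls below t_high
    pos_term = torch.clamp(t_high - torch.sigmoid(score_pos), min=0.0)
    # penalize negative pairs whose matching value rises above t_low
    neg_term = torch.clamp(torch.sigmoid(score_neg) - t_low, min=0.0)
    return (pos_term + neg_term).mean()
```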
In one or more optional embodiments, generating the loss function for the initial network model comprises: generating a ranking loss function of the initial network model and generating a binary classification loss function of the initial network model, wherein the ranking loss function is determined from the difference between the matching value of a positive sample pair and that of a negative sample pair, and the binary classification loss function is determined from the predicted matching values of the positive and negative sample pairs. When a matching model is trained with machine learning methods, features are generally extracted with a neural network to improve matching precision. However, this approach usually requires a large amount of data to train a classification problem, and when the data pairs are insufficient, the matching effect achieved by a classification network is poor. In view of this, at least two types of loss functions may be generated for the initial network model. Because different types of loss functions have different training targets or tendencies, training the model with several types of loss functions, compared with training with a single loss function, can on the one hand improve the comprehensiveness of training to some extent, and on the other hand effectively reduce the number of samples required.
In one or more alternative embodiments, the loss function of the initial network model may be generated in various ways, for example by generating a Rank (ranking) loss function and a binary classification loss function of the initial network model. It should be noted that the Rank loss function and the binary classification loss function are only examples; combinations of other loss functions for training the matching network model also belong to the present application and are not enumerated here.
In one or more optional embodiments, when the ranking loss function is determined from the difference between the matching value of the positive sample pair and that of the negative sample pair, the ranking loss function of the initial network model may be generated as: Rank_Loss = |Sigmoid(f(x_p) - f(x_n)) - 1|^2, where Rank_Loss is the loss value of the ranking loss function, f(x_p) is the matching value of the positive sample pair, and f(x_n) is the matching value of the negative sample pair.
In one or more alternative embodiments, when the binary classification loss function is determined from the predicted matching values of the positive and negative sample pairs, the binary classification loss function of the initial network model may be constructed as: BCE_Loss = -y·log(Sigmoid(f(x))) - (1 - y)·log(1 - Sigmoid(f(x))), where BCE_Loss is the loss value of the binary classification loss function, f(x) is the matching value of the sample, and y identifies whether the sample is positive or negative: y = 0 indicates a negative sample and y = 1 indicates a positive sample.
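The two loss functions above translate directly into code as follows; combining them as an unweighted sum in total_loss is an assumption, since the disclosure only states that both losses are generated for the initial network model:
```python
import torch

def rank_loss(score_pos, score_neg):
    # Rank_Loss = |Sigmoid(f(x_p) - f(x_n)) - 1|^2
    return (torch.sigmoid(score_pos - score_neg) - 1.0) ** 2

def bce_loss(score, y):
    # BCE_Loss = -y*log(Sigmoid(f(x))) - (1-y)*log(1 - Sigmoid(f(x)))
    p = torch.sigmoid(score)
    return -y * torch.log(p) - (1 - y) * torch.log(1 - p)

def total_loss(score_pos, score_neg):
    # positive pairs carry label y = 1, negative pairs y = 0
    return (rank_loss(score_pos, score_neg)
            + bce_loss(score_pos, torch.ones_like(score_pos))
            + bce_loss(score_neg, torch.zeros_like(score_neg))).mean()
```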
It should be noted that the combination of the Rank loss function and the binary classification loss function is only an example; any combination of two or more types of loss functions may be considered part of the present application.
Fig. 3 is a flowchart illustrating a second image processing method according to an exemplary embodiment, which includes the following steps, as shown in fig. 3.
In step S31, an input image is received through the interactive interface.
In step S32, a plurality of matching values between the input image and a plurality of album templates are displayed on the interactive interface, wherein the matching values are obtained by matching first image features with the second image features and audio features of the plurality of album templates through the matching network model; the first image features are obtained by performing feature extraction on the input image through the first deep neural network model, the second image features are obtained by performing feature extraction on the template images in the plurality of album templates through the second deep neural network model, and the audio features are obtained by performing feature extraction on the template music in the plurality of album templates through the third deep neural network model.
In step S33, an album template matching the input image is highlighted on the interactive interface, wherein the album template matching the input image is determined by a plurality of matching values corresponding to the plurality of album templates.
With this processing, the input image is received through the interactive interface, and the album template matched with the input image is displayed on that interface. When matching an album template to the input image, the first image features of the input image are first extracted, along with the second image features of the template images in the album templates and the audio features of the template music in the album templates; the first image features are then matched with the second image features and the audio features to determine a matching value between the input image and each album template. After the input image has been matched against a plurality of album templates to obtain a plurality of matching values, the album template matched with the input image can be determined from those values, so that the input image is matched jointly against both the images and the music in the album templates, effectively improving matching accuracy. This solves the problem in the related art that album templates are difficult to match, which leads to a poor user experience, and achieves a double matching of the input image against both the images and the music in the album templates. In addition, because the matching of the input image with the album templates is carried out on the interactive interface, with the intermediate results displayed and the matched album template highlighted, the whole matching process is visualized: the user can follow the matching process, which further improves the user experience.
Fig. 4 is a flowchart illustrating a third image processing method according to an exemplary embodiment, which includes the following steps, as shown in fig. 4.
In step S41, feature extraction is performed on the input image through the first deep neural network model to obtain first image features of the input image, feature extraction is performed on each image in a plurality of images through the second deep neural network model to obtain second image features of each image, and feature extraction is performed on each piece of music in a plurality of pieces of music through the third deep neural network model to obtain audio features of each piece of music.
In step S42, the first image features and the second image features of each image are matched by the image matching network model, and a template image matched with the input image is obtained.
In step S43, the first image feature and the audio feature of each piece of music are matched by the audio matching network model, resulting in template music matched with the input image.
In step S44, an album template is generated from the template image and the template music.
With this processing, the first image features of the input image are extracted, the second image features of each of the plurality of images are extracted, and the audio features of each of the plurality of pieces of music are extracted; a template image matched with the input image is obtained through the image matching network model; template music matched with the input image is obtained through the audio matching network model; and the album template is generated from the template image and the template music. The template image and the template music matched with the input image are thus each output by a dedicated matching network model, and the album template is then generated from them. Because independent matching network models are used, the matching accuracy of the selected template image and template music is higher than if a single unified matching network model performed the matching. This not only solves the problem in the related art that album templates are difficult to match, which leads to a poor user experience, but also achieves a double matching of the input image against both images and music, so that the generated album template better suits the input image. In addition, because the images and audio are matched directly against the input image, the method is freed from the material limitations of fixed album templates (i.e., the input image does not have to be matched against the images and audio inside pre-built album templates), achieving the effect of efficiently obtaining the required album template.
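A minimal sketch of steps S41 to S44 with separate image-matching and audio-matching models; all callables are hypothetical placeholders for the networks described above:
```python
import torch

def generate_album_template(input_image, images, musics, img_net_1, img_net_2,
                            audio_net, image_matcher, audio_matcher):
    f_input = img_net_1(input_image)                           # first image features (S41)
    img_scores = torch.stack(
        [image_matcher(f_input, img_net_2(img)) for img in images])
    music_scores = torch.stack(
        [audio_matcher(f_input, audio_net(m)) for m in musics])
    template_image = images[int(img_scores.argmax())]          # best-matching image (S42)
    template_music = musics[int(music_scores.argmax())]        # best-matching music (S43)
    return {"image": template_image, "music": template_music}  # album template (S44)
```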
Fig. 5 is a flowchart illustrating a fourth image processing method according to an exemplary embodiment, which includes the following steps, as shown in fig. 5.
In step S51, an input image is received through the interactive interface.
In step S52, displaying, on the interactive interface, a template image matched with the input image, where the template image is obtained by matching, through an image matching network model, a first image feature with a second image feature of each of the plurality of images, the first image feature is obtained by extracting features of the input image through a first deep neural network model, and the second image feature of each of the images is obtained by extracting features of each of the images through a second deep neural network model;
in step S53, displaying template music matched with the input image on the interactive interface, where the template music is obtained by matching the first image feature with the audio feature of each of the plurality of pieces of music through an audio matching network model, and the audio feature of each piece of music is obtained by feature extraction of each piece of music through a third deep neural network model;
in step S54, an album template is displayed on the interactive interface, wherein the album template is generated from the template image and the template music.
With this processing, the input image is received through the interactive interface, and the album template generated for the input image is displayed on that interface. When generating the album template for the input image, the first image features of the input image are extracted, along with the second image features of each of the plurality of images and the audio features of each of the plurality of pieces of music; a template image matched with the input image is obtained through the image matching network model; template music matched with the input image is obtained through the audio matching network model; and the album template is generated from the template image and the template music. The template image and the template music are thus each output by a dedicated matching network model, and the album template is generated from them; because independent matching network models are used, the matching accuracy is higher than with a single unified matching network model. This not only solves the problem in the related art that album templates are difficult to match, which leads to a poor user experience, but also achieves a double matching of the input image against both images and music, so that the generated album template better suits the input image. In addition, because the process of generating the album template is displayed on the interactive interface, the whole generation process is visualized (the template image used for the album template and the acquisition of the template music can be displayed intuitively), so the user can follow the matching process, which further improves the user experience.
Combining the above embodiments and optional embodiments, an optional implementation is provided below, taking as its example the matching of album templates with an input image.
In this embodiment, a suitable album template and background music can be automatically matched to the image queue input by the user, where the template images in the album template and the background music (i.e., the template music referred to above) are in one-to-one correspondence. Without requiring any additional annotation information, the method directly judges whether the template images and the background music in an album template match the input images, and can automatically match the template images and background music of a suitable album template to the user's input image queue, end to end. In addition, by designing the relevant algorithms appropriately, a good matching result can be achieved even on a smaller training data set.
Fig. 6 is a flowchart illustrating an image processing method five according to an exemplary embodiment, which includes the following processes, as shown in fig. 6:
(1) The image queue input by the user is fed into a deep network for extracting image features, and a feature queue for the image queue is extracted. The deep network for extracting image features may be a neural network pre-trained on a classification dataset (ImageNet). This network does not need to be updated by learning while the album template and background music are being matched. For a single image, the deep network for extracting image features outputs 2048-dimensional features.
(2) The music in the album template is fed into a deep network for extracting audio features, and the audio features of the music are extracted. This deep network may employ the musicnn network. The musicnn network can be trained on the large amount of available music data; it is a convolutional neural network that learns audio features from the log-mel spectrogram of the music, using very small convolution filters whose shapes are informed by musical domain knowledge. This network does not need to be updated by learning during music-image matching, and it outputs a 200-dimensional feature.
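A minimal sketch of this step is given below, assuming the extractor interface of the publicly released musicnn package; which internal layer serves as the 200-dimensional feature, and the mean-pooling over time, are assumptions not fixed by the text.

```python
import numpy as np
from musicnn.extractor import extractor  # open-source musicnn project

def extract_audio_features(audio_path):
    """Map one piece of template music to a 200-dim audio feature.

    Assumption: the 'penultimate' layer of the pretrained MTT_musicnn
    model is taken as the 200-dimensional feature, mean-pooled over
    the frame axis.
    """
    taggram, tags, features = extractor(
        audio_path, model='MTT_musicnn', extract_features=True)
    return np.mean(features['penultimate'], axis=0)  # shape: (200,)
```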
(3) The template images in the album template are fed into a deep network for extracting image features, and the image features of the template images are extracted. As in step (1), this network may be a neural network pre-trained on a classification dataset (e.g., ImageNet); it does not need to be updated by learning during music-image matching, and it outputs a 2048-dimensional feature per image.
(4) The input image features are combined with the template image features and with the audio features, respectively, to obtain two joint features; each joint feature is reduced in dimension through a Multi-Layer Perceptron (MLP) network, and matching prediction is then performed through a fully connected network. Fig. 7 is a diagram illustrating a matching network structure according to an exemplary embodiment. As shown in fig. 7, the inputs to the matching network are the feature queue of the input image, the template image feature queue of the album template, and the audio features of the template music of the album template; the output of the matching network is the matching value of the input image and the album template. The matching network learns this value through deep learning: it is obtained by adding the matching output of the input image queue against the background music to the matching output of the input image queue against the album template. The matching value represents the degree of match between the input image queue and the album template-background music pair; the larger the matching value, the higher the degree of match.
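The two-branch structure described in step (4) and fig. 7 might be sketched as follows; the mean-pooling of the feature queues and the MLP hidden sizes are illustrative assumptions, since the text fixes only the input dimensionalities (2048 and 200) and the overall layout.

```python
import torch
import torch.nn as nn

class MatchingNetwork(nn.Module):
    """Sketch of the two-branch matching network described above.

    Assumptions not fixed by the text: the image queues are mean-pooled
    into single vectors, and the MLP hidden sizes are chosen arbitrarily
    for illustration.
    """
    def __init__(self, img_dim=2048, audio_dim=200, hidden=256):
        super().__init__()
        # branch 1: input image features + template image features
        self.image_mlp = nn.Sequential(
            nn.Linear(img_dim + img_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.image_head = nn.Linear(hidden, 1)   # fully connected predictor
        # branch 2: input image features + template music features
        self.audio_mlp = nn.Sequential(
            nn.Linear(img_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.audio_head = nn.Linear(hidden, 1)

    def forward(self, input_feats, template_img_feats, audio_feat):
        q = input_feats.mean(dim=0)          # pool the input image queue
        t = template_img_feats.mean(dim=0)   # pool the template image queue
        img_score = self.image_head(self.image_mlp(torch.cat([q, t])))
        aud_score = self.audio_head(self.audio_mlp(torch.cat([q, audio_feat])))
        return img_score + aud_score         # matching value f(x)
```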
(5) The input image queue and every album template-background music pair are passed through the matching network to obtain a matching value for each pair, and the five album templates (with their background music) having the largest matching values are recommended to the user.
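Continuing the sketch above, a hypothetical ranking step scores the user's image queue against every album template-background music pair and keeps the top five; the template record layout ('image_features', 'audio_feature', 'name') is invented for illustration.

```python
import torch

def recommend_templates(model, input_feats, templates, top_k=5):
    """Score every album template-background music pair and return
    the top_k best matches for the user's input image queue."""
    scored = []
    with torch.no_grad():
        for tpl in templates:  # tpl: dict of precomputed features
            score = model(input_feats,
                          tpl['image_features'], tpl['audio_feature'])
            scored.append((score.item(), tpl['name']))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]
```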
Through the above processing, an album template that matches the input image well can be obtained.
To obtain an album template that matches the input image well, it is important to train the model of the matching network properly. Model training of the matching network is described below.
(1) Data set. The data set used by the matching network may be constructed from positive and negative samples. The album template data is provided by a music album application; each album template has the image data of the template and the audio data of one piece of background music, and album templates and background music are in one-to-one correspondence. Then, based on the "music albums" uploaded and produced by users, the image queue input by a user and the album template data selected by that user form the positive sample set, while the image queue input by a user and album template data the user did not select form the negative sample set.
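The pairing described here could be assembled as in the sketch below; the record fields and the one-negative-per-positive sampling are assumptions, since the text only states which pairs are positive and which are negative.

```python
import random

def build_sample_sets(user_albums, all_templates):
    """Form positive pairs (user's image queue, template the user chose)
    and negative pairs (same queue, a template the user did not choose).

    Field names ('images', 'template') and one-negative-per-positive
    sampling are hypothetical.
    """
    positives, negatives = [], []
    for album in user_albums:
        chosen = album['template']
        positives.append((album['images'], chosen))
        others = [t for t in all_templates if t is not chosen]
        negatives.append((album['images'], random.choice(others)))
    return positives, negatives
```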
(2) After encoding, the matching network combines the image feature queue with the music features to obtain the joint feature of the input image and the background music. Likewise, it combines the image feature queue with the image features of the album template to obtain the joint feature of the input image and the template images. Each joint feature is learned through an MLP network, yielding a matching prediction for the input image against the background music and a matching prediction for the input image against the album template; the two predictions are added to obtain the final output f(x) of the matching network, which expresses the matching value.
Fig. 8 is a diagram illustrating model training of a matching network according to an exemplary embodiment. As shown in fig. 8, the matching network performs multi-task learning by simultaneously learning a ranking (Rank) loss function and a binary classification loss function to optimize its performance.
During training, the inputs to the matching network are paired positive and negative samples: the positive sample set consists of input data where music and images match, and the negative sample set consists of samples where they do not. The training objective is for the output on positive samples to be as large as possible and the output on negative samples to be as small as possible. To this end, this optional embodiment designs a Rank loss function: the matching output of the negative sample is subtracted from that of the positive sample to obtain their difference, and since training favors large positive outputs and small negative outputs, a larger difference is better. The difference is normalized through a Sigmoid function, and the squared difference between this normalized value and 1 is taken as the Rank loss. The Rank loss function is as follows:
Rank_Loss = |Sigmoid(f(x_p) − f(x_n)) − 1|^2

where Rank_Loss is the value of the ranking loss function, f(x_p) is the matching value of a positive sample pair, and f(x_n) is the matching value of a negative sample pair.
A binary classification loss function is also used during training to further improve the matching network. Its calculation formula is as follows:
BCE_Loss = −y · log(Sigmoid(f(x))) − (1 − y) · log(1 − Sigmoid(f(x)))

where BCE_Loss is the value of the binary classification loss function, f(x) is the matching value of a sample, and y identifies whether the sample is positive or negative: y = 0 indicates a negative sample and y = 1 indicates a positive sample.
The matching network learns the Rank loss function and the binary classification loss function simultaneously, performing multi-task learning to optimize its performance.
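Under the two formulas above, the multi-task objective might be implemented as follows; equal weighting of the two losses is an assumption, since the text does not specify how they are combined.

```python
import torch

def rank_loss(f_pos, f_neg):
    # Rank_Loss = |Sigmoid(f(x_p) - f(x_n)) - 1|^2
    return (torch.sigmoid(f_pos - f_neg) - 1.0).pow(2).mean()

def bce_loss(f_x, y):
    # BCE_Loss = -y*log(Sigmoid(f(x))) - (1-y)*log(1-Sigmoid(f(x)))
    return torch.nn.functional.binary_cross_entropy_with_logits(f_x, y)

def multitask_loss(f_pos, f_neg, weight=1.0):
    # Assumption: the two losses are simply summed with equal weight.
    y_pos = torch.ones_like(f_pos)    # positive pairs labeled y = 1
    y_neg = torch.zeros_like(f_neg)   # negative pairs labeled y = 0
    bce = bce_loss(f_pos, y_pos) + bce_loss(f_neg, y_neg)
    return rank_loss(f_pos, f_neg) + weight * bce
```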
Through this optional implementation, deep networks are used to extract the features of the music and the images, and an MLP network is used to match music to images, achieving a good effect. Fig. 9 is a diagram illustrating the matching effect of a matching network according to an exemplary embodiment. The input image queue is a monkey-themed series of images, and the album templates include: template 1: "rapid sound of the cloud palace" music with a Monkey King album template; template 2: "blooming flowers and full moon" music with a "blossoming wealth and honor" album template; template 3: "time" music with a Father's Day album template; and so on. Through the matching process, the matching value of the input image queue with template 1 is 0.92, with template 2 is 0.05, with template 3 is 0.001, and so on. Since the input image queue has the highest matching value with template 1, template 1 is output as the matching result.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present disclosure.
Example 2
According to an embodiment of the present disclosure, there is also provided an apparatus for implementing the first image processing method, and fig. 10 is an apparatus block diagram of the first image processing apparatus shown according to an exemplary embodiment. Referring to fig. 10, the apparatus includes: a first extraction module 101, a first matching module 102 and a first determination module 103, which are described below.
The first extraction module 101 is used for extracting the characteristics of an input image through a first deep neural network model to obtain first image characteristics of the input image, extracting the characteristics of template images in a plurality of album templates through a second deep neural network model to obtain second image characteristics, and extracting the characteristics of template music in the plurality of album templates through a third deep neural network model to obtain audio characteristics; the first matching module 102 is connected to the first extracting module 101, and is configured to match the first image features with the second image features and the audio features of the plurality of album templates respectively through a matching network model, so as to obtain a plurality of matching values between the input image and the plurality of album templates respectively; and the first determining module 103 is connected to the first matching module 102 and configured to determine an album template matched with the input image according to the plurality of matching values.
As an alternative embodiment, the first matching module comprises: the first processing unit is used for respectively combining the first image characteristics with the second image characteristics of the plurality of album templates to obtain first combined characteristics; matching the first combined features through a first matching sub-network model to obtain first matching output values of the input image and template images of the plurality of album templates respectively; the second processing unit is used for respectively combining the first image characteristics with the audio characteristics of the plurality of album templates to obtain second combined characteristics; matching the second combination characteristics through a second matching sub-network model to obtain second matching output values of the input images and template music of the plurality of album templates respectively; the determining unit is used for determining a plurality of matching values of the input image and a plurality of album templates respectively according to the first matching output value and the second matching output value; wherein the matching network model comprises a first matching sub-network model and a second matching sub-network model.
As an optional embodiment, the apparatus further includes a training module, configured to obtain the matching network model by: obtaining a sample set, wherein the sample set comprises a positive sample set and a negative sample set, and the positive sample set comprises: a plurality of sets of positive sample pairs, the positive sample pairs comprising: historical input image and selected album template, the negative sample set includes: a plurality of sets of negative sample pairs, the negative sample pairs comprising: historical input images and unselected album templates; and training the initial network model by adopting a sample set to obtain a matching network model.
As an alternative embodiment, the training module comprises: a generating unit for generating a loss function of the initial network model; and the training unit is used for training the initial network model through the minimum loss function to obtain a matching network model.
As an alternative embodiment, the generating unit includes: and the generation subunit is used for generating a sequencing loss function of the initial network model and generating a binary classification loss function of the initial network model, wherein the sequencing loss function is determined according to the difference value between the matching value of the positive sample pair and the matching value of the negative sample pair, and the binary classification loss function is determined according to the predicted matching value of the positive sample pair and the negative sample pair.
It should be noted here that the first extracting module 101, the first matching module 102 and the first determining module 103 correspond to steps S21 to S23 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
According to an embodiment of the present disclosure, there is also provided an apparatus for implementing the second image processing method, and fig. 11 is an apparatus block diagram of the second image processing apparatus shown according to an exemplary embodiment. Referring to fig. 11, the apparatus includes: a first receiving module 111, a first display module 112 and a second display module 113, which will be described below.
A first receiving module 111, configured to receive an input image through an interactive interface; the first display module 112 is connected to the first receiving module 111, and configured to display, on an interactive interface, a plurality of matching values between an input image and a plurality of album templates, where the matching values between the input image and the plurality of album templates are obtained by matching, by a matching network model, a first image feature with a second image feature and an audio feature of the plurality of album templates, where the first image feature is obtained by performing feature extraction on the input image through a first deep neural network model, the second image feature is obtained by performing feature extraction on a template image in an album template through a second deep neural network model, and the audio feature is obtained by performing feature extraction on template music in an album template through a third deep neural network model; and a second display module 113, connected to the first display module 112, configured to highlight an album template matching the input image on the interactive interface, where the album template matching the input image is determined by multiple matching values corresponding to multiple album templates.
It should be noted that the first receiving module 111, the first displaying module 112 and the second displaying module 113 correspond to steps S31 to S33 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
According to an embodiment of the present disclosure, there is also provided an apparatus for implementing the third image processing method described above, and fig. 12 is an apparatus block diagram of the third image processing apparatus shown according to an exemplary embodiment. Referring to fig. 12, the apparatus includes: a second extracting module 121, a second matching module 122, a third matching module 123 and a generating module 124, which are explained below.
The second extraction module 121 is configured to perform feature extraction on the input image through the first deep neural network model to obtain a first image feature of the input image, perform feature extraction on each of the plurality of images through the second deep neural network model to obtain a second image feature of each of the plurality of images, and perform feature extraction on each of the plurality of pieces of music through the third deep neural network model to obtain an audio feature of each piece of music; a second matching module 122, connected to the second extracting module 121, configured to match the first image features with the second image features of each image through an image matching network model, so as to obtain a template image matched with the input image; a third matching module 123, connected to the second matching module 122, configured to match the first image feature with the audio feature of each piece of music through an audio matching network model, so as to obtain template music matched with the input image; and the generating module 124 is connected to the third matching module 123 and is configured to generate an album template according to the template image and the template music.
It should be noted here that the second extracting module 121, the second matching module 122, the third matching module 123 and the generating module 124 correspond to steps S41 to S44 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
According to an embodiment of the present disclosure, there is also provided an apparatus for implementing the above-described image processing method four, and fig. 13 is an apparatus block diagram of the image processing apparatus four shown according to an exemplary embodiment. Referring to fig. 13, the apparatus includes: a second receiving module 131, a third display module 132, a fourth display module 133 and a fifth display module 134, which will be described below.
A second receiving module 131, configured to receive an input image through an interactive interface; a third display module 132, connected to the second receiving module 131, configured to display a template image matched with the input image on the interactive interface, where the template image is obtained by matching, through an image matching network model, a first image feature with a second image feature of each of the multiple images, the first image feature is obtained by extracting features of the input image through a first deep neural network model, and the second image feature of each image is obtained by extracting features of each image through a second deep neural network model; a fourth display module 133, connected to the third display module 132, configured to display, on an interactive interface, template music matched with the input image, where the template music is obtained by matching, through an audio matching network model, the first image feature with an audio feature of each of the plurality of pieces of music, and the audio feature of each piece of music is obtained by performing feature extraction on each piece of music through a third deep neural network model; and a fifth display module 134, connected to the fourth display module 133, for displaying an album template on the interactive interface, where the album template is generated according to the template image and the template music.
It should be noted that the second receiving module 131, the third displaying module 132, the fourth displaying module 133 and the fifth displaying module 134 correspond to steps S51 to S54 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Example 3
The embodiment of the disclosure can provide an electronic device, which can be a terminal or a server. The terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the terminal may also be a terminal device such as a mobile terminal.
Optionally, in this embodiment, the terminal may be located in at least one network device of a plurality of network devices of a computer network.
Alternatively, fig. 14 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment. As shown in fig. 14, the terminal may include: one or more processors 141 (only one shown), a memory 142 for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the image processing method of any of the above.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the image processing method and apparatus in the embodiments of the disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the image processing method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on template images in a plurality of album templates through a second deep neural network model to obtain second image features, and performing feature extraction on template music in the plurality of album templates through a third deep neural network model to obtain audio features; matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively; and determining an album template matched with the input image according to the plurality of matching values.
Optionally, the processor may further execute the program code of the following steps: matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively, comprising: respectively combining the first image characteristics with second image characteristics of a plurality of album templates to obtain first combined characteristics; matching the first combined features through a first matching sub-network model to obtain first matching output values of the input image and template images of the plurality of album templates respectively; respectively combining the first image characteristics with the audio characteristics of the plurality of album templates to obtain second combined characteristics; matching the second combination characteristics through a second matching sub-network model to obtain second matching output values of the input images and template music of the plurality of album templates respectively; determining a plurality of matching values of the input image and a plurality of album templates respectively according to the first matching output value and the second matching output value; wherein the matching network model comprises a first matching sub-network model and a second matching sub-network model.
Optionally, the processor may further execute the program code of the following steps: obtaining a matching network model by adopting the following modes: obtaining a sample set, wherein the sample set comprises a positive sample set and a negative sample set, and the positive sample set comprises: a plurality of sets of positive sample pairs, the positive sample pairs comprising: historical input image and selected album template, the negative sample set includes: a plurality of sets of negative sample pairs, the negative sample pairs comprising: historical input images and unselected album templates; and training the initial network model by adopting a sample set to obtain a matching network model.
Optionally, the processor may further execute the program code of the following steps: training the initial network model by adopting a sample set to obtain a matching network model, comprising the following steps: generating a loss function of the initial network model; and training the initial network model through a minimum loss function to obtain a matching network model.
Optionally, the processor may further execute the program code of the following steps: generating a loss function for the initial network model, comprising: generating a ranking loss function of the initial network model, and generating a binary loss function of the initial network model, wherein the ranking loss function is determined according to a difference between a matching value of the positive sample pair and a matching value of the negative sample pair, and the binary loss function is determined according to a predicted matching value of the positive sample pair and the negative sample pair.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: receiving an input image through an interactive interface; displaying a plurality of matching values of an input image and a plurality of photo album templates respectively on an interactive interface, wherein the input image and the plurality of matching values of the plurality of photo album templates are obtained by respectively matching first image features with second image features and audio features of the plurality of photo album templates through a matching network model, the first image features are obtained by extracting the features of the input image through a first deep neural network model, the second image features are obtained by respectively extracting the features of template images in the plurality of photo album templates through a second deep neural network model, and the audio features are obtained by respectively extracting the features of template music in the plurality of photo album templates through a third deep neural network model; and highlighting the album template matched with the input image on the interactive interface, wherein the album template matched with the input image is determined by a plurality of matching values corresponding to a plurality of album templates.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on each image in a plurality of images through a second deep neural network model to obtain second image features of each image, and performing feature extraction on each music in a plurality of music through a third deep neural network model to obtain audio features of each music; matching the first image characteristics with the second image characteristics of each image through an image matching network model to obtain a template image matched with the input image; matching the first image characteristics with the audio characteristics of each piece of music through an audio matching network model to obtain template music matched with the input image; and generating the album template according to the template image and the template music.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: receiving an input image through an interactive interface; displaying a template image matched with the input image on an interactive interface, wherein the template image is obtained by matching a first image characteristic and a second image characteristic of each image in a plurality of images through an image matching network model, the first image characteristic is obtained by extracting the characteristics of the input image through a first deep neural network model, and the second image characteristic of each image is obtained by extracting the characteristics of each image through a second deep neural network model; displaying template music matched with the input image on the interactive interface, wherein the template music is obtained by matching the first image characteristic with the audio characteristic of each music in the plurality of music through an audio matching network model, and the audio characteristic of each music is obtained by extracting the characteristic of each music through a third deep neural network model; and displaying the album template on the interactive interface, wherein the album template is generated according to the template image and the template music.
An embodiment of the present disclosure may provide a server, and fig. 15 is a block diagram illustrating a structure of a server according to an exemplary embodiment. As shown in fig. 15, the server 150 may include: one or more (only one shown) processing components 151, a memory 152 for storing instructions executable by the processing components 151, a power supply component 153 for supplying power, a network interface 154 for communicating with an external network, and an input/output (I/O) interface 155 for data transmission with the outside; wherein the processing component 151 is configured to execute the instructions to implement any one of the image processing methods described above.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the image processing method and apparatus in the embodiments of the disclosure, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the image processing method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processing component can call the information and the application program stored in the memory through the transmission device to execute the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on template images in a plurality of album templates through a second deep neural network model to obtain second image features, and performing feature extraction on template music in the plurality of album templates through a third deep neural network model to obtain audio features; matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively; and determining an album template matched with the input image according to the plurality of matching values.
Optionally, the processing component may further execute program codes of the following steps: matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively, comprising: respectively combining the first image characteristics with second image characteristics of a plurality of album templates to obtain first combined characteristics; matching the first combined features through a first matching sub-network model to obtain first matching output values of the input image and template images of the plurality of album templates respectively; respectively combining the first image characteristics with the audio characteristics of the plurality of album templates to obtain second combined characteristics; matching the second combination characteristics through a second matching sub-network model to obtain second matching output values of the input images and template music of the plurality of album templates respectively; determining a plurality of matching values of the input image and a plurality of album templates respectively according to the first matching output value and the second matching output value; wherein the matching network model comprises a first matching sub-network model and a second matching sub-network model.
Optionally, the processing component may further execute program codes of the following steps: obtaining a matching network model by adopting the following modes: obtaining a sample set, wherein the sample set comprises a positive sample set and a negative sample set, and the positive sample set comprises: a plurality of sets of positive sample pairs, the positive sample pairs comprising: historical input image and selected album template, the negative sample set includes: a plurality of sets of negative sample pairs, the negative sample pairs comprising: historical input images and unselected album templates; and training the initial network model by adopting a sample set to obtain a matching network model.
Optionally, the processing component may further execute program codes of the following steps: training the initial network model by adopting a sample set to obtain a matching network model, comprising the following steps: generating a loss function of the initial network model; and training the initial network model through a minimum loss function to obtain a matching network model.
Optionally, the processing component may further execute program codes of the following steps: generating a loss function for the initial network model, comprising: generating a ranking loss function of the initial network model, and generating a binary loss function of the initial network model, wherein the ranking loss function is determined according to a difference between a matching value of the positive sample pair and a matching value of the negative sample pair, and the binary loss function is determined according to a predicted matching value of the positive sample pair and the negative sample pair.
The processing component can call the information and the application program stored in the memory through the transmission device to execute the following steps: receiving an input image through an interactive interface; displaying a plurality of matching values of an input image and a plurality of photo album templates respectively on an interactive interface, wherein the input image and the plurality of matching values of the plurality of photo album templates are obtained by respectively matching first image features with second image features and audio features of the plurality of photo album templates through a matching network model, the first image features are obtained by extracting the features of the input image through a first deep neural network model, the second image features are obtained by respectively extracting the features of template images in the plurality of photo album templates through a second deep neural network model, and the audio features are obtained by respectively extracting the features of template music in the plurality of photo album templates through a third deep neural network model; and highlighting the album template matched with the input image on the interactive interface, wherein the album template matched with the input image is determined by a plurality of matching values corresponding to a plurality of album templates.
The processing component can call the information and the application program stored in the memory through the transmission device to execute the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on each image in a plurality of images through a second deep neural network model to obtain second image features of each image, and performing feature extraction on each music in a plurality of music through a third deep neural network model to obtain audio features of each music; matching the first image characteristics with the second image characteristics of each image through an image matching network model to obtain a template image matched with the input image; matching the first image characteristics with the audio characteristics of each piece of music through an audio matching network model to obtain template music matched with the input image; and generating the album template according to the template image and the template music.
The processing component can call the information and the application program stored in the memory through the transmission device to execute the following steps: receiving an input image through an interactive interface; displaying a template image matched with the input image on an interactive interface, wherein the template image is obtained by matching a first image characteristic and a second image characteristic of each image in a plurality of images through an image matching network model, the first image characteristic is obtained by extracting the characteristics of the input image through a first deep neural network model, and the second image characteristic of each image is obtained by extracting the characteristics of each image through a second deep neural network model; displaying template music matched with the input image on the interactive interface, wherein the template music is obtained by matching the first image characteristic with the audio characteristic of each music in the plurality of music through an audio matching network model, and the audio characteristic of each music is obtained by extracting the characteristic of each music through a third deep neural network model; and displaying the album template on the interactive interface, wherein the album template is generated according to the template image and the template music.
It can be understood by those skilled in the art that the structures shown in fig. 14 and fig. 15 are only schematic, for example, the terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 14 and 15 do not limit the structure of the electronic device. For example, it may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 14, 15, or have a different configuration than shown in fig. 14, 15.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 4
In an exemplary embodiment, there is also provided a storage medium including instructions that, when executed by a processor of a terminal, enable the terminal to perform the image processing method of any one of the above. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Alternatively, in this embodiment, the storage medium may be configured to store program codes executed by the image processing method provided in embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on template images in a plurality of album templates through a second deep neural network model to obtain second image features, and performing feature extraction on template music in the plurality of album templates through a third deep neural network model to obtain audio features; matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively; and determining an album template matched with the input image according to the plurality of matching values.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively, comprising: respectively combining the first image characteristics with second image characteristics of a plurality of album templates to obtain first combined characteristics; matching the first combined features through a first matching sub-network model to obtain first matching output values of the input image and template images of the plurality of album templates respectively; respectively combining the first image characteristics with the audio characteristics of the plurality of album templates to obtain second combined characteristics; matching the second combination characteristics through a second matching sub-network model to obtain second matching output values of the input images and template music of the plurality of album templates respectively; determining a plurality of matching values of the input image and a plurality of album templates respectively according to the first matching output value and the second matching output value; wherein the matching network model comprises a first matching sub-network model and a second matching sub-network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining a matching network model by adopting the following modes: obtaining a sample set, wherein the sample set comprises a positive sample set and a negative sample set, and the positive sample set comprises: a plurality of sets of positive sample pairs, the positive sample pairs comprising: historical input image and selected album template, the negative sample set includes: a plurality of sets of negative sample pairs, the negative sample pairs comprising: historical input images and unselected album templates; and training the initial network model by adopting a sample set to obtain a matching network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: training the initial network model by adopting a sample set to obtain a matching network model, comprising the following steps: generating a loss function of the initial network model; and training the initial network model through a minimum loss function to obtain a matching network model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: generating a loss function for the initial network model, comprising: generating a ranking loss function of the initial network model, and generating a binary loss function of the initial network model, wherein the ranking loss function is determined according to a difference between a matching value of the positive sample pair and a matching value of the negative sample pair, and the binary loss function is determined according to a predicted matching value of the positive sample pair and the negative sample pair.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving an input image through an interactive interface; displaying a plurality of matching values of an input image and a plurality of photo album templates respectively on an interactive interface, wherein the input image and the plurality of matching values of the plurality of photo album templates are obtained by respectively matching first image features with second image features and audio features of the plurality of photo album templates through a matching network model, the first image features are obtained by extracting the features of the input image through a first deep neural network model, the second image features are obtained by respectively extracting the features of template images in the plurality of photo album templates through a second deep neural network model, and the audio features are obtained by respectively extracting the features of template music in the plurality of photo album templates through a third deep neural network model; and highlighting the album template matched with the input image on the interactive interface, wherein the album template matched with the input image is determined by a plurality of matching values corresponding to a plurality of album templates.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on each image in a plurality of images through a second deep neural network model to obtain second image features of each image, and performing feature extraction on each music in a plurality of music through a third deep neural network model to obtain audio features of each music; matching the first image characteristics with the second image characteristics of each image through an image matching network model to obtain a template image matched with the input image; matching the first image characteristics with the audio characteristics of each piece of music through an audio matching network model to obtain template music matched with the input image; and generating the album template according to the template image and the template music.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving an input image through an interactive interface; displaying a template image matched with the input image on an interactive interface, wherein the template image is obtained by matching a first image characteristic and a second image characteristic of each image in a plurality of images through an image matching network model, the first image characteristic is obtained by extracting the characteristics of the input image through a first deep neural network model, and the second image characteristic of each image is obtained by extracting the characteristics of each image through a second deep neural network model; displaying template music matched with the input image on the interactive interface, wherein the template music is obtained by matching the first image characteristic with the audio characteristic of each music in the plurality of music through an audio matching network model, and the audio characteristic of each music is obtained by extracting the characteristic of each music through a third deep neural network model; and displaying the album template on the interactive interface, wherein the album template is generated according to the template image and the template music.
In an exemplary embodiment, a computer program product is also provided, in which the computer program, when executed by a processor of a terminal, enables the terminal to perform the image processing method of any of the above.
The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present disclosure, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image processing method, comprising:
performing feature extraction on an input image through a first deep neural network model to obtain first image features of the input image, performing feature extraction on template images in a plurality of album templates through a second deep neural network model to obtain second image features, and performing feature extraction on template music in the plurality of album templates through a third deep neural network model to obtain audio features;
matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively;
and determining an album template matched with the input image according to the matching values.
2. The method of claim 1, wherein matching the first image features with the second image features and the audio features of the plurality of album templates through the matching network model to obtain the plurality of matching values between the input image and the plurality of album templates respectively comprises:
combining the first image features with the second image features of the plurality of album templates respectively to obtain first combined features, and matching the first combined features through a first matching sub-network model to obtain first matching output values between the input image and the template images of the plurality of album templates respectively;
combining the first image features with the audio features of the plurality of album templates respectively to obtain second combined features, and matching the second combined features through a second matching sub-network model to obtain second matching output values between the input image and the template music of the plurality of album templates respectively;
and determining the plurality of matching values between the input image and the plurality of album templates respectively according to the first matching output values and the second matching output values;
wherein the matching network model comprises the first matching sub-network model and the second matching sub-network model.
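Continuing the illustrative sketch for claim 2: f1, f2 and fa are the features from the previous sketch, and the equal-weight fusion of the two matching output values at the end is an invented rule, since the claim leaves the combination step open.

```python
# Illustrative sketch only; the sub-network design and fusion rule are
# assumptions, not the claimed models themselves.
import torch
import torch.nn as nn

class MatchingSubNet(nn.Module):
    """Scores a concatenated (input, template) feature pair into [0, 1]."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, combined):
        return self.net(combined).squeeze(-1)

image_match_net = MatchingSubNet()   # first matching sub-network model
audio_match_net = MatchingSubNet()   # second matching sub-network model

# First/second combined features: the input-image features concatenated
# with each template's image features and audio features, respectively.
combined_img   = torch.cat([f1.expand_as(f2), f2], dim=-1)   # (5, 256)
combined_audio = torch.cat([f1.expand_as(fa), fa], dim=-1)   # (5, 256)

s_img   = image_match_net(combined_img)    # first matching output values
s_audio = audio_match_net(combined_audio)  # second matching output values

# One plausible fusion of the two outputs into per-template matching values:
match_values  = 0.5 * s_img + 0.5 * s_audio                  # (5,)
best_template = int(match_values.argmax())                   # matched template
```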
3. The method of claim 1, wherein the matching network model is obtained by:
obtaining a sample set, wherein the sample set comprises a positive sample set and a negative sample set; the positive sample set comprises a plurality of positive sample pairs, each positive sample pair comprising a historical input image and an album template that was selected; and the negative sample set comprises a plurality of negative sample pairs, each negative sample pair comprising a historical input image and an album template that was not selected;
and training an initial network model using the sample set to obtain the matching network model.
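A plain-Python sketch of how the claim-3 sample set might be assembled from interaction logs; the history records and template identifiers below are invented for illustration.

```python
# Hypothetical interaction log: each record holds a historical input image,
# the album template the user selected, and the templates that were shown.
history = [
    {"image": "img_001.jpg", "selected": "tpl_3", "shown": ["tpl_1", "tpl_3", "tpl_7"]},
    {"image": "img_002.jpg", "selected": "tpl_7", "shown": ["tpl_2", "tpl_7"]},
]

positive_pairs, negative_pairs = [], []
for record in history:
    # Positive pair: historical input image + the selected album template.
    positive_pairs.append((record["image"], record["selected"]))
    # Negative pairs: the same image + every shown-but-unselected template.
    negative_pairs.extend(
        (record["image"], t) for t in record["shown"] if t != record["selected"]
    )
```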
4. The method of claim 3, wherein training an initial network model using the sample set to obtain the matching network model comprises:
generating a loss function for the initial network model;
and training the initial network model by minimizing the loss function to obtain the matching network model.
5. The method of claim 4, wherein generating the loss function for the initial network model comprises:
generating a ranking loss function of the initial network model, and generating a binary classification loss function of the initial network model, wherein the ranking loss function is determined according to a difference between the matching value of a positive sample pair and the matching value of a negative sample pair, and the binary classification loss function is determined according to the predicted matching values of the positive and negative sample pairs.
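One plausible reading of the claim-4 and claim-5 training objective, sketched below: a hinge-style ranking loss on the gap between positive-pair and negative-pair matching values, plus a binary cross-entropy term over the predicted matching values. The margin value and the equal weighting of the two terms are assumptions.

```python
# Illustrative, assumed formulation of the two losses; not the patent's
# exact definition, which the claims leave open.
import torch
import torch.nn.functional as F

def ranking_loss(s_pos, s_neg, margin=0.2):
    """Penalises negative pairs whose matching value is not at least
    `margin` below that of the paired positive (assumed hinge form)."""
    return F.relu(margin - (s_pos - s_neg)).mean()

def binary_loss(s_pos, s_neg):
    """Binary classification loss: positive pairs labelled 1, negatives 0."""
    scores = torch.cat([s_pos, s_neg])
    labels = torch.cat([torch.ones_like(s_pos), torch.zeros_like(s_neg)])
    return F.binary_cross_entropy(scores, labels)

# Stand-ins for predicted matching values of 8 positive/negative pairs.
s_pos = torch.rand(8, requires_grad=True)
s_neg = torch.rand(8, requires_grad=True)

loss = ranking_loss(s_pos, s_neg) + binary_loss(s_pos, s_neg)
loss.backward()   # train by minimising the combined loss, as in claim 4
```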
6. An image processing method, comprising:
receiving an input image through an interactive interface;
displaying, on the interactive interface, a plurality of matching values between the input image and a plurality of album templates respectively, wherein the plurality of matching values are obtained by matching first image features with second image features and audio features of the plurality of album templates through a matching network model, the first image features are obtained by performing feature extraction on the input image through a first deep neural network model, the second image features are obtained by performing feature extraction on template images in the plurality of album templates through a second deep neural network model, and the audio features are obtained by performing feature extraction on template music in the plurality of album templates through a third deep neural network model;
and highlighting, on the interactive interface, an album template matched with the input image, wherein the album template matched with the input image is determined according to the plurality of matching values corresponding to the plurality of album templates.
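A console stand-in for the claim-6 display-and-highlight flow; the template names and matching values below are invented.

```python
# Hypothetical matching values for the displayed album templates.
match_values = {"tpl_1": 0.41, "tpl_3": 0.87, "tpl_7": 0.52}

best = max(match_values, key=match_values.get)  # template to highlight
for name, score in sorted(match_values.items(), key=lambda kv: -kv[1]):
    marker = "  <-- matched album template (highlighted)" if name == best else ""
    print(f"{name}: {score:.2f}{marker}")
```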
7. An image processing apparatus, comprising:
the system comprises a first extraction module, a second extraction module and a third extraction module, wherein the first extraction module is used for extracting the characteristics of an input image through a first deep neural network model to obtain first image characteristics of the input image, the second extraction module is used for extracting the characteristics of template images in a plurality of album templates through a second deep neural network model to obtain second image characteristics, and the third extraction module is used for extracting the characteristics of template music in the plurality of album templates through a third deep neural network model to obtain audio characteristics;
the first matching module is used for matching the first image characteristics with second image characteristics and audio characteristics of a plurality of album templates through a matching network model to obtain a plurality of matching values of the input image and the plurality of album templates respectively;
and the first determining module is used for determining the album template matched with the input image according to the matching values.
8. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the image processing method of any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method of any of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the image processing method of any one of claims 1 to 6.
CN202110730097.1A (priority date 2021-06-29, filing date 2021-06-29): Image processing method, image processing device, electronic equipment and computer readable storage medium; status: Pending; published as CN113486903A (en)

Priority Applications (1)

Application Number: CN202110730097.1A; Priority Date: 2021-06-29; Filing Date: 2021-06-29; Title: Image processing method, image processing device, electronic equipment and computer readable storage medium


Publications (1)

Publication Number: CN113486903A (en); Publication Date: 2021-10-08

Family

ID: 77936746

Family Applications (1)

Application Number: CN202110730097.1A; Status: Pending; Publication: CN113486903A (en); Priority Date: 2021-06-29; Filing Date: 2021-06-29; Title: Image processing method, image processing device, electronic equipment and computer readable storage medium

Country Status (1)

Country: CN; Publication: CN113486903A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102256030A (en) * 2010-05-20 2011-11-23 TCL Corporation Photo album showing system capable of matching background music and background matching method thereof
CN103488749A (en) * 2013-09-24 2014-01-01 Changsha Yubang Software Development Co., Ltd. Network electronic album display method and system
CN105488156A (en) * 2015-11-30 2016-04-13 Guangzhou Yike Imaging Technology Co., Ltd. Method for automatically selecting electronic album template and generating electronic album
CN108108485A (en) * 2018-01-09 2018-06-01 Shenzhen Transsion Communication Co., Ltd. Album creating method and terminal
CN109063558A (en) * 2018-06-27 2018-12-21 Nubia Technology Co., Ltd. Image classification processing method, mobile terminal and computer readable storage medium
CN111010611A (en) * 2019-12-03 2020-04-14 Beijing Dajia Internet Information Technology Co., Ltd. Electronic album obtaining method and device, computer equipment and storage medium
CN111062871A (en) * 2019-12-17 2020-04-24 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device, computer equipment and readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Jiepan: "Introduction to Artificial Intelligence" (《人工智能导论》), pages 179-181 *

Similar Documents

Publication Title
CN111625635B (en) Question-answering processing method, device, equipment and storage medium
US10210865B2 (en) Method and apparatus for inputting information
CN111768231B (en) Product information recommendation method and device
CN103699530A (en) Method and equipment for inputting texts in target application according to voice input information
CN110009059B (en) Method and apparatus for generating a model
CN105453578A (en) Apparatus, server, and method for providing conversation topic
WO2021121296A1 (en) Exercise test data generation method and apparatus
US10567330B2 (en) Dynamic comment methods and systems
CN111223015B (en) Course recommendation method and device and terminal equipment
CN112084315A (en) Question-answer interaction method, device, storage medium and equipment
CN111984803B (en) Multimedia resource processing method and device, computer equipment and storage medium
CN110516749A (en) Model training method, method for processing video frequency, device, medium and calculating equipment
CN105718527A (en) Method and device for generating review examination questions
CN111897915B (en) Question-answering device and answer information determination method
CN111191133A (en) Service search processing method, device and equipment
CN111738010A (en) Method and apparatus for generating semantic matching model
CN112668300B (en) Formula editing method, device, computer equipment and storage medium
CN108921138B (en) Method and apparatus for generating information
CN112783779B (en) Method and device for generating test case, electronic equipment and storage medium
CN105912510A (en) Method and device for judging answers to test questions and well as server
CN104090915B (en) Method and device for updating user data
CN113486903A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN109460458B (en) Prediction method and device for query rewriting intention
CN114943273A (en) Data processing method, storage medium, and computer terminal
CN111125501B (en) Method and device for processing information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination