CN110163121B - Image processing method, device, computer equipment and storage medium


Info

Publication number
CN110163121B
Authority
CN
China
Prior art keywords
image
images
word vectors
semantic features
target
Prior art date
Legal status
Active
Application number
CN201910360905.2A
Other languages
Chinese (zh)
Other versions
CN110163121A
Inventor
缪畅宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910360905.2A priority Critical patent/CN110163121B/en
Publication of CN110163121A publication Critical patent/CN110163121A/en
Application granted granted Critical
Publication of CN110163121B publication Critical patent/CN110163121B/en


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F40/00 Handling natural language data
                    • G06F40/20 Natural language analysis
                        • G06F40/279 Recognition of textual entities
                            • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
                    • G06F40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V10/00 Arrangements for image or video recognition or understanding
                    • G06V10/40 Extraction of image or video features
                • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
                    • G06V30/10 Character recognition
                    • G06V30/40 Document-oriented image-based pattern recognition
                        • G06V30/41 Analysis of document content
                            • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Abstract

The invention discloses an image processing method, an image processing apparatus, a computer device, and a storage medium, belonging to the technical field of networks. The method comprises the following steps: acquiring a plurality of pieces of context information for a plurality of images; inputting the images and the context information into a language model, and extracting features of the images through the language model and the context information to obtain semantic features of the images; and performing image processing based on the semantic features of the images. Because the semantic features of an image are extracted through a language model, image processing tasks in scenarios with high demands on image semantics can be completed, and the accuracy of image processing is improved.

Description

Image processing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of network technologies, and in particular, to an image processing method, an image processing device, a computer device, and a storage medium.
Background
In social interaction, an image can convey the semantics a user wishes to express more vividly and engagingly than text. As computer devices have developed, they can help users understand images; that is, a computer device can extract the features of an image, thereby assisting users in performing operations such as replying to image messages or image development more efficiently.
In a conventional feature extraction method, the computer device generally extracts surface, pixel-level features of an image through a VGG (Visual Geometry Group) network. A VGG network is a model trained for specific scenarios (such as image classification, image segmentation, or image recognition), so during feature extraction it can only extract the surface features those scenarios care about. For example, a VGG segmentation network extracts pixel-level boundary information for each segmented region (such as breast segmentation), and a VGG classification network extracts the pixel-level category label to which an image belongs (such as cat-versus-dog classification).
In this process, the VGG network can only extract surface, pixel-level features of the image, so the image semantics cannot be adequately understood, and the accuracy of image processing is low in scenarios with high demands on image semantics.
Disclosure of Invention
Embodiments of the present invention provide an image processing method, an image processing apparatus, a computer device, and a storage medium, which can solve the problem that a computer device cannot adequately understand image semantics, leading to low image processing accuracy in scenarios with high demands on image semantics. The technical solution is as follows:
In one aspect, there is provided an image processing method, the method including:
acquiring a plurality of pieces of context information of a plurality of images, wherein each piece of context information is at least one item of text information before or after the position of an image in a text scene;
inputting the images and the context information into a language model, and extracting features of the images through the language model and the context information to obtain semantic features of the images;
and performing image processing based on the semantic features of the plurality of images.
In one possible implementation, obtaining the plurality of second initial word vectors includes:
and embedding the context information, and acquiring the pre-trained word vectors as the second initial word vectors.
In one aspect, there is provided an image processing apparatus including:
an acquisition module, configured to acquire a plurality of pieces of context information of a plurality of images, wherein each piece of context information is at least one item of text information before or after the position of an image in a text scene;
the feature extraction module is used for inputting the images and the context information into a language model, and extracting features of the images through the language model and the context information to obtain semantic features of the images;
And the image processing module is used for processing the images based on the semantic features of the images.
In one possible implementation manner, the feature extraction module includes:
an acquisition unit configured to acquire a plurality of first initial word vectors, the plurality of first initial word vectors corresponding to the plurality of images;
the acquiring unit is further configured to acquire a plurality of second initial word vectors, where the plurality of second initial word vectors correspond to the plurality of context information;
the iterative training unit is used for carrying out iterative training on the language model based on the first initial word vectors and the second initial word vectors;
and the obtaining unit is used for obtaining the semantic features of the plurality of images when the loss function value is smaller than a target threshold value or the iteration number reaches a target number.
In one possible implementation manner, the acquiring unit is configured to:
inputting the plurality of images into a pixel feature extraction model, and extracting pixel features of the plurality of images through the pixel feature extraction model;
clustering the plurality of images according to the pixel characteristics of the plurality of images to obtain category labels of the plurality of images;
images with the same category labels are assigned the same random word vector as the first initial word vector.
In one possible implementation manner, the acquiring unit is configured to:
when a first image is included in the plurality of images, extracting text from the first image, performing embedding processing on the text to obtain word vectors of at least one word in the text, and acquiring an average vector of the word vectors of the at least one word as a first initial word vector corresponding to the first image, wherein the first image is an image carrying the text;
when a second image is included in the plurality of images, acquiring the random word vector as a first initial word vector corresponding to the second image, wherein the second image is an image without text.
In one possible implementation manner, the acquiring unit is configured to:
and embedding the context information, and acquiring the pre-trained word vectors as the second initial word vectors.
In one possible implementation manner, the iterative training unit is configured to:
in the process of carrying out iterative training on the language model, keeping the plurality of second initial word vectors unchanged, and adjusting the numerical values of the plurality of first initial word vectors to obtain a plurality of first word vectors;
When the loss function value is smaller than a target threshold or the iteration number reaches a target number, obtaining semantic features of the plurality of images includes:
and determining the first word vectors as semantic features of the images when the loss function value is smaller than a target threshold value or the iteration number reaches a target number.
In one possible implementation, the image processing module includes:
the storage processing unit is used for storing semantic features of the images into a database according to the image identifications or the category identifications of the images, acquiring the semantic features of the target images from the database when receiving an image processing instruction carrying the target images, and carrying out image processing based on the semantic features of the target images.
In one possible implementation, the storage processing unit is configured to:
when the image processing instruction also carries the image identifier of the target image, determining semantic features corresponding to the image identifier in the database as the semantic features of the target image; or,
clustering the target image, acquiring a category identifier of the target image, and determining semantic features corresponding to the category identifier in the database as the semantic features of the target image.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to implement operations performed by an image processing method as any of the possible implementations described above.
In one aspect, a computer-readable storage medium is provided, having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the image processing method in any of the possible implementations described above.
The technical solutions provided by the embodiments of the present invention have at least the following beneficial effects:
By acquiring a plurality of pieces of context information for a plurality of images, where each piece of context information is at least one item of text information before or after the position of an image in a text scene, the images and the context information can be input into a language model, and under the influence of the context information, features of the images can be extracted through the language model, so that the semantic features of the images are obtained.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic view of an implementation environment of an image processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an image processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a clustering result provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a language model training process provided by an embodiment of the present invention;
FIG. 5 is a schematic illustration of a language model training process provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a database store provided by an embodiment of the present invention;
fig. 7 is a schematic structural view of an image processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation environment of an image processing method according to an embodiment of the present invention. Referring to fig. 1, at least one terminal 101 and a server 102 may be included in the implementation environment, as described in detail below:
The at least one terminal 101 may be any terminal capable of sending text or images; after logging in on any of these terminals, a user can send text or images to the server 102.
The server 102 may be any computer device capable of providing an image processing service. When the server 102 receives an image from any of the at least one terminal 101, it can obtain the semantic features of the image and perform image processing based on those semantic features.
Embodiments of the present invention can be applied to human-computer interaction scenarios. In social interaction, users increasingly tend to replace text messages with expression images, which convey their meaning vividly and make the interaction more engaging. Therefore, when a user exchanges messages with intelligent question-answering products such as chat robots, intelligent assistants, or intelligent customer service through a terminal, there is also a need to convey meaning through expression images. After the user sends an expression image through the terminal, the server 102 can extract the semantic features of the expression image and perform corresponding image processing. For example, the server 102 can recommend, from a database, the response image with the highest matching degree to the expression image, and then send that response image to the terminal.
Fig. 2 is a flowchart of an image processing method according to an embodiment of the present invention. Referring to fig. 2, the method is applied to a computer device, and the following details about this embodiment are given by taking the computer device as a server as an example:
201. the server inputs the plurality of images into a pixel feature extraction model, and extracts pixel features of the plurality of images through the pixel feature extraction model.
The images may be images of any content. For example, they may include expression images or non-expression images. An expression image is an image that expresses an idea during human-computer interaction; it may carry text, and can be further divided into portrait expressions, animal expressions, cartoon expressions, and the like.
The pixel feature extraction model is used for extracting pixel features of an image, wherein the pixel features refer to surface features of the image at a pixel level, namely visual features such as textures, colors, shapes or boundaries which are visually presented by the image.
In some embodiments, the pixel feature extraction model may be a CNN (convolutional neural network), a TCN (temporal convolutional network), a VGG (Visual Geometry Group) network, or the like.
Taking a CNN as the pixel feature extraction model for illustration: the CNN may include an input layer, at least one convolutional layer, and an output layer connected in series. The input layer decodes the input image, the at least one convolutional layer convolves the decoded image, and the output layer applies nonlinear processing and normalization to the convolved features. In some embodiments, at least one pooling layer may also be introduced between the convolutional layers; a pooling layer compresses the feature map output by the preceding convolutional layer, thereby reducing its size.
In some embodiments, residual connections may be employed between the convolutional layers. That is, for each convolutional layer, a feature map output by an earlier convolutional layer can be added to the corresponding feature map output by the current layer to obtain a residual block, and the residual block is used as one of the feature maps input to the next convolutional layer, which alleviates the degradation problem of the network. For example, a residual connection may be made every two convolutional layers.
In the above case, step 201 is as follows: the server inputs the images into a pixel feature extraction model in the form of a convolutional neural network, and the images are convolved by at least one convolutional layer in the network to output the pixel features of the images. Of course, when the pixel feature extraction model is a temporal convolutional network, a similar process applies, except that each convolutional layer in the network performs causal convolution on the images; details are not repeated here.
In some embodiments, the pixel feature extraction model may also be a VGG network (a special CNN). The VGG network includes a plurality of convolutional layers and a plurality of pooling layers: each convolutional layer uses a small 3x3 convolution kernel, each pooling layer uses a 2x2 max-pooling kernel, and every two convolutional layers are connected by a residual connection, so that as the VGG network deepens, the image size is halved and the depth is doubled, which simplifies the network structure. For example, the VGG network may be VGG-16; the number of layers of the VGG network is not specifically limited in the embodiments of the present invention.
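For illustration only, the following Python sketch shows one way to realize the pixel feature extraction of step 201 with a pre-trained VGG16 backbone. The use of torchvision (version 0.13 or later is assumed for the weights API), the input size, and the flattened feature dimension are assumptions of this sketch, not requirements of the embodiment.

```python
# Illustrative sketch of step 201 (assumes torchvision >= 0.13).
import torch
from PIL import Image
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
feature_extractor = vgg.features  # convolutional layers only

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def pixel_features(image_path):
    """Return a flat pixel-level feature vector for one image."""
    img = Image.open(image_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)      # (1, 3, 224, 224)
    with torch.no_grad():
        fmap = feature_extractor(x)       # (1, 512, 7, 7)
    return fmap.flatten(1).squeeze(0)     # (25088,)
```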
202. The server performs clustering processing on the plurality of images according to their pixel features to obtain category labels of the plurality of images.
In the above process, the server may perform clustering processing on the plurality of images based on a KNN (K-nearest neighbor) algorithm through a plurality of similarities corresponding to pixel features of the plurality of images, to obtain class labels of the plurality of images.
In some embodiments, the server may construct a KNN model based on a training image set that includes a plurality of training images, each with a pixel feature and a category label. In cases where the required training accuracy is not high, the number of category labels in the training image set may be set smaller than a first target number, so that clustering completes quickly; in cases where the required accuracy is high, the number of category labels may be set greater than or equal to the first target number, so that the categories to which the images belong are divided more finely during clustering. The first target number may be any number greater than or equal to 1.
In the above case, the server may sequentially input the pixel features of the plurality of images into the KNN model, obtain a plurality of similarities between the pixel features of the plurality of images and the pixel features of the plurality of training images through the KNN model, and obtain class labels of the plurality of images according to the plurality of similarities.
Specifically, taking any one of the images as an example: after the server inputs the image into the KNN model, a plurality of similarities between the image and the training images in the training image set are obtained based on the KNN model; the training images are sorted in descending order of similarity; and among the category labels of the top second-target-number training images, the category label with the highest occurrence frequency is determined as the category label of the image. The second target number may be any number greater than or equal to 1.
In some embodiments, the server may take the inverse of the Euclidean distance between the pixel feature of an image and that of a training image as their similarity. Since the Euclidean distance measures the absolute distance between different pixel features in the feature space, its inverse describes the similarity between the two pixel features well.
In some embodiments, the server may instead take the inverse of the Manhattan distance between the pixel feature of an image and that of a training image as their similarity. Since the Manhattan distance measures the absolute distance along the coordinate axes between different image features in the feature space, its inverse can also describe the similarity between the two pixel features well.
For example, the pixel feature of an image P is input into the KNN model, whose training image set includes 20 training images. The server determines the inverses of the Euclidean distances between the pixel feature of image P and the pixel features of the 20 training images as the 20 corresponding similarities, sorts the 20 training images in descending order of similarity, and takes the top 5. Since 4 of the top 5 training images belong to category label A and only 1 belongs to category label B, the most frequent category label A is determined as the category label of image P.
In some embodiments, each time the server obtains the category label of an image, it may also add the pixel feature and the category label of that image to the training image set of the KNN model, so that the training set is continuously expanded while the images are clustered, improving the clustering accuracy of the KNN model.
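A minimal sketch of the clustering of step 202 as just described: similarity is the inverse of the Euclidean distance, the k most similar training images vote on the label, and each newly labelled image is appended to the training set. The variable names and the choice k=5 are illustrative assumptions.

```python
# Illustrative sketch of KNN-based clustering (step 202); not the only
# possible realization of this embodiment.
from collections import Counter
import numpy as np

def knn_label(feat, train_feats, train_labels, k=5, eps=1e-8):
    """Majority vote over the k training images most similar to `feat`."""
    dists = np.linalg.norm(train_feats - feat, axis=1)
    sims = 1.0 / (dists + eps)            # inverse Euclidean distance
    top_k = np.argsort(sims)[::-1][:k]    # k highest similarities
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]

def cluster_images(feats, train_feats, train_labels, k=5):
    train_feats = np.asarray(train_feats, dtype=np.float32)
    train_labels = list(train_labels)
    labels = []
    for feat in feats:
        label = knn_label(feat, train_feats, train_labels, k)
        labels.append(label)
        # Expand the training set with the newly labelled image.
        train_feats = np.vstack([train_feats, feat])
        train_labels.append(label)
    return labels
```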
Fig. 3 is a schematic diagram of a clustering result provided by the embodiment of the present invention, referring to fig. 3, assuming that a training image set of a KNN model includes category labels "zel", "no language" and "look-up", after 8 images are input into the KNN model, the clustering result shown in fig. 3 may be obtained.
203. The server assigns the same random word vector as the first initial word vector to images having the same category label.
The random word vector can be any randomly generated word vector, and the first initial word vector is a word vector obtained by initializing any image.
Through the foregoing steps 201 to 203, the server acquires a plurality of first initial word vectors corresponding to the plurality of images. Moreover, the server assigns the same first initial word vector to images with the same category label, so that those images share the first initial word vector and hence share parameter adjustments during the subsequent iterative training, with the result that images with the same category label end up with the same semantic features.
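The per-category sharing of one initial word vector (step 203) can be sketched as follows; the vector dimension of 300 and the random initialization scale are assumptions of the sketch.

```python
# Illustrative sketch of step 203: one shared random vector per category.
import numpy as np

rng = np.random.default_rng(seed=0)
DIM = 300  # assumed word vector dimension

def assign_first_initial_vectors(image_labels):
    """image_labels: dict mapping image id -> category label."""
    category_vecs, image_vecs = {}, {}
    for img_id, label in image_labels.items():
        if label not in category_vecs:
            category_vecs[label] = rng.normal(scale=0.1, size=DIM)
        # Same label -> same array object, so later parameter updates
        # are shared across the whole category.
        image_vecs[img_id] = category_vecs[label]
    return category_vecs, image_vecs
```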
In this process, the server first clusters the images and then initializes the word vectors based on the category labels. Therefore, even if some images appear in only a few dialogues and their context information is not rich enough, the clustering method can still improve the accuracy of their semantic features, while also reducing the training time and the computational cost of the subsequent language model training.
It should be noted that the semantic features extracted in the embodiments of the present invention are vector representations of the non-visual meaning expressed by the image as a whole at the semantic level, not vector representations of the visual features expressed by a traditional image at the pixel level.
In some embodiments, the server may skip steps 201-203, that is, skip the clustering process, and instead initialize a word vector for each image directly. In this case, steps 201-203 may be replaced by the following method: when the images include a first image (an image carrying text), text is extracted from the first image and embedded to obtain word vectors of at least one word in the text, and the average of those word vectors is taken as the first initial word vector of the first image; optionally, when the images include a second image (an image carrying no text), the server may obtain a random word vector as the first initial word vector of the second image. In this way, the first and second images are each initialized in a targeted manner, which optimizes the processing logic of the word vector initialization and shortens the training time.
In the above process, OCR (optical character recognition) technology may be used to identify the text in the images. Because the server performs no clustering here, the semantic features of each individual image can be trained during the iterative training process, making the semantic features more specific to each image.
For example, Fig. 4 is a schematic diagram of a language model training process provided by an embodiment of the present invention. Referring to Fig. 4, assume that text 401, image 402, image 403, and text 404 form a session between user A and user B, and the session is treated as one long text carrying images when training the language model. For text 401 and text 404, pre-trained word vectors are obtained as second initial word vectors based on the method in step 204 below. For image 402, which is a first image carrying text, the server extracts the "look-up" text in image 402 based on OCR technology, obtains the word vectors of the words in that text, and determines their average vector as the first initial word vector of image 402. For image 403, which is a second image carrying no text, the server can directly obtain any random word vector as its first initial word vector. In this way, the server initializes all texts and images in the long text based on word embedding, and then performs the iterative training of step 205.
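The clustering-free initialization just described can be sketched as follows. Here `ocr` stands in for any OCR engine and `word_vectors` for a pre-trained embedding table; both are hypothetical placeholders, not APIs prescribed by the embodiment.

```python
# Illustrative sketch of text-aware initialization of first initial
# word vectors; `ocr` and `word_vectors` are hypothetical stand-ins.
import numpy as np

DIM = 300
rng = np.random.default_rng(seed=0)

def init_image_vector(image, ocr, word_vectors):
    text = ocr(image)  # hypothetical OCR call returning a string
    words = [w for w in text.split() if w in word_vectors]
    if words:  # "first image": carries recognizable text
        return np.mean([word_vectors[w] for w in words], axis=0)
    return rng.normal(scale=0.1, size=DIM)  # "second image": no text
```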
204. The server acquires a plurality of pieces of context information of the plurality of images, performs embedding processing on the context information, and acquires pre-trained word vectors as the plurality of second initial word vectors.
The text scene may be a conversation scene, in which case the context information is at least one item of conversation text before or after the position of an image in the conversation; or it may be a long-text scene carrying images, in which case the context information is at least one item of text before or after the position of an image in the long text.
In step 204, the server obtains a plurality of second initial word vectors corresponding to the pieces of context information. In some embodiments, the embedding process is as follows: for each piece of context information, the server multiplies its one-hot code by a pre-trained weight matrix, thereby mapping the one-hot code into the word vector space to obtain a pre-trained word vector, which is determined as the second initial word vector of that context information. Using pre-trained word vectors shortens the time required for training.
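Because multiplying a one-hot code by a weight matrix simply selects one row of that matrix, the embedding of step 204 reduces to a table lookup, as the following sketch shows; the tiny vocabulary and the random matrix standing in for a pre-trained one are illustrative assumptions.

```python
# Illustrative sketch of step 204: one-hot x weight matrix == row lookup.
import numpy as np

def embed_tokens(tokens, vocab, weight_matrix):
    """vocab: token -> row index; weight_matrix: (|V|, dim), pre-trained."""
    return np.stack([weight_matrix[vocab[t]] for t in tokens])

vocab = {"hello": 0, "world": 1}                     # toy vocabulary
W = np.random.default_rng(1).normal(size=(2, 300))   # stand-in for a
                                                     # pre-trained matrix
second_initial_vectors = embed_tokens(["hello", "world"], vocab, W)
```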
205. The server iteratively trains the language model based on the plurality of first initial word vectors and the plurality of second initial word vectors.
In the above process, the language model (LM) may be any natural language processing (NLP) model. For example, the language model may be an N-gram model, an NNLM (neural network language model), ELMo (Embeddings from Language Models), or BERT (Bidirectional Encoder Representations from Transformers); the structure of the language model is not specifically limited in the embodiments of the present invention.
During iterative training, the server inputs the first initial word vectors and the second initial word vectors into the language model and obtains a loss function value from the model's prediction. When the loss function value is greater than a target threshold, the server can adjust the model's parameters based on the backpropagation (BP) algorithm, and this process is iterated until the loss function value is less than or equal to the target threshold or the number of iterations reaches a target number, at which point training stops.
In the above process, the text word vectors (the second initial word vectors) and the image word vectors (the first initial word vectors) are input into the language model together during training, so the language model can extract deep, semantic-level features of the images using processing logic similar to extracting the semantic features of text. The constraint of pixel features is thus removed, the processing barrier between text and images is broken, and the semantic features of the images are obtained.
In some embodiments, during iterative training of the language model, the server may keep the second initial word vectors unchanged and adjust only the values of the first initial word vectors. This ensures that, when training stops, the resulting first word vectors and the second initial word vectors lie in the same vector space; that is, the semantic features of the images and the semantic features of the context information lie in the same feature space. Based on this controlled-variable method, the server gives the semantic features of the images a more accurate semantic expression.
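This controlled-variable training can be sketched in PyTorch as follows: the text (second initial) word vectors are frozen and excluded from the optimizer, while the image (first initial) word vectors are trained. The toy language-model head, vocabulary sizes, and stopping constants are all assumptions; the real model could be any of the LMs named in step 205.

```python
# Illustrative sketch of step 205 with frozen text word vectors.
import torch
import torch.nn as nn

V_TEXT, V_IMG, DIM = 1000, 50, 300
text_emb = nn.Embedding(V_TEXT, DIM)    # second initial word vectors
image_emb = nn.Embedding(V_IMG, DIM)    # first initial word vectors
text_emb.weight.requires_grad_(False)   # keep text vectors unchanged

# Toy LM head: predict the next text token from the sequence mean.
head = nn.Linear(DIM, V_TEXT)
optimizer = torch.optim.Adam(
    list(image_emb.parameters()) + list(head.parameters()), lr=1e-3)

# Toy batch: [text, text, image] sequences and next-token targets.
text_ids = torch.randint(0, V_TEXT, (8, 2))
img_ids = torch.randint(0, V_IMG, (8, 1))
targets = torch.randint(0, V_TEXT, (8,))

target_threshold, target_number = 0.01, 1000
for step in range(target_number):
    seq = torch.cat([text_emb(text_ids), image_emb(img_ids)], dim=1)
    loss = nn.functional.cross_entropy(head(seq.mean(dim=1)), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < target_threshold:  # stop on loss or iteration count
        break

semantic_features = image_emb.weight.detach()  # the first word vectors
```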
Fig. 5 is a schematic diagram of a language model training process provided by an embodiment of the present invention. Referring to Fig. 5, assume again that text 401, image 402, image 403, and text 404 form a session between user A and user B, treated as one long text carrying images when training the language model. For text 401 and text 404, pre-trained word vectors are obtained as second initial word vectors. For image 402 and image 403, their cluster category numbers (for example, the category labels provided in the embodiments of the present invention, or identifiers mapped from those category labels) are obtained first, two random word vectors corresponding to the respective cluster category numbers are obtained, and these are taken as the first initial word vectors of image 402 and image 403, respectively. In this way, the server initializes all texts and images in the long text based on word embedding, and iteratively trains the language model on the first and second initial word vectors until the loss function value is less than the target threshold or the number of iterations reaches the target number.
206. When the loss function value is smaller than the target threshold value or the iteration number reaches the target number, the server determines a plurality of first word vectors as semantic features of a plurality of images.
The first word vectors are word vectors obtained by carrying out parameter adjustment on the first initial word vectors. Wherein the target threshold may be any value greater than or equal to 0.
It should be noted that the first initial word vectors and the second initial word vectors are themselves parameters of the language model, so all of them undergo parameter adjustment during iterative training. Training stops when the loss function value is less than the target threshold or the number of iterations reaches the target number; at that point, the parameter-adjusted first initial word vectors may be called the first word vectors, and the parameter-adjusted second initial word vectors may be called the second word vectors.
In the above process, as the language processing effect of the language model improves, the first and second initial word vectors become more and more accurate; that is, the first word vectors express the semantics of the images better and better, and the second word vectors express the semantics of the context information better and better. Therefore, when training stops, the first word vectors can be determined as the semantic features of the images, and the second word vectors as the semantic features of the context information.
In steps 201-206, the server inputs the images and the context information into the language model and extracts features of the images through the language model and the context information to obtain the semantic features of the images. During feature extraction, the server adjusts the first and second initial word vectors through the iterative training of the language model, so that when the loss function value is less than the target threshold or the number of iterations reaches the target number, the semantic features of the images and of the context information are obtained. This breaks the barrier between text and images and extracts deep, semantic-level features of the images.
In the following, ELMo is taken as an example of the language model. ELMo contains a bidirectional LSTM (long short-term memory) language model, that is, a forward LSTM and a backward LSTM. The server inputs the first and second initial word vectors into the forward LSTM, which extracts the semantic features of the images and the context information in the forward direction, and into the backward LSTM, which extracts them in the backward direction. Further, the maximum likelihood estimates (MLE) of the forward and backward LSTMs can be taken as the loss, and training stops when the loss is less than or equal to the target threshold, so that the semantic features of the images are expressed more accurately.
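An ELMo-style bidirectional language model of the kind just described can be sketched as follows; the single-layer LSTMs and the dimensions are simplifying assumptions of the sketch.

```python
# Illustrative sketch of a bidirectional LSTM language model (ELMo-style).
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.fwd = nn.LSTM(dim, dim, batch_first=True)  # left-to-right
        self.bwd = nn.LSTM(dim, dim, batch_first=True)  # right-to-left
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, vecs):
        # vecs: (batch, seq_len, dim), the mixed text/image word vectors.
        h_fwd, _ = self.fwd(vecs)
        h_bwd, _ = self.bwd(torch.flip(vecs, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])
        # Per-direction token logits; their likelihoods serve as the loss.
        return self.out(h_fwd), self.out(h_bwd)

logits_f, logits_b = BiLM(vocab_size=5000)(torch.randn(2, 7, 300))
```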
207. The server stores semantic features of the plurality of images in a database according to the category identification of the plurality of images.
The category identifier may be a category label or an identifier code mapped from the category label. For example, the category identifier may be the category label "no language"; and when a mapping between the category label "no language" and the identifier code "3055" is established in the server, the category identifier may be "3055".
In the above process, the server may map the category labels of the images to obtain their category identifiers, and store the semantic features in the database as key-value pairs; that is, the server stores the category identifiers of the images as keys and the semantic features of the images as values, which facilitates subsequent reads from the database.
In some embodiments, when the server does not perform steps 201-203, each image uniquely corresponds to a semantic feature, so the server may instead store the semantic features of the images in the database according to their image identifiers, again facilitating subsequent reads from the database.
Fig. 6 is a schematic diagram of the database storage provided by an embodiment of the present invention. Referring to Fig. 6, the server stores the semantic feature of each image into the database according to its identifier. The database illustrated in Fig. 6 is a word vector library that also stores the context information: text word vectors and image word vectors are stored separately, and each corresponds to an ID (identification code); an image word vector may correspond to an image ID or a category ID. In some embodiments, the word vectors with IDs 1 to n are text word vectors and those with IDs n+1 to N are image word vectors, where n is any number greater than or equal to 1 and N is any number greater than n.
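A minimal sketch of the key-value storage of step 207, with a plain Python dict standing in for the database of Fig. 6; any real key-value store would serve the same role.

```python
# Illustrative sketch of step 207: identifier -> semantic feature.
import numpy as np

feature_db = {}  # stands in for the database of Fig. 6

def store(identifier, semantic_feature):
    feature_db[identifier] = semantic_feature

def lookup(identifier):
    return feature_db.get(identifier)  # None if not stored

# e.g. category label "no language" mapped to identifier "3055":
store("3055", np.zeros(300))
assert lookup("3055") is not None
```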
208. When an image processing instruction carrying a target image is received, semantic features of the target image are acquired from a database, and image processing is performed based on the semantic features of the target image.
The image processing instruction may carry a target image and a processing type, where the processing type may be semantic segmentation, image classification, image generation, and the like, and optionally, the image processing instruction may also carry an image identifier of the target image.
Since the server stored the semantic features of the images in the database in step 207, when the target image hits any image in the database, the semantic feature of that image can be read directly from the database, and image processing is performed based on it, which greatly reduces the image processing time.
During database-based image processing, when an image processing instruction is received, the server can obtain the image carried by the instruction, cluster the image to obtain its category label, obtain the semantic feature (word vector) corresponding to that category label from the database, and perform image processing based on that semantic feature, which saves the time of extracting image features and improves the efficiency of image processing.
In some embodiments, when the server stores the semantic features according to image identifiers and the image processing instruction also carries the image identifier of the target image, the server may determine the semantic feature corresponding to that image identifier in the database as the semantic feature of the target image, thereby quickly obtaining it and facilitating execution of the downstream image processing task.
In this determination process, the server may use the image identifier of the target image as an index and look up the indexed content (the semantic feature of an image) in the database; when the index hits the semantic feature of some image, that semantic feature is determined as the semantic feature of the target image.
In some embodiments, when the server stores the semantic features according to category identifiers, the server may cluster the target image directly (similarly to step 202 above, not repeated here), obtain its category label after clustering, map that category label to the category identifier of the target image, and determine the semantic feature corresponding to that category identifier in the database as the semantic feature of the target image, thereby handling image processing needs in a wider range of scenarios. In this determination process, the server may likewise use the category identifier of the target image as an index and perform similar steps to determine the semantic feature of the target image; details are not repeated here.
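The two lookup paths of step 208 can be sketched together as below; `instruction` is assumed to be a dict and `cluster_to_category` is a hypothetical stand-in for the KNN procedure of step 202.

```python
# Illustrative sketch of step 208: image-ID lookup first, otherwise
# cluster the target image and look up by category identifier.
def target_semantic_feature(instruction, feature_db, cluster_to_category):
    image_id = instruction.get("image_id")
    if image_id is not None and image_id in feature_db:
        return feature_db[image_id]          # direct hit by image ID
    category_id = cluster_to_category(instruction["image"])
    return feature_db.get(category_id)       # hit by category ID
```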
In steps 207-208, the server performs image processing based on the semantic features of the images. Because the language model can extract deep, semantic-level features of images, the computer device can understand image semantics better; that is, it can complete image processing tasks in scenarios with high demands on image semantics. Several examples of such tasks are described below; the image processing tasks include but are not limited to the following examples.
In some embodiments, the image processing task may be image generation. The image generation instruction may carry a text or an image, so the server can obtain from the database the semantic feature with the highest contextual match to the semantic feature of that text or image, and determine the image corresponding to that semantic feature as the output image. An output image with the highest contextual match is thus obtained from the image generation instruction. In human-machine conversation scenarios, the output image can be sent to the terminal, so that the server replies to the terminal with an image, which makes the conversation more engaging and improves the intelligence of the chat robot.
In some embodiments, the image processing task may also be image classification. The image classification instruction may carry a plurality of images to be processed; the server obtains their semantic features according to the process in step 208, and images whose pairwise semantic-feature similarity exceeds a target threshold can be determined to belong to the same class. Image classification is thereby achieved based on the semantic features, grouping images with similar semantics, which achieves a better classification effect than the traditional method of classifying by pixel features.
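The semantic-feature classification just described can be sketched as a similarity-threshold grouping; the use of cosine similarity and the single-pass greedy merge are assumptions of the sketch, not prescribed by the embodiment.

```python
# Illustrative sketch: group images whose semantic features are more
# similar than a target threshold (greedy single-pass merge).
import numpy as np

def classify(features, threshold=0.8):
    feats = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    sims = (feats / norms) @ (feats / norms).T  # pairwise cosine similarity
    labels = list(range(len(feats)))
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            if sims[i, j] > threshold:
                labels[j] = labels[i]           # put j in i's class
    return labels
```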
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
In the method provided by the embodiments of the present invention, a plurality of pieces of context information for a plurality of images are acquired. Because each piece of context information is at least one item of text information before or after the position of an image in a text scene, after the images and the context information are input into the language model, features of the images can be extracted through the language model under the influence of the context information, yielding the semantic features of the images. Because these semantic features are vector representations of the non-visual meaning expressed by each image as a whole at the semantic level, the server can understand image semantics better; therefore, when image processing is performed based on these semantic features, image processing tasks in scenarios with high demands on image semantics can be completed, and the accuracy of image processing is improved.
Further, the images are first clustered and then their word vectors are initialized based on the category labels, so that even if some images appear in only a few dialogues and their context information is not rich enough, the clustering method can still improve the accuracy of their semantic features, shorten the training time of the language model, and reduce the training computation.
Further, images with the same category label can be assigned the same first initial word vector, so that they share the first initial word vector and hence share parameter adjustments during the subsequent iterative training, with the result that images with the same category label end up with the same semantic features.
Further, when the images include a first image, text is extracted from the first image and embedded to obtain word vectors of at least one word, and the average of those word vectors is taken as the first initial word vector of the first image; when the images include a second image, a random word vector is obtained as the first initial word vector of the second image. The first and second images are thus each initialized in a targeted manner, which optimizes the processing logic of the word vector initialization and shortens the training time. Separately, the context information is embedded and pre-trained word vectors are obtained as the second initial word vectors, which further shortens the time required for training.
Further, the semantic features of the images are stored in the database according to their image identifiers or category identifiers, so the trained semantic features are kept available for later image processing tasks. When an image processing instruction carrying a target image is received, the semantic feature of the target image is obtained from the database and image processing is performed based on it, which greatly improves the efficiency and accuracy of the image processing.
Fig. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention, referring to fig. 7, the apparatus includes:
an acquiring module 701, configured to acquire a plurality of pieces of context information of a plurality of images, where the plurality of pieces of context information is at least one item of text information before or after a location of the image in a text scene;
the feature extraction module 702 is configured to input the plurality of images and the plurality of context information into a language model, and perform feature extraction on the plurality of images through the language model and the plurality of context information to obtain semantic features of the plurality of images;
an image processing module 703, configured to perform image processing based on semantic features of the plurality of images.
In the apparatus provided by the embodiments of the present invention, a plurality of pieces of context information for a plurality of images are acquired. Because each piece of context information is at least one item of text information before or after the position of an image in a text scene, after the images and the context information are input into the language model, features of the images can be extracted through the language model under the influence of the context information, yielding the semantic features of the images. Because these semantic features are vector representations of the non-visual meaning expressed by each image as a whole at the semantic level, the server can understand image semantics better; therefore, when image processing is performed based on these semantic features, image processing tasks in scenarios with high demands on image semantics can be completed, and the accuracy of image processing is improved.
In one possible implementation, the feature extraction module 702 includes:
an acquisition unit configured to acquire a plurality of first initial word vectors corresponding to the plurality of images;
the obtaining unit is further configured to obtain a plurality of second initial word vectors, where the plurality of second initial word vectors correspond to the plurality of context information;
The iterative training unit is used for carrying out iterative training on the language model based on the plurality of first initial word vectors and the plurality of second initial word vectors;
and the obtaining unit is used for obtaining the semantic features of the plurality of images when the loss function value is smaller than a target threshold value or the iteration number reaches a target number.
In one possible implementation, the obtaining unit is configured to:
inputting the plurality of images into a pixel feature extraction model, and extracting pixel features of the plurality of images through the pixel feature extraction model;
clustering the plurality of images according to the pixel characteristics of the plurality of images to obtain category labels of the plurality of images;
images with the same category labels are assigned the same random word vector as the first initial word vector.
In one possible implementation, the obtaining unit is configured to:
when a first image is included in the plurality of images, extracting text from the first image, embedding the text to obtain a word vector of at least one word in the text, and acquiring an average vector of the word vectors of the at least one word as a first initial word vector corresponding to the first image, wherein the first image is an image carrying the text;
When a second image is included in the plurality of images, the random word vector is acquired as a first initial word vector corresponding to the second image, and the second image is an image without text.
In one possible implementation, the obtaining unit is configured to:
and embedding the plurality of context information, and acquiring the pre-trained word vectors as the plurality of second initial word vectors.
In one possible implementation, the iterative training unit is configured to:
in the process of carrying out iterative training on the language model, keeping the plurality of second initial word vectors unchanged, and adjusting the numerical values of the plurality of first initial word vectors to obtain a plurality of first word vectors;
when the loss function value is smaller than the target threshold or the iteration number reaches the target number, obtaining semantic features of the plurality of images includes:
and determining the first word vectors as semantic features of the images when the loss function value is smaller than a target threshold or the iteration number reaches a target number.
In one possible implementation, the image processing module 703 includes:
the storage processing unit is used for storing the semantic features of the images into a database according to the image identifications or the category identifications of the images, acquiring the semantic features of the target image from the database when receiving an image processing instruction carrying the target image, and carrying out image processing based on the semantic features of the target image.
In one possible implementation, the storage processing unit is configured to:
when the image processing instruction also carries the image identifier of the target image, determining the semantic feature corresponding to the image identifier in the database as the semantic feature of the target image; or
clustering the target image to obtain the category identifier of the target image, and determining the semantic features corresponding to the category identifier in the database as the semantic features of the target image.
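A sketch of both lookup branches, with a plain in-memory dictionary standing in for the database and a caller-supplied classification hook standing in for the clustering step; all names here are assumptions of the sketch:

    import numpy as np

    class SemanticFeatureStore:
        """Features keyed both by image identifier and by category identifier;
        a dict stands in for the database."""

        def __init__(self):
            self.by_image_id = {}
            self.by_category_id = {}

        def store(self, image_id, category_id, feature):
            self.by_image_id[image_id] = feature
            self.by_category_id[category_id] = feature

        def lookup(self, image_id=None, image=None, classify=None):
            # Branch 1: the instruction also carries the image identifier.
            if image_id is not None and image_id in self.by_image_id:
                return self.by_image_id[image_id]
            # Branch 2: cluster the target image to get its category identifier.
            return self.by_category_id.get(classify(image))

    # Toy usage: a fixed "classifier" stands in for real clustering.
    store = SemanticFeatureStore()
    store.store("img-1", "cat-A", np.ones(64))
    by_id = store.lookup(image_id="img-1")
    by_category = store.lookup(image=object(), classify=lambda im: "cat-A")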
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
It should be noted that in the image processing apparatus provided in the above embodiment, the division into the functional modules described above is merely illustrative. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the computer device may be divided into different functional modules to perform all or part of the functions described above. In addition, the image processing apparatus and the image processing method provided in the foregoing embodiments belong to the same concept; for the specific implementation of the apparatus, refer to the method embodiment, which is not repeated here.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 800 may vary considerably in configuration and performance, and may include one or more processors (central processing units, CPUs) 801 and one or more memories 802, where the memory 802 stores at least one instruction that is loaded and executed by the processor 801 to implement the image processing method provided by the method embodiments above. Of course, the computer device may also include a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including at least one instruction, is also provided; the instruction is executable by a processor in a terminal to perform the image processing method of the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within its scope of protection.

Claims (16)

1. An image processing method, the method comprising:
acquiring a plurality of pieces of context information of a plurality of images, wherein each piece of context information is at least one item of text information located before or after the position of the image in a text scene;
inputting the plurality of images and the plurality of context information into a language model; acquiring a plurality of first initial word vectors, wherein the plurality of first initial word vectors correspond to the plurality of images; acquiring a plurality of second initial word vectors, wherein the plurality of second initial word vectors correspond to the plurality of context information;
iteratively training the language model based on the plurality of first initial word vectors and the plurality of second initial word vectors;
obtaining semantic features of the plurality of images when the loss function value is smaller than a target threshold or the number of iterations reaches a target number;
and performing image processing based on the semantic features of the plurality of images.
2. The method of claim 1, wherein the obtaining a plurality of first initial word vectors comprises:
inputting the plurality of images into a pixel feature extraction model, and extracting pixel features of the plurality of images through the pixel feature extraction model;
clustering the plurality of images according to the pixel characteristics of the plurality of images to obtain category labels of the plurality of images;
assigning the same random word vector to images having the same category label, as their first initial word vectors.
3. The method of claim 1, wherein the obtaining a plurality of first initial word vectors comprises:
when a first image is included in the plurality of images, extracting text from the first image, performing embedding processing on the text to obtain word vectors of at least one word in the text, and acquiring an average vector of the word vectors of the at least one word as a first initial word vector corresponding to the first image, wherein the first image is an image carrying the text;
when a second image is included in the plurality of images, acquiring a random word vector as the first initial word vector corresponding to the second image, wherein the second image is an image without text.
4. The method of claim 1, wherein the obtaining a plurality of second initial word vectors comprises:
performing embedding processing on the context information, and acquiring pre-trained word vectors as the second initial word vectors.
5. The method of claim 1, wherein the iteratively training the language model based on the plurality of first initial word vectors and the plurality of second initial word vectors comprises:
in the process of carrying out iterative training on the language model, keeping the plurality of second initial word vectors unchanged, and adjusting the numerical values of the plurality of first initial word vectors to obtain a plurality of first word vectors;
when the loss function value is smaller than a target threshold value or the iteration number reaches a target number, obtaining semantic features of the plurality of images includes:
determining the plurality of first word vectors as the semantic features of the plurality of images when the loss function value is smaller than a target threshold or the number of iterations reaches a target number.
6. The method of claim 1, wherein the image processing based on semantic features of the plurality of images comprises:
according to the image identification or the category identification of the images, the semantic features of the images are stored in a database, when an image processing instruction carrying a target image is received, the semantic features of the target image are obtained from the database, and image processing is carried out based on the semantic features of the target image.
7. The method of claim 6, wherein the retrieving semantic features of the target image from the database comprises:
when the image processing instruction also carries the image identifier of the target image, determining semantic features corresponding to the image identifier in the database as the semantic features of the target image; or
clustering the target image, acquiring a category identifier of the target image, and determining semantic features corresponding to the category identifier in the database as the semantic features of the target image.
8. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a plurality of pieces of context information of a plurality of images, wherein the context information is at least one item of text information located before or after the position of the image in a text scene;
a feature extraction module for inputting the plurality of images and the plurality of context information into a language model;
the feature extraction module further includes: an acquisition unit, an iterative training unit, and an obtaining unit;
the acquisition unit is used for acquiring a plurality of first initial word vectors, and the plurality of first initial word vectors correspond to the plurality of images; acquiring a plurality of second initial word vectors, wherein the plurality of second initial word vectors correspond to the plurality of context information;
the iterative training unit is used for iteratively training the language model based on the first initial word vectors and the second initial word vectors;
the obtaining unit is used for obtaining semantic features of the plurality of images when the loss function value is smaller than a target threshold value or the iteration number reaches a target number;
and the image processing module is used for processing the images based on the semantic features of the images.
9. The apparatus of claim 8, wherein the acquisition unit is configured to:
inputting the plurality of images into a pixel feature extraction model, and extracting pixel features of the plurality of images through the pixel feature extraction model;
clustering the plurality of images according to the pixel characteristics of the plurality of images to obtain category labels of the plurality of images;
assigning the same random word vector to images having the same category label, as their first initial word vectors.
10. The apparatus of claim 8, wherein the acquisition unit is configured to:
when a first image is included in the plurality of images, extracting text from the first image, performing embedding processing on the text to obtain word vectors of at least one word in the text, and acquiring an average vector of the word vectors of the at least one word as a first initial word vector corresponding to the first image, wherein the first image is an image carrying the text;
when a second image is included in the plurality of images, acquiring a random word vector as the first initial word vector corresponding to the second image, wherein the second image is an image without text.
11. The apparatus of claim 8, wherein the acquisition unit is configured to:
performing embedding processing on the context information, and acquiring pre-trained word vectors as the second initial word vectors.
12. The apparatus of claim 8, wherein the iterative training unit is configured to:
in the process of carrying out iterative training on the language model, keeping the plurality of second initial word vectors unchanged, and adjusting the numerical values of the plurality of first initial word vectors to obtain a plurality of first word vectors;
when the loss function value is smaller than a target threshold value or the iteration number reaches a target number, obtaining semantic features of the plurality of images includes:
determining the plurality of first word vectors as the semantic features of the plurality of images when the loss function value is smaller than a target threshold or the number of iterations reaches a target number.
13. The apparatus of claim 8, wherein the image processing module comprises:
the storage processing unit is used for storing semantic features of the images into a database according to the image identifications or the category identifications of the images, acquiring the semantic features of the target images from the database when receiving an image processing instruction carrying the target images, and carrying out image processing based on the semantic features of the target images.
14. The apparatus of claim 13, wherein the storage processing unit is configured to:
when the image processing instruction also carries the image identifier of the target image, determining semantic features corresponding to the image identifier in the database as the semantic features of the target image; or
clustering the target image, acquiring a category identifier of the target image, and determining semantic features corresponding to the category identifier in the database as the semantic features of the target image.
15. A computer device comprising one or more processors and one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to implement the operations performed by the image processing method of any of claims 1 to 7.
16. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the image processing method of any one of claims 1 to 7.
CN201910360905.2A 2019-04-30 2019-04-30 Image processing method, device, computer equipment and storage medium Active CN110163121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910360905.2A CN110163121B (en) 2019-04-30 2019-04-30 Image processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110163121A CN110163121A (en) 2019-08-23
CN110163121B true CN110163121B (en) 2023-09-05

Family

ID=67633043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910360905.2A Active CN110163121B (en) 2019-04-30 2019-04-30 Image processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110163121B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866908B (en) * 2019-11-12 2021-03-26 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, server, and storage medium
CN111080660B (en) * 2019-11-14 2023-08-08 中国科学院深圳先进技术研究院 Image segmentation method, device, terminal equipment and storage medium
CN111385188A (en) * 2019-11-22 2020-07-07 百度在线网络技术(北京)有限公司 Recommendation method and device for dialog elements, electronic equipment and medium
CN111666439B (en) * 2020-05-28 2021-07-13 广东唯仁医疗科技有限公司 Working method for rapidly extracting and dividing medical image big data aiming at cloud environment
CN111783557B (en) * 2020-06-11 2023-08-15 北京科技大学 Wearable blind guiding equipment based on depth vision and server
CN112364200B (en) * 2021-01-15 2021-04-13 清华大学 Brain-like imaging method, device, equipment and storage medium
CN112861934A (en) * 2021-01-25 2021-05-28 深圳市优必选科技股份有限公司 Image classification method and device of embedded terminal and embedded terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106462574A (en) * 2014-06-24 2017-02-22 谷歌公司 Techniques for machine language translation of text from an image based on non-textual context information from the image
CN107423277A (en) * 2016-02-16 2017-12-01 中兴通讯股份有限公司 A kind of expression input method, device and terminal
WO2018049960A1 (en) * 2016-09-14 2018-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resource for text information
CN109034203A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Training, expression recommended method, device, equipment and the medium of expression recommended models
CN109255047A (en) * 2018-07-18 2019-01-22 西安电子科技大学 Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve
CN109447990A (en) * 2018-10-22 2019-03-08 北京旷视科技有限公司 Image, semantic dividing method, device, electronic equipment and computer-readable medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10504010B2 (en) * 2015-10-02 2019-12-10 Baidu Usa Llc Systems and methods for fast novel visual concept learning from sentence descriptions of images
US10628668B2 (en) * 2017-08-09 2020-04-21 Open Text Sa Ulc Systems and methods for generating and using semantic images in deep learning for classification and data extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant