CN110163121A - Image processing method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN110163121A CN110163121A CN201910360905.2A CN201910360905A CN110163121A CN 110163121 A CN110163121 A CN 110163121A CN 201910360905 A CN201910360905 A CN 201910360905A CN 110163121 A CN110163121 A CN 110163121A
- Authority
- CN
- China
- Prior art keywords
- image
- term vector
- multiple images
- semantic feature
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image processing method, an apparatus, a computer device, and a storage medium, belonging to the field of network technology. The method comprises: obtaining multiple pieces of context information for multiple images; inputting the multiple images and the multiple pieces of context information into a language model, and performing feature extraction on the multiple images through the language model and the multiple pieces of context information to obtain semantic features of the multiple images; and performing image processing based on the semantic features of the multiple images. By extracting the semantic features of images through a language model, the invention can complete image processing tasks in scenarios with high demands on image semantics, improving the accuracy of image processing.
Description
Technical field
The present invention relates to the field of network technology, and in particular to an image processing method, an apparatus, a computer device, and a storage medium.
Background technique
In social interaction today, images express a user's intended meaning more intuitively than text, and are more vivid and engaging. With the development of computer devices, a computer device can help users understand images; that is, the computer device can perform feature extraction on an image, thereby helping to improve the efficiency of operations such as replying to image messages or image development.
In traditional feature extraction methods, a computer device usually extracts an image's superficial, pixel-level features through a VGG (Visual Geometry Group) network. A VGG network is a model trained for a specific scenario (for example, image classification, image segmentation, or image recognition), so during feature extraction it can only extract the superficial features of interest to that scenario. For example, a VGG segmentation network extracts the pixel-level boundary information of each segmented region (such as in mammary gland segmentation), while a VGG classification network extracts the pixel-level class label of an image (such as cat-versus-dog classification).
In the above process, a VGG network can only extract an image's superficial pixel-level features and cannot fully understand image semantics, so in scenarios with high demands on image semantics, the accuracy of image processing is low.
Summary of the invention
Embodiments of the present invention provide an image processing method, an apparatus, a computer device, and a storage medium, which can solve the problem that a computer device cannot fully understand image semantics, causing low accuracy of image processing in scenarios with high demands on image semantics. The technical solution is as follows:
In one aspect, an image processing method is provided, the method comprising:
obtaining multiple pieces of context information for multiple images, the multiple pieces of context information being at least one of the text information before or after the position of the images in a text scene;
inputting the multiple images and the multiple pieces of context information into a language model, and performing feature extraction on the multiple images through the language model and the multiple pieces of context information to obtain semantic features of the multiple images;
performing image processing based on the semantic features of the multiple images.
In a possible embodiment, obtaining multiple second initial word vectors comprises:
performing embedding processing on the multiple pieces of context information, and obtaining pre-trained word vectors as the multiple second initial word vectors.
In one aspect, an image processing apparatus is provided, the apparatus comprising:
an obtaining module, configured to obtain multiple pieces of context information for multiple images, the multiple pieces of context information being at least one of the text information before or after the position of the images in a text scene;
a feature extraction module, configured to input the multiple images and the multiple pieces of context information into a language model, and perform feature extraction on the multiple images through the language model and the multiple pieces of context information to obtain semantic features of the multiple images;
an image processing module, configured to perform image processing based on the semantic features of the multiple images.
In a possible embodiment, the feature extraction module comprises:
an obtaining unit, configured to obtain multiple first initial word vectors, the multiple first initial word vectors corresponding to the multiple images;
the obtaining unit being further configured to obtain multiple second initial word vectors, the multiple second initial word vectors corresponding to the multiple pieces of context information;
an iterative training unit, configured to iteratively train the language model based on the multiple first initial word vectors and the multiple second initial word vectors;
and an obtaining unit, configured to obtain the semantic features of the multiple images when the loss function value is less than a target threshold or the number of iterations reaches a target number.
In a possible embodiment, the obtaining unit is configured to:
input the multiple images into a pixel feature extraction model, and extract the pixel features of the multiple images through the pixel feature extraction model;
perform clustering processing on the multiple images according to their pixel features to obtain class labels for the multiple images;
and assign the same random word vector as the first initial word vector to images with the same class label.
In a possible embodiment, the obtaining unit is configured to:
when the multiple images include a first image, extract text from the first image, perform embedding processing on the text to obtain a word vector for at least one word in the text, and obtain the average of the word vectors of the at least one word as the first initial word vector corresponding to the first image, the first image being an image that carries text;
when the multiple images include a second image, obtain a random word vector as the first initial word vector corresponding to the second image, the second image being an image that does not carry text.
In a possible embodiment, the obtaining unit is configured to:
perform embedding processing on the multiple pieces of context information, and obtain pre-trained word vectors as the multiple second initial word vectors.
In a possible embodiment, the iterative training unit is configured to:
during the iterative training of the language model, keep the multiple second initial word vectors fixed and adjust the values of the multiple first initial word vectors to obtain multiple first word vectors;
and obtaining the semantic features of the multiple images when the loss function value is less than the target threshold or the number of iterations reaches the target number comprises:
when the loss function value is less than the target threshold or the number of iterations reaches the target number, determining the multiple first word vectors as the semantic features of the multiple images.
In a possible embodiment, the image processing module comprises:
a storage processing unit, configured to store the semantic features of the multiple images into a database according to the image identifiers or class identifiers of the multiple images, and, when an image processing instruction carrying a target image is received, obtain the semantic feature of the target image from the database and perform image processing based on the semantic feature of the target image.
In a possible embodiment, the storage processing unit is configured to:
when the image processing instruction also carries the image identifier of the target image, determine the semantic feature corresponding to the image identifier in the database as the semantic feature of the target image; or,
perform clustering processing on the target image to obtain the class identifier of the target image, and determine the semantic feature corresponding to the class identifier in the database as the semantic feature of the target image.
In one aspect, a computer device is provided, the computer device comprising one or more processors and one or more memories, the one or more memories storing at least one instruction, the at least one instruction being loaded and executed by the one or more processors to implement the operations performed by the image processing method of any of the above possible implementations.
In one aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the operations performed by the image processing method of any of the above possible implementations.
The beneficial effects brought by the technical solutions provided in the embodiments of the present invention include at least the following:
Multiple pieces of context information are obtained for multiple images, the context information being at least one of the text information before or after the position of the images in a text scene. After the multiple images and the multiple pieces of context information are input into a language model, feature extraction can be performed on the images through the language model under the influence of the context information, yielding the semantic features of the images. Because these semantic features are vector representations of the non-visual features expressed by the images as a whole at the semantic level, the server can better understand image semantics. Therefore, when image processing is performed based on the semantic features of the images, image processing tasks in scenarios with high demands on image semantics can be completed, improving the accuracy of image processing.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of an image processing method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of an image processing method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a clustering result provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the principle of a language model training process provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the principle of a language model training process provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the principle of database storage provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of an image processing method provided by an embodiment of the present invention. Referring to Fig. 1, the implementation environment may include at least one terminal 101 and a server 102, detailed as follows:
The terminal 101 may be any terminal capable of sending messages or images; after logging in to any terminal, a user can send messages or images to the server 102.
The server 102 may be any computer device capable of providing image processing services. When the server 102 receives an image from any of the at least one terminal 101, it can obtain the semantic feature of the image and perform image processing based on that semantic feature.
The embodiments of the present invention can be applied in human-computer interaction scenarios. In social interaction, users increasingly tend to substitute emoticon images for text messages, intuitively expressing their intended meaning and increasing the fun of social interaction. Therefore, when a user exchanges messages through a terminal with intelligent question-answering products such as chat robots, intelligent assistants, or intelligent customer service, the emoticon images likewise carry semantics that need to be conveyed. After the user sends an emoticon image through the terminal, the server 102 can extract the semantic feature of the emoticon image and perform corresponding image processing. For example, the server 102 can recommend from a database the response image with the highest degree of matching to the emoticon image and send that response image to the terminal. Compared with traditional intelligent question-answering products that can only reply with text messages, this image processing method based on the semantic features of emoticon images makes intelligent question-answering products more personable and intelligent, overcoming the shortcomings of text messages being overly blunt, prone to failing to convey the intended meaning, and insufficiently vivid and engaging.
Fig. 2 is a flowchart of an image processing method provided by an embodiment of the present invention. Referring to Fig. 2, the method is applied to a computer device; the embodiment is described in detail below taking the case where the computer device is a server as an example:
201. The server inputs multiple images into a pixel feature extraction model, and extracts the pixel features of the multiple images through the pixel feature extraction model.
The multiple images may contain any content; for example, they may include emoticon images or non-emoticon images. An emoticon image is an image used to express ideas during human-computer interaction, and it may carry text; further, emoticon images can be divided into portrait expressions, animal expressions, cartoon expressions, and so on.
The pixel feature extraction model is used to extract an image's pixel features, that is, the image's superficial features at the pixel level: the visual features such as texture, color, shape, or boundaries that the image visually presents.
In some embodiments, the pixel feature extraction model may be a CNN (convolutional neural network), a TCN (temporal convolutional network), a VGG (Visual Geometry Group) network, or the like.
Taking a CNN as the pixel feature extraction model as an example, the CNN may include an input layer, at least one convolutional layer, and an output layer connected in series. The input layer is used to decode the input image, the at least one convolutional layer is used to perform convolution processing on the decoded image, and the output layer is used to perform nonlinear processing and normalization on the convolved image. In some embodiments, at least one pooling layer may be inserted between the convolutional layers; a pooling layer is used to compress the feature map output by a convolutional layer, reducing the size of the feature map.
In some embodiments, residual connections may be used between the convolutional layers. A residual connection means that, for each convolutional layer, the feature map output by a preceding convolutional layer can be added to the feature map output by the current convolutional layer to obtain a residual block, and the residual block is used as an input feature map of the next convolutional layer, which helps solve the degradation problem of deep networks. For example, a residual connection may be made after every convolutional layer, or after every two convolutional layers; the embodiment of the present invention does not specifically limit the number of convolutional layers between residual connections.
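As a rough illustration (not the patent's own implementation), the residual connection described above can be sketched in NumPy. The toy `conv_layer` stand-in, the layer shapes, and the random weights are all assumptions chosen so that input and output shapes match, which the element-wise skip addition requires:

```python
import numpy as np

def conv_layer(x, w):
    """Toy stand-in for a convolutional layer: a linear map followed by
    ReLU. Shapes are chosen so input and output match, as the residual
    addition requires."""
    return np.maximum(0.0, x @ w)

def residual_block(x, w1, w2):
    """Two stacked layers with a residual (skip) connection: the block
    input x is added to the output of the second layer, mirroring
    'a residual connection after every two convolutional layers'."""
    out = conv_layer(x, w1)
    out = conv_layer(out, w2)
    return out + x  # skip connection: input feature map + block output

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # a toy "feature map"
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 8): same shape as the input, as the skip requires
```

Note the design consequence: because the skip path is the identity, a block with zero weights simply passes its input through, which is what counteracts the degradation problem mentioned above.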
In the above case, step 201 means: the server inputs the multiple images into a pixel feature extraction model in the form of a convolutional neural network, performs convolution processing on the multiple images through the at least one convolutional layer in the convolutional neural network, and outputs the pixel features of the multiple images. Of course, when the pixel feature extraction model is a temporal convolutional network, a similar process can be performed, except that each convolutional layer in the temporal convolutional network performs causal convolution on the multiple images, which is not described here.
In some embodiments, the pixel feature extraction model may also be a VGG network (a special kind of CNN). A VGG network includes multiple convolutional layers and multiple pooling layers; each convolutional layer uses small 3x3 convolution kernels, each pooling layer uses 2x2 max pooling kernels, and a residual connection is made after every two convolutional layers, so that after each pooling the image size is halved and the depth is doubled as the VGG network deepens, simplifying the structure of the CNN. For example, the VGG network may be VGG-16 or the like; the embodiment of the present invention does not specifically limit the depth of the VGG network.
202. The server performs clustering processing on the multiple images according to their pixel features to obtain class labels for the multiple images.
In the above process, the server can perform clustering processing on the multiple images based on a KNN (k-nearest neighbor) algorithm, using the multiple similarities corresponding to the pixel features of the multiple images, to obtain the class labels of the multiple images.
In some embodiments, the server can construct a KNN model based on a training image set containing multiple training images, each training image including a pixel feature and a class label. It should be noted that, when training precision requirements are low, the number of class labels in the training image set can be set below a first target number, so that clustering completes quickly; when training precision requirements are high, the number of class labels in the training image set can be set greater than or equal to the first target number, so that the classes of the images can be divided more finely during clustering. The first target number can be any value greater than or equal to 1.
In the above case, the server can input the pixel features of the multiple images into the KNN model in turn, obtain through the KNN model the multiple similarities between the pixel features of the multiple images and the pixel features of the multiple training images, and obtain the class labels of the multiple images according to the multiple similarities.
Specifically, taking any one of the multiple images as an example: after the server inputs the image into the KNN model, it obtains, based on the KNN model, the multiple similarities between the image and the multiple training images in the training image set, sorts the multiple training images in descending order of similarity, and determines, among the class labels of the top second-target-number training images by similarity, the class label with the highest number of occurrences as the class label of the image. The second target number can be any value greater than or equal to 1.
In some embodiments, the server can obtain the reciprocal of the Euclidean distance between the pixel feature of any image and the pixel feature of any training image as the similarity between the two. Since the Euclidean distance measures the absolute distance between different pixel features in feature space, the reciprocal of the Euclidean distance describes the similarity between the pixel feature of an image and the pixel feature of a training image well.
In some embodiments, the server can also obtain the reciprocal of the Manhattan distance between the pixel feature of any image and the pixel feature of any training image as the similarity between the two. Since the Manhattan distance measures the absolute axis-wise distance of different image features in feature space (the axes here being coordinate axes), the reciprocal of the Manhattan distance can also describe the similarity between the pixel feature of an image and the pixel feature of a training image well.
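The two similarity measures above (the reciprocal of the Euclidean distance and the reciprocal of the Manhattan distance) can be sketched as follows. The small epsilon guarding against division by zero when two features coincide is an illustrative assumption, not something the patent specifies:

```python
import numpy as np

def similarity(a, b, metric="euclidean", eps=1e-12):
    """Similarity between two pixel-feature vectors as the reciprocal
    of their distance. eps avoids division by zero when the two
    features coincide (an illustrative choice)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    if metric == "euclidean":
        dist = np.sqrt(np.sum(diff ** 2))   # absolute distance in feature space
    elif metric == "manhattan":
        dist = np.sum(np.abs(diff))         # absolute axis-wise distance
    else:
        raise ValueError("unknown metric")
    return 1.0 / (dist + eps)

p = [1.0, 2.0, 2.0]
q = [1.0, 0.0, 2.0]
print(similarity(p, q, "euclidean"))   # distance 2, so similarity is about 0.5
print(similarity(p, q, "manhattan"))   # distance 2, so similarity is about 0.5
```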
For example, suppose the pixel feature of an image P is input into a KNN model whose training image set includes 20 training images. The server determines the reciprocals of the Euclidean distances between the pixel feature of image P and the pixel features of the 20 training images as 20 corresponding similarities, sorts the 20 training images in descending order of similarity, and obtains the top 5. Since 4 of the top 5 training images belong to class label A and only 1 belongs to class label B, the class label A with the highest number of occurrences is determined as the class label of image P.
In some embodiments, after obtaining the class label of an image, the server can also add the image's pixel feature and class label to the training image set of the KNN model, so that during the clustering of multiple images the training image set of the KNN model is continuously expanded, improving the clustering accuracy of the KNN model.
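The KNN classification of step 202, including the running expansion of the training set, can be sketched as follows. The toy feature vectors and labels are invented for illustration; a real system would use the pixel features from step 201:

```python
import numpy as np
from collections import Counter

def knn_label(feature, train_feats, train_labels, k):
    """Sort training images by similarity (reciprocal Euclidean
    distance) and return the most frequent label among the top k."""
    feature = np.asarray(feature, dtype=float)
    sims = [1.0 / (np.linalg.norm(feature - np.asarray(f, dtype=float)) + 1e-12)
            for f in train_feats]
    top_k = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return Counter(train_labels[i] for i in top_k).most_common(1)[0][0]

# Invented toy training set: two clusters of pixel features.
train_feats = [[0, 0], [0, 1], [1, 0], [1, 1], [9, 9], [9, 8]]
train_labels = ["A", "A", "A", "A", "B", "B"]

label = knn_label([0.5, 0.5], train_feats, train_labels, k=5)
print(label)  # "A": 4 of the top-5 neighbours carry label A, as in the example

# Expand the training set with the newly labelled image, as described
# above, so later clustering benefits from it.
train_feats.append([0.5, 0.5])
train_labels.append(label)
```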
Fig. 3 is a schematic diagram of a clustering result provided by an embodiment of the present invention. Referring to Fig. 3, suppose the training image set of the KNN model includes the class labels "staring", "speechless", and "looking up". After 8 images are input into the KNN model, a clustering result as shown in Fig. 3 can be obtained.
203. The server assigns the same random word vector to images with the same class label as their first initial word vector.
A random word vector can be any randomly generated word vector, and a first initial word vector is the word vector obtained by initializing any image.
Through steps 201-203, the server obtains multiple first initial word vectors corresponding to the multiple images. Further, the server can assign the same first initial word vector to images with the same class label, so that those images share a first initial word vector and thus share parameter adjustments during subsequent iterative training, so that images with the same class label ultimately have the same semantic feature.
In the above process, the server first performs clustering processing on the images and then initializes their word vectors based on class labels. In this way, if some images appear in only a small number of dialogues, so that their context information is not rich enough, the clustering processing can improve the accuracy of the semantic features of those images, while also reducing the training duration and the training computation of the language model in subsequent training.
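Step 203's shared initialization can be sketched as drawing one random word vector per class label and mapping every image in that cluster to it. The vector dimension, the random seed, and the image identifiers are illustrative assumptions:

```python
import numpy as np

def init_first_vectors(image_labels, dim=16, seed=0):
    """Assign the same random word vector to all images sharing a
    class label, so clustered images share one first initial word
    vector (and hence share parameter updates during training)."""
    rng = np.random.default_rng(seed)
    label_vec = {}  # one random vector per class label
    vectors = {}
    for image_id, label in image_labels.items():
        if label not in label_vec:
            label_vec[label] = rng.standard_normal(dim)
        vectors[image_id] = label_vec[label]
    return vectors

labels = {"img1": "staring", "img2": "staring", "img3": "speechless"}
vecs = init_first_vectors(labels)
print(np.array_equal(vecs["img1"], vecs["img2"]))  # True: same cluster, shared vector
print(np.array_equal(vecs["img1"], vecs["img3"]))  # False: different clusters
```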
It should be noted that, in the embodiments of the present invention, the extracted semantic feature refers to a vector representation of the non-visual features expressed by the image as a whole at the semantic level, not a vector representation of traditional visual image features expressed at the pixel level.
In some embodiments, the server may also skip steps 201-203, that is, not perform clustering processing on the multiple images, but directly initialize a word vector for each image. In that case, steps 201-203 can be replaced with the following method: when the multiple images include a first image, extract text from the first image, perform embedding processing on the text to obtain a word vector for at least one word in the text, and obtain the average of the word vectors of the at least one word as the first initial word vector corresponding to the first image, where the first image is an image that carries text. Optionally, when the multiple images include a second image, the server can also obtain a random word vector as the first initial word vector corresponding to the second image, where the second image is an image that does not carry text. In this way, targeted initialization is performed for first images and second images respectively, optimizing the processing logic of the word vector initialization process and shortening the training duration.
In the above process, when extracting text from the multiple images, OCR (optical character recognition) technology can be used to recognize the text in the multiple images. The server does not perform clustering processing here; instead, the semantic feature of each individual image is trained during iterative training, making the semantic features of the images more targeted.
For example, Fig. 4 is a schematic diagram of the principle of a language model training process provided by an embodiment of the present invention. Referring to Fig. 4, suppose text 401, image 402, image 403, and text 404 together form a dialogue between user A and user B. During language model training, this dialogue is treated as a long text carrying images. For text 401 and text 404, pre-trained word vectors are obtained as their second initial word vectors based on the method in step 204 below. For image 402, since it is a first image carrying text, the server extracts the text "looking up at a big shot" in image 402 based on OCR technology, obtains the word vectors of the words in the text, and determines the average of those word vectors as the first initial word vector of image 402. For image 403, since it is a second image carrying no text, the server can directly obtain any random word vector as the first initial word vector of image 403. Thus, based on word embedding, the server completes the initialization of all texts and images in this long text, and then performs step 205 below.
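The initialization illustrated by Fig. 4 can be sketched as follows. The tiny embedding table and the pre-split OCR text are assumptions for illustration; a real system would use an OCR library and a full pre-trained embedding table:

```python
import numpy as np

# Assumed toy embedding table standing in for pre-trained word vectors.
EMBED = {
    "looking": np.array([1.0, 0.0]),
    "up":      np.array([0.0, 1.0]),
}

def first_initial_vector(ocr_text, dim=2, rng=None):
    """If OCR found text in the image (a 'first image'), return the
    average of its word vectors; otherwise (a 'second image') return
    a random word vector, as in the alternative to steps 201-203."""
    rng = rng or np.random.default_rng(0)
    words = [w for w in ocr_text.split() if w in EMBED] if ocr_text else []
    if words:
        return np.mean([EMBED[w] for w in words], axis=0)
    return rng.standard_normal(dim)

print(first_initial_vector("looking up"))   # [0.5 0.5]: average of the two word vectors
print(first_initial_vector("").shape)       # (2,): random vector for a text-free image
```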
204. The server obtains multiple pieces of context information for the multiple images, performs embedding processing on the multiple pieces of context information, and obtains pre-trained word vectors as multiple second initial word vectors.
Wherein, multiple contextual information is in the text information in text scene before or after image present position
At least one of, optionally, text scene can be session context, and multiple contextual information is image in a session at this time
At least one of in session before present position or session later, optionally, text scene is also possible to carrying figure
The long text scene of picture, multiple contextual information is in the text in long text before image or text later at this time
At least one of, the embodiment of the present invention does not limit the form of text scene specifically.
In step 204, the server obtains multiple second initial word vectors corresponding to the multiple pieces of contextual information. In some embodiments, the server performs embedding processing on the contextual information as follows: for each piece of contextual information, the server multiplies the one-hot encoding corresponding to that contextual information by a pretrained weight matrix, thereby mapping the one-hot encoding into the word-vector space and obtaining a pretrained word vector, and determines that pretrained word vector as the second initial word vector of the contextual information. By obtaining pretrained word vectors, the duration required for training can be shortened.
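The one-hot-times-weight-matrix operation in step 204 amounts to selecting one row of the pretrained weight matrix. A small sketch, with a vocabulary and matrix values invented purely for illustration:

```python
import numpy as np

vocab = ["hello", "world", "image"]       # hypothetical vocabulary
rng = np.random.default_rng(1)
W = rng.normal(size=(len(vocab), 5))      # pretrained weight matrix, shape (V, d)

def second_initial_vector(word):
    """Multiply the word's one-hot encoding by the pretrained weight
    matrix, mapping the word into the word-vector space."""
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.index(word)] = 1.0
    return one_hot @ W                    # equals the row W[vocab.index(word)]

v = second_initial_vector("world")
```

In practice the matrix multiplication is never performed explicitly; an embedding lookup of the corresponding row is equivalent and faster.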
205. The server iteratively trains the language model based on the multiple first initial word vectors and the multiple second initial word vectors.

In the above process, the language model may be any natural language processing (NLP) model. For example, the language model may be an n-gram model (also called an N-gram model), an NNLM (neural network language model), ELMo (Embeddings from Language Models), or BERT (Bidirectional Encoder Representations from Transformers). The embodiment of the present invention does not specifically limit the structure of the language model.
During the iterative training, the server inputs the multiple first initial word vectors and the multiple second initial word vectors into the language model and obtains a loss function value according to the prediction result of the language model. When the loss function value is greater than a target threshold, the parameters of the language model may be adjusted based on the backpropagation (BP) algorithm. The above process is performed iteratively, and training may stop when the loss function value is less than or equal to the target threshold or the number of iterations reaches a target number.
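The two stop criteria in step 205 — loss at or below a target threshold, or a target number of iterations — can be sketched with a generic gradient-descent loop. The quadratic toy loss below is an assumption for illustration only, not the model's actual loss function:

```python
import numpy as np

def iterative_train(params, loss_fn, grad_fn, threshold=1e-3, max_iters=1000, lr=0.1):
    """Adjust parameters by backpropagation-style updates until the loss is
    at or below the target threshold or the iteration budget is used up."""
    loss = loss_fn(params)
    for _ in range(max_iters):
        if loss <= threshold:
            break
        params = params - lr * grad_fn(params)  # parameter adjustment step
        loss = loss_fn(params)
    return params, loss

# Toy quadratic loss pulling the parameters toward a target vector.
target = np.array([1.0, -2.0])
params, loss = iterative_train(
    np.zeros(2),
    loss_fn=lambda p: float(np.sum((p - target) ** 2)),
    grad_fn=lambda p: 2 * (p - target),
)
```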
In the above process, because the text word vectors (second initial word vectors) and the image word vectors (first initial word vectors) are input into the language model together during training, the language model can extract deep semantic-level features of the images using processing logic similar to that used for extracting the semantic features of text. This frees the model from the constraint of pixel features, bridges the processing barrier between text and images, and yields the semantic features of the multiple images.
In some embodiments, while iteratively training the language model, the server may keep the multiple second initial word vectors fixed so that only the values of the multiple first initial word vectors are adjusted during training. This guarantees that, when training stops, the resulting multiple first word vectors and the multiple second initial word vectors lie in the same vector space; that is, the semantic features of the images and the semantic features of the contextual information lie in the same feature space. By this control-variable method, the server makes the semantic features of the images express semantics more accurately.
Fig. 5 is a schematic diagram of a language model training process provided by an embodiment of the present invention. Referring to Fig. 5, assume that text 401, image 402, image 403, and text 404 form a dialogue between user A and user B. During language model training, this dialogue is treated as one long text carrying images. For text 401 and text 404, pretrained word vectors are obtained as second initial word vectors. For image 402 and image 403, the cluster class numbers of image 402 and image 403 are first acquired (for example, the class labels provided by an embodiment of the present invention, or the class identifiers mapped from the class labels), and two random word vectors corresponding to the respective cluster class numbers are obtained according to those numbers. The two random word vectors serve as the first initial word vectors of image 402 and image 403 respectively, so that, based on the word embedding method, the server completes initialization for all texts and images in the long text. The two first initial word vectors and the two second initial word vectors are then input into the language model, which is iteratively trained based on the backpropagation algorithm until the loss function value is less than the target threshold or the number of iterations reaches the target number, triggering the server to perform step 206 below.
206. When the loss function value is less than the target threshold or the number of iterations reaches the target number, the server determines the multiple first word vectors as the semantic features of the multiple images.

Here, the multiple first word vectors are the word vectors obtained after parameter adjustment of the multiple first initial word vectors, and the target threshold may be any value greater than or equal to 0.

It should be noted that, because the multiple first initial word vectors and the multiple second initial word vectors are also parameters of the language model, parameter adjustment may also be performed on each first initial word vector and each second initial word vector during the iterative training. Training stops when the loss function value is less than the target threshold or the number of iterations reaches the target number; at that point, the parameter-adjusted first initial word vectors may be called the multiple first word vectors, and the parameter-adjusted second initial word vectors may be called the multiple second word vectors.
In the above process, as the language processing performance of the language model itself improves, each first initial word vector and each second initial word vector also become more and more accurate, meaning that the multiple first word vectors express the semantics of the images better and better, and the multiple second word vectors express the semantics of the contextual information better and better. Therefore, besides determining the multiple first word vectors as the semantic features of the images when training stops, the multiple second word vectors can also be determined as the semantic features of the contextual information.
In steps 201-206, the server inputs the multiple images and the multiple pieces of contextual information into the language model, and performs feature extraction on the images through the language model and the contextual information to obtain the semantic features of the images. During feature extraction, the server iteratively trains the language model, and because parameter adjustment can be performed on the first initial word vectors and the second initial word vectors during training, the semantic features of the multiple images and of the multiple pieces of contextual information are available when the loss function value is less than the target threshold or the number of iterations reaches the target number. This bridges the barrier between text and images and extracts deep semantic-level features of the images.
The following takes ELMo as an example of the language model. ELMo includes a bidirectional LSTM (long short-term memory) language model, that is, a forward LSTM and a backward LSTM. The server inputs the multiple first initial word vectors and the multiple second initial word vectors into the forward LSTM, which extracts the forward-direction semantic features of the images and the contextual information, and inputs them into the backward LSTM, which extracts the backward-direction semantic features of the images and the contextual information. Further, the maximum likelihood estimate (MLE) of the forward LSTM and the backward LSTM may be used as the loss function value, and training stops when this estimate is less than or equal to the target threshold, yielding semantic features that express the semantics of the multiple images more accurately.
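In the spirit of the ELMo description above — a forward model scoring each token given its left context and a backward model scoring it given its right context — a toy bidirectional log-likelihood can be computed as follows. The uniform probability model is a placeholder standing in for trained LSTMs:

```python
import math

def bidirectional_log_likelihood(tokens, forward_prob, backward_prob):
    """Sum of forward and backward token log-probabilities: the quantity
    that maximum likelihood estimation would maximise during training."""
    fwd = sum(math.log(forward_prob(tokens[:i], tokens[i]))
              for i in range(len(tokens)))
    bwd = sum(math.log(backward_prob(tokens[i + 1:], tokens[i]))
              for i in range(len(tokens)))
    return fwd + bwd

# Placeholder model: every token is equally likely in a 4-word vocabulary.
uniform = lambda context, token: 0.25
ll = bidirectional_log_likelihood(["text", "image", "text"], uniform, uniform)
```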
207. The server stores the semantic features of the multiple images into a database according to the class identifiers of the multiple images.

The class identifier may be the class label, or an identification code mapped from the class label. For example, the class identifier may be the class label "no language"; when the server establishes a mapping between the class label "no language" and the identification code "3055", the class identifier may be "3055".

In the above process, the server may map the class labels of the images to obtain the class identifiers of the images, that is, store them in the database as key-value pairs: the server stores the class identifier of an image as the key name and the semantic feature of the image as the key value, which facilitates subsequent reads of the data in the database.
In some embodiments, when the server does not perform steps 201-203, each image corresponds uniquely to one semantic feature, so the server may also store the semantic features of the images into the database according to the image identifiers of the multiple images, which likewise facilitates subsequent reads of the data in the database.
Fig. 6 is a schematic diagram of database storage provided by an embodiment of the present invention. Referring to Fig. 6, the server stores the semantic feature of each image into the database according to the image identifier. The database illustrated in Fig. 6 is precisely the word-vector library that stores each piece of contextual information; text word vectors and image word vectors are stored in this library, and each text word vector and each image word vector corresponds to a respective ID (identification code). For an image word vector, the ID may be an image ID or a class ID. In some embodiments, the word vectors with IDs 1 to n are text word vectors and the word vectors with IDs n to N are image word vectors, where n is any value greater than or equal to 1 and N is any value greater than n. Of course, if storage is by class identifier, the correspondence between images and class identifiers may also be stored in the database.
208. When an image processing instruction carrying a target image is received, the server obtains the semantic feature of the target image from the database and performs image processing based on the semantic feature of the target image.

The image processing instruction may carry the target image and a processing type, which may be semantic segmentation, image classification, image generation, and so on. Optionally, the image processing instruction may also carry the image identifier of the target image.

Because the server stored the semantic features of the multiple images into the database in step 207, when the target image hits any image in the database, the semantic feature of that image can be read directly from the database and image processing can be performed based on it, greatly saving the time spent processing the image.
When performing image processing based on the database, the server may, upon receiving an image processing instruction, obtain the image carried by the instruction, perform clustering on the image to obtain its class label, obtain the semantic feature (word vector) corresponding to that class label from the database, and perform image processing based on that semantic feature. This saves the time of extracting image features and optimizes the efficiency of image processing.
In some embodiments, if the server stored the semantic features of the multiple images by image identifier and the image processing instruction also carries the image identifier of the target image, the server may determine the semantic feature corresponding to that image identifier in the database as the semantic feature of the target image, so that the semantic feature of the target image is obtained quickly, facilitating the downstream image processing task.

In this determination process, the server may use the image identifier of the target image as an index and retrieve the index content corresponding to that index (that is, the semantic feature of an image) in the database; when the index hits the semantic feature of any image, that semantic feature is determined as the semantic feature of the target image.
In some embodiments, if the server stored the semantic features of the multiple images by class identifier, the server may directly perform clustering on the target image; the specific clustering process is similar to step 202 and is not repeated here. After clustering, the server obtains the class label of the target image, then maps the class label to obtain the class identifier of the target image, and determines the semantic feature corresponding to that class identifier in the database as the semantic feature of the target image, which copes with image processing requirements in broader scenarios. In this determination process, the server may also use the class identifier of the target image as an index and perform steps similar to those above for determining the semantic feature of the target image, which are not repeated here.
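The two lookup paths of step 208 — by image identifier when the instruction carries one, otherwise by clustering the target image to obtain a class identifier — can be sketched as follows. The clustering function here is a stub standing in for the real clustering of step 202, and the database contents are invented:

```python
def lookup_semantic_feature(feature_db, image, image_id=None, cluster=None):
    """Return the stored semantic feature of a target image, preferring a
    direct image-identifier lookup and falling back to the class
    identifier produced by clustering the image."""
    if image_id is not None and image_id in feature_db:
        return feature_db[image_id]
    return feature_db.get(cluster(image))

db = {"img-7": [1.0, 0.0], "3055": [0.1, 0.2]}
by_id = lookup_semantic_feature(db, "target.png", image_id="img-7")
by_class = lookup_semantic_feature(db, "target.png", cluster=lambda img: "3055")
```

Either path avoids re-extracting features for the target image, which is the efficiency gain the text describes.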
In steps 207-208, the server performs image processing based on the semantic features of the multiple images, so that deep semantic-level features of images are extracted through the language model. This allows computer equipment to better understand image semantics and to complete image processing tasks in scenarios with higher demands on image semantics. The image processing tasks are described in detail below by way of several examples, and include but are not limited to the following examples.
In some embodiments, the image processing may be image generation. In this case, the image processing instruction may carry text or an image, so that the server, based on that text or image, obtains from the database the semantic feature with the highest contextual match to the text or image and determines the image corresponding to that semantic feature as the output image. Based on an image generation instruction, the output image with the highest contextual match is thus obtained. In some human-machine dialogue scenarios, the server may send the output image to a terminal, replying to the terminal with an image, which adds interest to the dialogue process and improves the intelligence of the chat robot.
In some embodiments, the image processing may also be image classification. In this case, the image processing instruction may carry multiple images to be processed. The server obtains the semantic features of the images to be processed according to the process in step 208, and may determine images whose semantic features have a similarity greater than a target threshold as belonging to the same class. Image classification is thus realized based on the semantic features of the images, grouping semantically similar images, which can achieve better classification performance than traditional methods that classify by pixel features.
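The classification rule above — images whose semantic features exceed a similarity threshold share a class — can be sketched with cosine similarity. The threshold and feature vectors are illustrative:

```python
import numpy as np

def group_by_similarity(features, threshold=0.9):
    """Assign images to the same class when the cosine similarity of
    their semantic features exceeds the threshold."""
    labels = [-1] * len(features)
    next_label = 0
    for i, f in enumerate(features):
        if labels[i] != -1:
            continue  # already grouped with an earlier image
        labels[i] = next_label
        for j in range(i + 1, len(features)):
            g = features[j]
            sim = np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g))
            if labels[j] == -1 and sim > threshold:
                labels[j] = next_label
        next_label += 1
    return labels

labels = group_by_similarity([np.array([1.0, 0.0]),
                              np.array([0.99, 0.1]),
                              np.array([0.0, 1.0])])
```

This greedy single-pass grouping is only one way to realize the threshold rule; any clustering over the semantic features would serve.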
All the above optional technical solutions can be combined in any manner to form optional embodiments of the present disclosure, which are not described one by one here.
In the method provided by the embodiment of the present invention, multiple pieces of contextual information of multiple images are obtained, the contextual information being at least one of the text information before or after the position of an image in a text scene. After the multiple images and the contextual information are input into the language model, feature extraction can be performed on the images through the language model and the contextual information, under the influence of the contextual information, to obtain the semantic features of the images. Because these semantic features are vector representations of the holistic non-visual features of the images expressed at the semantic level, the server can better understand image semantics. Therefore, when image processing is performed based on the semantic features of the images, image processing tasks in scenarios with higher demands on image semantics can be completed, improving the accuracy of image processing.
Further, clustering is first performed on the images, and word-vector initialization is then performed on the images based on the class labels, so that if some images appear only in a small number of dialogues and their contextual information is therefore not rich enough, the accuracy of the semantic features of these images can be improved through clustering, while also reducing the training duration and the training computation of the language model.

Further, images with the same class label can be assigned the same first initial word vector, so that images with the same class label share one first initial word vector and thus share parameter adjustments during the subsequent iterative training, with the result that images with the same class label eventually have the same semantic feature.
Further, when the multiple images include a first image, text is extracted from the first image, embedding processing is performed on the text to obtain the word vector of at least one word in the text, and the average vector of the word vectors of the at least one word is obtained as the first initial word vector corresponding to the first image; when the multiple images include a second image, a random word vector is obtained as the first initial word vector corresponding to the second image. The first image and the second image are thus each given targeted initialization, which optimizes the processing logic of the word-vector initialization procedure and shortens the training duration. On the other hand, embedding processing is performed on the contextual information and pretrained word vectors are obtained as the multiple second initial word vectors, so that by obtaining pretrained word vectors, the duration required for training is further shortened.
Further, the semantic features of the multiple images are stored into the database according to the image identifiers or class identifiers of the images, so that the trained semantic features of the images are properly stored for downstream image processing tasks to call. When an image processing instruction carrying a target image is received, the semantic feature of the target image is obtained from the database and image processing is performed based on it, greatly improving the efficiency and accuracy of the image processing process.
Fig. 7 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present invention. Referring to Fig. 7, the apparatus includes:

an obtaining module 701, configured to obtain multiple pieces of contextual information of multiple images, the contextual information being at least one of the text information before or after the position of an image in a text scene;

a feature extraction module 702, configured to input the multiple images and the multiple pieces of contextual information into a language model, and perform feature extraction on the multiple images through the language model and the contextual information to obtain the semantic features of the multiple images;

an image processing module 703, configured to perform image processing based on the semantic features of the multiple images.
In the apparatus provided by the embodiment of the present invention, multiple pieces of contextual information of multiple images are obtained, the contextual information being at least one of the text information before or after the position of an image in a text scene. After the multiple images and the contextual information are input into the language model, feature extraction can be performed on the images through the language model and the contextual information, under the influence of the contextual information, to obtain the semantic features of the images. Because these semantic features are vector representations of the holistic non-visual features of the images expressed at the semantic level, the server can better understand image semantics. Therefore, when image processing is performed based on the semantic features of the images, image processing tasks in scenarios with higher demands on image semantics can be completed, improving the accuracy of image processing.
In a possible embodiment, the feature extraction module 702 includes:

an obtaining unit, configured to obtain multiple first initial word vectors, the multiple first initial word vectors corresponding to the multiple images;

the obtaining unit being further configured to obtain multiple second initial word vectors, the multiple second initial word vectors corresponding to the multiple pieces of contextual information;

an iterative training unit, configured to iteratively train the language model based on the multiple first initial word vectors and the multiple second initial word vectors;

an obtaining unit, configured to obtain the semantic features of the multiple images when the loss function value is less than a target threshold or the number of iterations reaches a target number.
In a possible embodiment, the obtaining unit is configured to:

input the multiple images into a pixel feature extraction model, and extract the pixel features of the multiple images through the pixel feature extraction model;

perform clustering on the multiple images according to their pixel features to obtain the class labels of the multiple images;

assign the same random word vector as the first initial word vector to images with the same class label.
In a possible embodiment, the obtaining unit is configured to:

when the multiple images include a first image, extract text from the first image, perform embedding processing on the text to obtain the word vector of at least one word in the text, and obtain the average vector of the word vectors of the at least one word as the first initial word vector corresponding to the first image, the first image being an image carrying text;

when the multiple images include a second image, obtain a random word vector as the first initial word vector corresponding to the second image, the second image being an image carrying no text.
In a possible embodiment, the obtaining unit is configured to:

perform embedding processing on the multiple pieces of contextual information and obtain pretrained word vectors as the multiple second initial word vectors.
In a possible embodiment, the iterative training unit is configured to:

while iteratively training the language model, keep the multiple second initial word vectors fixed and adjust the values of the multiple first initial word vectors to obtain multiple first word vectors;

and obtaining the semantic features of the multiple images when the loss function value is less than the target threshold or the number of iterations reaches the target number includes:

determining the multiple first word vectors as the semantic features of the multiple images when the loss function value is less than the target threshold or the number of iterations reaches the target number.
In a possible embodiment, the image processing module 703 includes:

a storage processing unit, configured to store the semantic features of the multiple images into a database according to the image identifiers or class identifiers of the multiple images, and, when an image processing instruction carrying a target image is received, obtain the semantic feature of the target image from the database and perform image processing based on the semantic feature of the target image.
In a possible embodiment, the storage processing unit is configured to:

when the image processing instruction also carries the image identifier of the target image, determine the semantic feature corresponding to that image identifier in the database as the semantic feature of the target image; or,

perform clustering on the target image to obtain the class identifier of the target image, and determine the semantic feature corresponding to that class identifier in the database as the semantic feature of the target image.
All the above optional technical solutions can be combined in any manner to form optional embodiments of the present disclosure, which are not described one by one here.
It should be understood that the division into the above functional modules is used only as an example when the image processing apparatus provided by the above embodiment processes an image; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the computer equipment may be divided into different functional modules to complete all or part of the functions described above. In addition, the image processing apparatus provided by the above embodiment and the image processing method embodiment belong to the same concept; for the specific implementation process, refer to the image processing method embodiment, which is not repeated here.
Fig. 8 is a schematic structural diagram of computer equipment provided by an embodiment of the present invention. The computer equipment may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPUs) 801 and one or more memories 802, the memory 802 storing at least one instruction that is loaded and executed by the processor 801 to implement the image processing method provided by each of the above image processing method embodiments. Of course, the computer equipment may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may also include other components for realizing the functions of the equipment, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including at least one instruction, the at least one instruction being executable by a processor in a terminal to complete the image processing method in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware, the program being storable in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. An image processing method, characterized in that the method comprises:

obtaining multiple pieces of contextual information of multiple images, the multiple pieces of contextual information being at least one of the text information before or after the position of an image in a text scene;

inputting the multiple images and the multiple pieces of contextual information into a language model, and performing feature extraction on the multiple images through the language model and the multiple pieces of contextual information to obtain semantic features of the multiple images;

performing image processing based on the semantic features of the multiple images.
2. The method according to claim 1, characterized in that performing feature extraction on the multiple images through the language model and the multiple pieces of contextual information to obtain the semantic features of the multiple images comprises:

obtaining multiple first initial word vectors, the multiple first initial word vectors corresponding to the multiple images;

obtaining multiple second initial word vectors, the multiple second initial word vectors corresponding to the multiple pieces of contextual information;

iteratively training the language model based on the multiple first initial word vectors and the multiple second initial word vectors;

obtaining the semantic features of the multiple images when a loss function value is less than a target threshold or the number of iterations reaches a target number.
3. The method according to claim 2, characterized in that obtaining the multiple first initial word vectors comprises:

inputting the multiple images into a pixel feature extraction model, and extracting pixel features of the multiple images through the pixel feature extraction model;

performing clustering on the multiple images according to the pixel features of the multiple images to obtain class labels of the multiple images;

assigning the same random word vector as the first initial word vector to images with the same class label.
4. The method according to claim 2, wherein obtaining the multiple first initial word vectors comprises:
when the multiple images include a first image, extracting text from the first image, performing embedding on the text to obtain a word vector of at least one word in the text, and taking the average of the word vectors of the at least one word as the first initial word vector corresponding to the first image, the first image being an image that carries text; and
when the multiple images include a second image, taking a random word vector as the first initial word vector corresponding to the second image, the second image being an image that does not carry text.
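The two branches of claim 4 amount to: average the embeddings of any text found in the image, otherwise fall back to a random vector. A minimal sketch follows; the `EMBEDDINGS` table and `first_initial_vector` are hypothetical stand-ins for a trained embedding layer and an OCR front end, neither of which the claim specifies.

```python
import random

# Hypothetical embedding table; a real system would use a trained embedding layer.
EMBEDDINGS = {"cat": [1.0, 0.0], "sits": [0.0, 1.0], "here": [1.0, 1.0]}

def first_initial_vector(ocr_text, dim=2, seed=0):
    """Average the word vectors of text extracted from the image (first image);
    fall back to a random vector when the image carries no text (second image)."""
    words = [w for w in (ocr_text or "").split() if w in EMBEDDINGS]
    if words:  # first image: carries text, average its word vectors
        vecs = [EMBEDDINGS[w] for w in words]
        return [sum(col) / len(col) for col in zip(*vecs)]
    # second image: no text, use a random initial word vector
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(dim)]
```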
5. The method according to claim 2, wherein iteratively training the language model based on the multiple first initial word vectors and the multiple second initial word vectors comprises:
during the iterative training of the language model, keeping the multiple second initial word vectors unchanged while adjusting the values of the multiple first initial word vectors to obtain multiple first word vectors; and
wherein obtaining the semantic features of the multiple images when the loss function value is less than the target threshold or the number of iterations reaches the target number comprises:
determining the multiple first word vectors as the semantic features of the multiple images when the loss function value is less than the target threshold or the number of iterations reaches the target number.
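The training regime of claim 5, in which the context (second) vectors stay frozen and only the image (first) vectors are updated until the loss drops below a threshold or an iteration budget is exhausted, can be illustrated with a toy objective. The squared-error loss and every name below are invented for the sketch; the patent does not specify the language model's actual loss.

```python
def train_image_vectors(first_vecs, second_vecs, target_vecs,
                        threshold=1e-3, max_iters=500, lr=0.1):
    """Iterate: keep second_vecs fixed, adjust only first_vecs by gradient
    descent on a toy objective (first + second should approximate target).
    Stop when the loss falls below threshold or max_iters is reached."""
    iters = 0
    while iters < max_iters:
        loss = 0.0
        for i, (fv, sv, tv) in enumerate(zip(first_vecs, second_vecs,
                                             target_vecs)):
            # Gradient of the toy squared error; second vectors never change.
            grad = [2 * (f + s - t) for f, s, t in zip(fv, sv, tv)]
            loss += sum((f + s - t) ** 2 for f, s, t in zip(fv, sv, tv))
            first_vecs[i] = [f - lr * g for f, g in zip(fv, grad)]
        iters += 1
        if loss < threshold:  # loss below the target threshold: stop early
            break
    # The adjusted first vectors serve as the images' semantic features.
    return first_vecs, iters
```

Freezing the context vectors forces all of the fitting pressure onto the image vectors, which is what lets them absorb a semantic representation of the images.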
6. The method according to claim 1, wherein performing image processing based on the semantic features of the multiple images comprises:
storing the semantic features of the multiple images in a database according to image identifiers or class identifiers of the multiple images; and
when an image processing instruction carrying a target image is received, obtaining the semantic feature of the target image from the database, and performing image processing based on the semantic feature of the target image.
7. The method according to claim 6, wherein obtaining the semantic feature of the target image from the database comprises:
when the image processing instruction also carries an image identifier of the target image, determining the semantic feature corresponding to the image identifier in the database as the semantic feature of the target image; or
clustering the target image to obtain a class identifier of the target image, and determining the semantic feature corresponding to the class identifier in the database as the semantic feature of the target image.
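The two retrieval paths of claim 7 reduce to a keyed lookup with a clustering fallback. A minimal sketch, assuming a dict-backed database and a hypothetical `classify` callable standing in for the clustering step (none of these names come from the patent):

```python
def lookup_semantic_feature(db, image_id=None, classify=None, target=None):
    """Prefer the image-identifier key; otherwise derive a class identifier
    for the target image via `classify` and look that up instead."""
    if image_id is not None and image_id in db["by_image"]:
        # First branch: the instruction carried an image identifier.
        return db["by_image"][image_id]
    # Second branch: cluster the target image to obtain its class identifier.
    class_id = classify(target)
    return db["by_class"][class_id]
```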
8. An image processing apparatus, wherein the apparatus comprises:
an obtaining module, configured to obtain multiple pieces of context information of multiple images, the multiple pieces of context information being at least one of the text information before or after the position of each image in a text scene;
a feature extraction module, configured to input the multiple images and the multiple pieces of context information into a language model, and to perform feature extraction on the multiple images through the language model and the multiple pieces of context information to obtain semantic features of the multiple images; and
an image processing module, configured to perform image processing based on the semantic features of the multiple images.
9. A computer device, wherein the computer device comprises one or more processors and one or more memories, the one or more memories storing at least one instruction that is loaded and executed by the one or more processors to implement the operations performed by the image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the storage medium stores at least one instruction that is loaded and executed by a processor to implement the operations performed by the image processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910360905.2A CN110163121B (en) | 2019-04-30 | 2019-04-30 | Image processing method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110163121A true CN110163121A (en) | 2019-08-23 |
CN110163121B CN110163121B (en) | 2023-09-05 |
Family
ID=67633043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910360905.2A Active CN110163121B (en) | 2019-04-30 | 2019-04-30 | Image processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110163121B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106462574A (en) * | 2014-06-24 | 2017-02-22 | 谷歌公司 | Techniques for machine language translation of text from an image based on non-textual context information from the image |
US20170147910A1 (en) * | 2015-10-02 | 2017-05-25 | Baidu Usa Llc | Systems and methods for fast novel visual concept learning from sentence descriptions of images |
CN107423277A (en) * | 2016-02-16 | 2017-12-01 | 中兴通讯股份有限公司 | A kind of expression input method, device and terminal |
WO2018049960A1 (en) * | 2016-09-14 | 2018-03-22 | 厦门幻世网络科技有限公司 | Method and apparatus for matching resource for text information |
CN109034203A (en) * | 2018-06-29 | 2018-12-18 | 北京百度网讯科技有限公司 | Training, expression recommended method, device, equipment and the medium of expression recommended models |
CN109255047A (en) * | 2018-07-18 | 2019-01-22 | 西安电子科技大学 | Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve |
US20190050639A1 (en) * | 2017-08-09 | 2019-02-14 | Open Text Sa Ulc | Systems and methods for generating and using semantic images in deep learning for classification and data extraction |
CN109447990A (en) * | 2018-10-22 | 2019-03-08 | 北京旷视科技有限公司 | Image, semantic dividing method, device, electronic equipment and computer-readable medium |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866908A (en) * | 2019-11-12 | 2020-03-06 | 腾讯科技(深圳)有限公司 | Image processing method, image processing apparatus, server, and storage medium |
CN111080660A (en) * | 2019-11-14 | 2020-04-28 | 中国科学院深圳先进技术研究院 | Image segmentation method and device, terminal equipment and storage medium |
CN111080660B (en) * | 2019-11-14 | 2023-08-08 | 中国科学院深圳先进技术研究院 | Image segmentation method, device, terminal equipment and storage medium |
CN111385188A (en) * | 2019-11-22 | 2020-07-07 | 百度在线网络技术(北京)有限公司 | Recommendation method and device for dialog elements, electronic equipment and medium |
CN111666439A (en) * | 2020-05-28 | 2020-09-15 | 重庆渝抗医药科技有限公司 | Working method for rapidly extracting and dividing medical image big data aiming at cloud environment |
CN111783557A (en) * | 2020-06-11 | 2020-10-16 | 北京科技大学 | Wearable blind guiding equipment based on depth vision and server |
CN111783557B (en) * | 2020-06-11 | 2023-08-15 | 北京科技大学 | Wearable blind guiding equipment based on depth vision and server |
CN112364200A (en) * | 2021-01-15 | 2021-02-12 | 清华大学 | Brain-like imaging method, device, equipment and storage medium |
CN112364200B (en) * | 2021-01-15 | 2021-04-13 | 清华大学 | Brain-like imaging method, device, equipment and storage medium |
CN112861934A (en) * | 2021-01-25 | 2021-05-28 | 深圳市优必选科技股份有限公司 | Image classification method and device of embedded terminal and embedded terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110163121A (en) | Image processing method, device, computer equipment and storage medium | |
CN108304882B (en) | Image classification method and device, server, user terminal and storage medium | |
CN114357973B (en) | Intention recognition method and device, electronic equipment and storage medium | |
WO2024098533A1 (en) | Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium | |
CN108419094A (en) | Method for processing video frequency, video retrieval method, device, medium and server | |
CN109857846B (en) | Method and device for matching user question and knowledge point | |
WO2018196718A1 (en) | Image disambiguation method and device, storage medium, and electronic device | |
CN113298197B (en) | Data clustering method, device, equipment and readable storage medium | |
CN111506709B (en) | Entity linking method and device, electronic equipment and storage medium | |
CN111666400B (en) | Message acquisition method, device, computer equipment and storage medium | |
CN112183083A (en) | Abstract automatic generation method and device, electronic equipment and storage medium | |
CN114328988A (en) | Multimedia data feature extraction method, multimedia data retrieval method and device | |
CN117114063A (en) | Method for training a generative large language model and for processing image tasks | |
CN111368066B (en) | Method, apparatus and computer readable storage medium for obtaining dialogue abstract | |
CN113254575B (en) | Machine reading understanding method and system based on multi-step evidence reasoning | |
CN114266252A (en) | Named entity recognition method, device, equipment and storage medium | |
CN113887169A (en) | Text processing method, electronic device, computer storage medium, and program product | |
CN113704534A (en) | Image processing method and device and computer equipment | |
CN116883740A (en) | Similar picture identification method, device, electronic equipment and storage medium | |
CN110674716A (en) | Image recognition method, device and storage medium | |
CN114490974A (en) | Automatic information reply method, device, system, electronic equipment and readable medium | |
CN114329005A (en) | Information processing method, information processing device, computer equipment and storage medium | |
CN113869068A (en) | Scene service recommendation method, device, equipment and storage medium | |
Zhang et al. | Learning cross-modal aligned representation with graph embedding | |
Wang et al. | Capsule network based on multi-granularity attention model for text classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||