CN109002852A - Image processing method, device, computer readable storage medium and computer equipment - Google Patents
- Publication number
- CN109002852A CN109002852A CN201810758796.5A CN201810758796A CN109002852A CN 109002852 A CN109002852 A CN 109002852A CN 201810758796 A CN201810758796 A CN 201810758796A CN 109002852 A CN109002852 A CN 109002852A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- feature
- input image
- attention weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
This application relates to an image processing method, apparatus, computer-readable storage medium, and computer device. The method comprises: obtaining an input image; extracting image features of the input image with a first model; determining, by the first model and according to the image features, a class label text corresponding to the input image; performing cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and processing the composite feature with a second model to output an image description text for the input image. The scheme provided by this application can improve the accuracy of image understanding information.
Description
Technical field
This application relates to the field of computer technology, and in particular to an image processing method, apparatus, computer-readable storage medium, and computer device.
Background technique
With the development of computer technology, using computer equipment to handle various tasks or to interact with people has become increasingly common. For example, computer equipment can help users understand images, which is especially valuable for children, the elderly, the visually impaired, and people with language comprehension difficulties.
Traditional image understanding methods usually extract the image features of an image and feed the image features, together with preset text, into an encoder, then decode with a decoder to obtain image understanding information. However, in such encoder-decoder structures, the guidance provided by the image features gradually fades as processing time increases, so the resulting image understanding is not accurate enough.
Summary of the invention
Based on this, in view of the technical problem that traditional image understanding schemes are not accurate enough, it is necessary to provide an image processing method, apparatus, computer-readable storage medium, and computer device.
An image processing method, comprising:
obtaining an input image;
extracting image features of the input image with a first model;
determining, by the first model and according to the image features, a class label text corresponding to the input image;
performing cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and
processing the composite feature with a second model to output an image description text for the input image.
An image processing apparatus, the apparatus comprising:
an obtaining module, configured to obtain an input image;
an extraction module, configured to extract image features of the input image with a first model;
a determining module, configured to determine, by the first model and according to the image features, a class label text corresponding to the input image;
a fusion module, configured to perform cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and
an output module, configured to process the composite feature with a second model and output an image description text for the input image.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
With the above image processing method, apparatus, computer-readable storage medium, and computer device, the first model extracts the image features of the input image and determines the class label text corresponding to the input image, so the image features and corresponding class label text can be obtained quickly and accurately. Cross-modal fusion of the image features and the corresponding class label text yields a composite feature, which the second model then processes to obtain an image description text. In this way, during processing the second model can make full use of the image features of the input image itself while also incorporating the class information of the input image. Because the features of the input image are used so thoroughly, image understanding is guided jointly by the image features and the class label text, which substantially improves the accuracy of image understanding information and improves the computer device's ability to understand images.
An image processing method, comprising:
obtaining an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting text features of the question text;
performing attention distribution processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification processing according to the weighted image features to obtain an answer text corresponding to the question text.
An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image and a question text corresponding to the input image;
an extraction module, configured to extract image features of the input image;
the extraction module being further configured to extract text features of the question text;
an attention distribution module, configured to perform attention distribution processing on the image features according to the text features to obtain attention weights;
a determining module, configured to determine weighted image features according to the image features and the attention weights; and
a classification module, configured to perform classification processing according to the weighted image features to obtain an answer text corresponding to the question text.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
With the above image processing method, apparatus, computer-readable storage medium, and computer device, image features of the input image are extracted, text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features to obtain attention weights, and weighted image features are determined according to the image features and the attention weights. Classification processing is then performed according to the weighted image features to output the answer text corresponding to the question text. In this way, attention distribution processing guided by the text features of the question focuses processing on the image features relevant to the question text, and classifying the weighted image features significantly improves the accuracy of the answer text. That is, the accuracy of image understanding information is substantially improved, and the computer device's ability to understand images is improved.
Detailed description of the invention
Fig. 1 is a diagram of the application environment of the image processing method in one embodiment;
Fig. 2 is the flow diagram of image processing method in one embodiment;
Fig. 3 is the schematic diagram of input picture in one embodiment;
Fig. 4 is a flow diagram of the step of performing cross-modal fusion of image features and corresponding class label text to obtain a composite feature in one embodiment;
Fig. 5 is a flow diagram of the steps of image question answering in one embodiment;
Fig. 6 is the flow diagram of image processing method in another embodiment;
Fig. 7 is the flow diagram of image processing method in another embodiment;
Fig. 8 is the flow diagram of image processing method in one embodiment;
Fig. 9 is a flow diagram of the step of extracting text features of the question text in one embodiment;
Figure 10 is the flow diagram of image processing method in another embodiment;
Figure 11 is the flow diagram of image processing method in another embodiment;
Figure 12 is the structural block diagram of image processing apparatus in one embodiment;
Figure 13 is the structural block diagram of image processing apparatus in another embodiment;
Figure 14 is the structural block diagram of image processing apparatus in one embodiment;
Figure 15 is the structural block diagram of computer equipment in one embodiment;
Figure 16 is the structural block diagram of computer equipment in another embodiment.
Specific embodiment
To make the objects, technical solutions, and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the application and are not intended to limit it.
Fig. 1 is a diagram of the application environment of the image processing method in one embodiment. Referring to Fig. 1, the image processing method is applied in an image processing system. The image processing system includes a terminal 110 and a server 120. The image processing method can be completed in the terminal 110 or in the server 120: the terminal 110 may directly acquire the input image and execute the method on the terminal side, or, after obtaining the input image, the terminal 110 may send it to the server so that the server obtains the input image and executes the method. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a laptop, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.
As shown in Fig. 2, in one embodiment, an image processing method is provided. This embodiment is mainly illustrated by applying the method to a computer device in Fig. 1 above, such as the terminal 110 or the server 120. Referring to Fig. 2, the image processing method specifically comprises the following steps:
S202: obtain an input image.
Specifically, the computer device can use a local image as the input image, or obtain the input image from another computer device through a communication channel such as a network connection or a USB (Universal Serial Bus) interface connection.
In one embodiment, the terminal can capture an image in the current field of view of a camera and use the captured image as the input image. Alternatively, the terminal can present an image display interface to the user; the user can make a selection in that interface, and the terminal uses the selected image as the input image. The images shown in the display interface may be images stored locally on the terminal, or images the terminal obtains from a server over a network connection.
In one embodiment, after obtaining the input image, the terminal can execute the image processing method locally. Alternatively, the terminal can send the input image to a server so that the server obtains the input image and executes the image processing method.
S204: extract image features of the input image with the first model.
Here, a model is a model composed of an artificial neural network. Artificial neural networks (ANNs), also called neural networks (NNs) or connection models, abstract the neuronal network of the human brain from an information-processing perspective to build a model, and form different networks with different connection patterns. In engineering and academia they are often simply called neural networks.
Neural network models include, for example, CNN (Convolutional Neural Network) models, DNN (Deep Neural Network) models, and RNN (Recurrent Neural Network) models.
A convolutional neural network includes convolutional layers and pooling layers. There are many convolutional neural network models, such as the VGG (Visual Geometry Group) network model, the GoogLeNet model, and ResNet (residual network) models. A deep neural network includes an input layer, hidden layers, and an output layer, with full connections between layers. A recurrent neural network is a neural network for modeling sequence data: the current output of a sequence also depends on the outputs that came before. Concretely, the network memorizes earlier information and applies it to the computation of the current output; the nodes within a hidden layer are no longer unconnected, and the input of a hidden layer includes not only the output of the input layer but also the hidden layer's own output from the previous time step. Recurrent neural network models include, for example, the LSTM (Long Short-Term Memory) model.
Image features are features representing properties of an image such as color, texture, shape, or spatial relationships. In this embodiment, image features can specifically be data that the computer device extracts from the input image to represent its color, texture, shape, or spatial relationships, yielding a "non-image" representation or description of the image, such as numerical values, vectors, or symbols.
In this embodiment, the first model can specifically be a convolutional neural network model, such as ResNet-80. The computer device can input the input image into the first model and extract the image features of the input image with the first model. For example, the computer device can input the input image into a convolutional neural network model, perform convolution processing on the input image with the convolutional layers of the network, and extract the image features of the input image. That is, after the convolutional layers perform convolution processing on the input image, the convolutional neural network obtains the feature map of the input image; the feature map here is the image feature in this embodiment.
In one embodiment, the first model is a model for classifying the input image, obtained by learning and training with the images and corresponding class labels in an image library (such as ImageNet) as training data. After obtaining the input image, the computer device inputs it into the first model, extracts the image features of the input image with the convolutional-layer structure of the first model, and determines the class label text corresponding to the input image with the pooling-layer structure and/or fully-connected-layer structure of the first model.
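The patent does not fix an implementation for the convolution-and-pooling step; as a minimal sketch, assuming NumPy arrays stand in for the input image and a single hypothetical learned filter stands in for a convolutional layer, the "convolve then pool to get a feature map" flow of S204 can be illustrated as follows (in practice a pretrained CNN such as ResNet would be used, and all names here are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)    # toy "input image"
edge_kernel = np.array([[1., -1.], [1., -1.]])      # hypothetical learned filter
feature_map = max_pool(conv2d(image, edge_kernel))  # conv -> pool, as in S204
```

The resulting `feature_map` plays the role of the image feature passed on to classification and fusion.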
S206: determine, by the first model and according to the image features, the class label text corresponding to the input image.
Here, the class label text is the label text of the class to which the input image belongs. Specifically, the computer device can extract image features with the first model, then perform subsequent classification processing on the extracted features to obtain the class of the input image, and thereby determine the class label text corresponding to the input image.
In one embodiment, the first model can specifically be a convolutional neural network model. The computer device can input the input image into the convolutional neural network model to extract its image features, then process the image features through the pooling layers and fully connected layers to obtain probability values for the classes to which the input image may belong, and take the class label corresponding to the maximum probability value as the class label corresponding to the input image.
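As a minimal sketch of the "fully connected layer, probabilities, maximum" step, assuming a pooled feature vector, a hypothetical label set, and illustrative fully-connected-layer weights (none of which are specified by the patent):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

labels = ["dog", "person", "house"]          # hypothetical label set
pooled = np.array([0.9, 0.1, 0.4])           # globally pooled image feature
W = np.array([[2.0, 0.1, 0.3],               # illustrative FC-layer weights
              [0.2, 1.5, 0.1],
              [0.1, 0.2, 1.8]])
b = np.zeros(3)

probs = softmax(W @ pooled + b)              # probability per class
class_label_text = labels[int(np.argmax(probs))]  # label of the max probability
```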
In one embodiment, the computer device can process the input image with a multi-task convolutional neural network to obtain multiple class label texts corresponding to the input image. A multi-task convolutional neural network is a convolutional neural network capable of multi-task learning. Its structure differs slightly from that of a single-task convolutional neural network: a single-task network, i.e., an independent neural network, implements a single input-to-output function, whereas a multi-task network can produce multiple outputs for one input, each output corresponding to one task. It can be understood that these outputs can connect to all neurons of the hidden layers they share; in these shared hidden layers, features useful for one task can also be exploited by other tasks, prompting multiple tasks to learn jointly, so that features learned by one network can help the learning of another.
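The shared-trunk, multiple-heads structure described above can be sketched as follows, assuming one shared hidden layer and two task-specific output heads with purely illustrative shapes and random weights (the patent specifies neither):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# One shared hidden layer, two task-specific heads (shapes are illustrative).
W_shared = rng.standard_normal((8, 16))
W_task_a = rng.standard_normal((16, 3))   # head 1: e.g. object class labels
W_task_b = rng.standard_normal((16, 5))   # head 2: e.g. scene class labels

x = rng.standard_normal(8)                # toy image feature vector
hidden = relu(x @ W_shared)               # representation shared by both tasks
out_a, out_b = hidden @ W_task_a, hidden @ W_task_b  # one output per task
```

Both heads read the same `hidden` activations, which is where features learned for one task become available to the other.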
S208: perform cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature.
Here, cross-modal fusion is the fusion of data of different modalities. In this embodiment, the data of different modalities specifically refer to the image features corresponding to the input image and the text data corresponding to the class label text. Specifically, the computer device can map the extracted image features and the corresponding class label text into the same space, and then fuse the mapped data to obtain the composite feature.
In one embodiment, the image features of the input image are extracted with the first model, and the computer device can extract text features of the class label text with a recurrent neural network. Both image features and text features can be represented as vectors. Before fusing them, the computer device can convert the image features and text features into a canonical form so that both feature vectors fall in the same range; for example, the image features and the text features can each be normalized. Common normalization algorithms include function methods and probability-density methods. Function methods include, for example, the max-min function, the mean-variance function (which normalizes features to a consistent interval, for example with mean 0 and variance 1), and the hyperbolic sigmoid (S-shaped growth curve) function.
Further, the computer device can perform a fusion operation on the normalized image features and the text features corresponding to the class label text to obtain the composite feature. The algorithm for fusing image features and text features can specifically be an algorithm based on Bayesian decision theory, on sparse representation theory, or on deep learning theory. Alternatively, the computer device can fuse the image features and text features by computing a weighted sum of the two normalized vectors to obtain the composite feature.
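The simplest variant above, mean-variance normalization followed by a weighted sum, can be sketched as follows; the feature values and the fusion weight `alpha` are illustrative assumptions, not values from the patent:

```python
import numpy as np

def mean_variance_normalize(v, eps=1e-8):
    """Normalize to zero mean and unit variance (the mean-variance function)."""
    return (v - v.mean()) / (v.std() + eps)

image_feature = np.array([10.0, 22.0, 5.0, 13.0])   # toy image feature vector
text_feature = np.array([0.2, 0.9, 0.4, 0.1])       # toy class-label text feature

img_n = mean_variance_normalize(image_feature)
txt_n = mean_variance_normalize(text_feature)

alpha = 0.6                                          # hypothetical fusion weight
composite = alpha * img_n + (1 - alpha) * txt_n      # weighted-sum fusion
```

Normalizing first keeps one modality from dominating the sum simply because its raw values are larger.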
In one embodiment, the computer device can extract text features of the class label text with a recurrent neural network, perform attention distribution processing, i.e., attention processing, on the image features and text features to obtain attention distribution weights, i.e., attention values, and then combine the attention values with the features to obtain the composite feature.
Attention processing can be understood as selectively filtering a small amount of important information out of a large amount of information, focusing on that important information, and ignoring the mostly unimportant rest. The focusing is embodied in the computation of the attention distribution weights: the larger a weight, the more its corresponding image feature is focused on.
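A minimal sketch of this attention distribution, assuming per-region image feature vectors and a text feature of matching width (all values illustrative): relevance scores come from dot products, a softmax turns them into weights, and the weighted combination focuses on the region most related to the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One image feature vector per region, plus a text feature of the same width.
region_features = np.array([[1.0, 0.0, 0.2],
                            [0.1, 0.9, 0.3],
                            [0.4, 0.4, 0.4]])
text_feature = np.array([0.0, 1.0, 0.5])

scores = region_features @ text_feature        # one relevance score per region
attention_weights = softmax(scores)            # attention distribution weights
attended = attention_weights @ region_features # attention-weighted image feature
```

The second region, which aligns best with the text feature, receives the largest weight and so dominates `attended`.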
S210: process the composite feature with the second model to output the image description text of the input image.
Here, the image description text is text describing the input image, such as text identifying the objects in the input image or expressing the relationships between objects; the image description text can specifically be a word, a complete sentence, or a paragraph. The second model can specifically be a recurrent neural network model, for example an LSTM (Long Short-Term Memory) model.
Specifically, the computer device can input the composite feature into the second model, which processes the composite feature to output the image description text of the input image.
In one embodiment, step S210 can specifically include the following steps: obtaining a pre-description text corresponding to the input image; sequentially inputting the composite feature and each word vector of the pre-description text into the second model; and processing the sequentially input composite feature and word vectors with the second model to output the image description text of the input image.
Here, the pre-description text is text that describes the input image in advance; it can specifically be an initial, rougher description obtained after a first understanding of the input image. The pre-description text can be in the same language family as the image description text, or in a different one. For example, the pre-description text can describe the input image in Chinese while the image description text describes it in English.
In one embodiment, the computer device can obtain the pre-description text corresponding to the input image and obtain each word vector of the pre-description text. Using an encoder-decoder scheme, the computer device can use the composite feature as the input at the first time step and each word vector as the input at subsequent time steps, process the sequentially input composite feature and word vectors with the second model, and output the image description text. In this way, the second model can combine the composite feature with the pre-description text, so that the output image description text fits the input image more closely, which substantially improves the accuracy of image understanding information.
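The "composite feature first, word vectors after" decoding order can be sketched with a toy recurrent cell; a real implementation would use an LSTM, a trained vocabulary, and learned embeddings, so every weight, dimension, and embedding below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, vocab = 6, 4
W_h = rng.standard_normal((dim, dim)) * 0.1
W_x = rng.standard_normal((dim, dim)) * 0.1
W_out = rng.standard_normal((dim, vocab)) * 0.1
word_vectors = rng.standard_normal((vocab, dim))   # hypothetical embeddings

def step(h, x):
    """One toy recurrent step (an LSTM cell would be used in practice)."""
    return np.tanh(h @ W_h + x @ W_x)

composite_feature = rng.standard_normal(dim)
h = np.zeros(dim)
h = step(h, composite_feature)        # first time step: the fused feature
generated = []
x = word_vectors[0]                   # first word of the pre-description text
for _ in range(3):                    # subsequent steps: word vectors
    h = step(h, x)
    word_id = int(np.argmax(h @ W_out))
    generated.append(word_id)
    x = word_vectors[word_id]
```

Seeding the hidden state with the composite feature is what lets every generated word stay conditioned on the image and its class label text.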
With the above image processing method, the first model extracts the image features of the input image and determines the class label text corresponding to the input image, so the image features and corresponding class label text can be obtained quickly and accurately. Cross-modal fusion of the image features and the corresponding class label text yields a composite feature, which the second model processes to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself while incorporating the class information of the input image. Because the features of the input image are used so thoroughly, image understanding is guided jointly by the image features and the class label text, which substantially improves the accuracy of image understanding information and improves the computer device's ability to understand images.
In one embodiment, the step of extracting image features of the input image with the first model includes: determining, with the first model, multiple mutually different candidate regions in the input image; and extracting, with the first model, the image features of each candidate region separately.
Specifically, the computer device can process the input image with the first model to determine multiple targets in the input image, and determine multiple mutually different candidate regions (region proposals) in the input image according to the corresponding targets. The candidate regions differ from one another; they may partly overlap or not overlap at all, where overlap between candidate regions means that different candidate regions contain the same pixels. The computer device can extract the image features of each candidate region separately with the first model.
There are many algorithms for dividing the input image into candidate regions, for example the sliding-window method, Selective Search for Object Recognition, or the SSD (Single Shot MultiBox Detector) algorithm.
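Of the options listed, the sliding-window method is the simplest to sketch; assuming square windows and a fixed stride (both illustrative, and far cruder than Selective Search or SSD), candidate regions can be enumerated as follows:

```python
def sliding_window_regions(width, height, win=64, stride=32):
    """Enumerate (x, y, w, h) candidate regions; neighbours partly overlap."""
    regions = []
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            regions.append((x, y, win, win))
    return regions

candidates = sliding_window_regions(128, 128)  # 3 x 3 grid of windows
```

Because the stride is half the window size, adjacent candidates share pixels, matching the "may partly overlap" case described above.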
In one embodiment, the computer device can determine, by the first model and according to the image features corresponding to each candidate region, the class label text corresponding to each candidate region. For example, referring to Fig. 3, Fig. 3 shows a schematic diagram of an input image in one embodiment. As shown in Fig. 3, the input image includes a house, a brook, a dog, and a person; the brook is in front of the house, and the dog is by the brook, to the left of the person in front of the house. When the input image is input into the first model, the first model can determine multiple candidate regions, such as the regions A-D enclosed by dotted boxes in Fig. 3. Correspondingly, the first model can extract the image features of each candidate region separately and determine the class label text corresponding to each candidate region: for example, the class label text corresponding to candidate region A is "house", that corresponding to candidate region B is "person", that corresponding to candidate region C is "brook", and that corresponding to candidate region D is "dog".
In the above embodiment, multiple mutually different candidate regions in the input image are determined with the first model, and the image features of each candidate region are extracted separately, so as to determine multiple class label texts corresponding to the input image.
In one embodiment, step S210, i.e., the step of processing the composite feature with the second model and outputting the image description text of the input image, specifically includes: splicing the composite features corresponding to the candidate regions to obtain a splice feature; and processing the splice feature with the second model to output the image description text of the input image.
Specifically, the computer device can perform cross-modal fusion of the image features and class label text corresponding to each candidate region to obtain the composite feature of each candidate region. The computer device can then splice the composite features of the candidate regions to obtain the splice feature, process the splice feature with the second model, and output the image description text of the input image.
In one embodiment, after the computer device determines the mutually different candidate regions in the input image, it can select the candidate regions that satisfy a preset condition as target candidate regions, extract the image features of the target candidate regions, determine the class label text corresponding to each target candidate region, and perform cross-modal fusion of the image features and class label text corresponding to each target candidate region to obtain multiple composite features.
The preset condition can be, for example, that the ratio of the area of a candidate region to the area of the input image meets a preset ratio, or that the region is among those with the largest such ratios, such as the top three. The preset condition can also be, for example, the targets found to be most popular through network-model learning over big data, selecting a preset number of candidate regions containing the corresponding targets.
In the above embodiment, the composite features corresponding to the candidate regions are spliced to obtain the splice feature, and the image description text is output according to the splice feature, so the image information is used more fully and the image features and class label text are effectively combined, which substantially improves the accuracy of image understanding information.
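The splice operation itself is a plain concatenation of per-region composite features into one vector for the second model; the region count and feature length below are illustrative:

```python
import numpy as np

# One composite (fused) feature per candidate region (lengths illustrative).
composite_features = [np.array([0.1, 0.2]),
                      np.array([0.3, 0.4]),
                      np.array([0.5, 0.6])]

splice_feature = np.concatenate(composite_features)  # input to the second model
```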
In one embodiment, step S208, i.e., the step of performing cross-modal fusion of the image features and the corresponding class label text to obtain the composite feature, specifically includes the following steps:
S402: determine the coded data corresponding to the class label text.
Here, the coded data is the data obtained by encoding the class label text; the coded data can represent the class label text in this embodiment. Common encoding schemes include unipolar codes, polar codes, bipolar codes, return-to-zero codes, biphase codes, non-return-to-zero codes, Manchester encoding, differential Manchester encoding, and multilevel encoding.
In one embodiment, a mapping between class label texts and coded data may be preset in the computer device, and the coded data corresponding to a class label text is determined according to this mapping. For example, the class label text "dog" may be preset to correspond to the coded data "0001", "person" to "0002", "mountain" to "0003", "house" to "0101", and so on. When the computer device determines that the class label corresponding to an image feature is "dog", it can then determine the corresponding coded data "0001".
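The preset mapping can be sketched as a simple lookup table; the label-to-code pairs below mirror the examples in the text, while the function name is an invented placeholder.

```python
# Preset mapping between class label texts and coded data, following the
# example pairs given in the text.
LABEL_TO_CODE = {
    "dog": "0001",
    "person": "0002",
    "mountain": "0003",
    "house": "0101",
}

def encode_label(label_text):
    """Look up the coded data for a class label text."""
    return LABEL_TO_CODE[label_text]
```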
In one embodiment, the computer device may extract the text feature of the class label text through a recurrent neural network, and use the text feature as the coded data corresponding to the class label text.
S404: perform attention allocation processing on the image feature according to the coded data to obtain attention weights.
In one embodiment, the computer device may perform attention allocation processing on the image feature according to the coded data to obtain the attention weights.
In one embodiment, the computer device may map the coded data and the image feature into standard vectors in the same space according to a preset standardization rule, perform a dot product operation on the standard vectors corresponding to the coded data and the image feature to obtain an intermediate result, and then successively apply pooling processing (for example, sum pooling) and regression processing (for example, softmax) to the intermediate result to obtain the attention weights.
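The dot product, sum pooling, and softmax sequence described above can be sketched in plain Python; the vectors below are illustrative, and the coded data is assumed to have already been mapped into the same space as the per-region image features by the standardization rule.

```python
import math

def attention_weights(code_vec, image_vecs):
    """Dot-product the code vector with each region's feature vector,
    sum-pool the products, then softmax the pooled scores."""
    scores = []
    for region in image_vecs:
        products = [c * r for c, r in zip(code_vec, region)]
        scores.append(sum(products))  # sum pooling over the products
    # softmax regression over the pooled scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

code = [1.0, 0.0]                      # standardized coded-data vector
regions = [[2.0, 1.0], [0.0, 3.0]]     # standardized per-region features
weights = attention_weights(code, regions)
```

The region whose feature aligns best with the coded data receives the larger weight.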
S406: compute the comprehensive feature according to the attention weights and the image feature.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted comprehensive feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding class label text to obtain the comprehensive feature. The image feature and the corresponding class label text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the comprehensive feature. In the resulting comprehensive feature, the places the attention model focuses on more carry larger weights.
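Combining the attention weights with the image features can be sketched as a weighted sum over region features; the dimensions and weights below are illustrative.

```python
def weighted_fusion(weights, image_vecs):
    """Combine attention weights with region features: the comprehensive
    feature is the weight-scaled sum of the region feature vectors."""
    dim = len(image_vecs[0])
    fused = [0.0] * dim
    for w, region in zip(weights, image_vecs):
        for i in range(dim):
            fused[i] += w * region[i]
    return fused

regions = [[2.0, 0.0], [0.0, 2.0]]
comprehensive = weighted_fusion([0.75, 0.25], regions)
```

Elements from the more heavily weighted region dominate the fused result, which matches the statement that more important elements carry larger weights.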
In the above embodiment, attention allocation processing is performed on the image feature according to the corresponding coded data to obtain attention weights, and the attention weights are combined with the image feature to obtain the comprehensive feature, so that more important elements in the comprehensive feature carry larger weights. Image processing can thus focus on the target elements, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, the image processing method further includes: extracting the text content in the input image through the first model. The step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature then specifically includes: performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature to obtain the comprehensive feature.
Specifically, the input image contains text content. The computer device may use a multi-instance learning (Multiple Instance Learning) method to extract semantically meaningful text content from the input image, and perform cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature to obtain the comprehensive feature.
In one embodiment, the computer device determines multiple mutually different candidate regions in the input image through the first model. When the computer device extracts semantically meaningful text content from the input image, the text content can be associated with the corresponding candidate region. Accordingly, the computer device may perform cross-modal fusion on the image feature, text content, and class label text corresponding to each candidate region to obtain the comprehensive features.
In the above embodiment, by extracting the text content in the input image and performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, the features of the input image can be mined more fully and finely, making the image description text more accurate, further improving the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, the image processing method further includes steps for image question answering, which specifically include:
S502: obtain the question text corresponding to the input image.
The question text is text describing a question about the input image. For example, for the input image in Fig. 3, the corresponding question text may be "What is in front of the house?", "What is to the left of the house?", or "What is beside the brook?", and so on.
Specifically, the computer device may obtain a local text corresponding to the input image as the question text, or obtain the question text from another computer device through a communication connection such as a network connection or a USB (Universal Serial Bus) interface connection.
In one embodiment, the terminal may present an image display interface to the user, who can perform a selection operation in the interface; the terminal then takes the selected image as the input image. The terminal may display preset question texts alongside the input image shown in the image display interface. The user can perform a selection operation in the interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, the terminal may invoke a local voice collection device to collect voice data, then either recognize the voice data locally or send it to a server for recognition, to obtain the corresponding question text.
In one embodiment, after obtaining the input image and the corresponding question text, the terminal may execute the image processing method locally. Alternatively, the terminal may send the input image and the corresponding question text to a server, so that the server obtains them and executes the image processing method.
S504: extract the text feature of the question text.
Specifically, the computer device may extract the text feature of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the question text at the character, word, or whole-sentence level.
S506: perform attention allocation processing on the image feature according to the text feature to obtain attention weights.
In one embodiment, the computer device may perform attention allocation processing on the image feature according to the text feature to obtain the attention weights.
In one embodiment, the computer device may map the text feature and the image feature into standard vectors in the same space according to a preset standardization rule, perform a dot product operation on the standard vectors corresponding to the text feature and the image feature to obtain an intermediate result, and then successively apply pooling processing (for example, sum pooling) and regression processing (for example, softmax) to the intermediate result to obtain the attention weights.
S508: determine the weighted image feature according to the image feature and the attention weights.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted image feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding question text to obtain the weighted image feature. The image feature and the corresponding question text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the weighted image feature. In the resulting weighted image feature, the places more relevant to the question text carry larger weights.
S510: perform classification processing according to the weighted image feature to obtain the answer text corresponding to the question text.
Specifically, the computer device may perform classification processing on the weighted image feature through a machine learning classifier to obtain the class label text to which the weighted image feature belongs, and use that class label text as the answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image feature into a trained machine learning classifier for 3000-class classification, obtain the corresponding class label text, and use the class label text as the answer text corresponding to the question text.
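The final classification step can be sketched as follows; a trained classifier would score all 3000 classes, while the stand-in below scores only four invented labels so that the mapping from the highest score to an answer text is visible.

```python
# Hypothetical label set for illustration; the real classifier covers
# thousands of class label texts.
CLASS_LABELS = ["dog", "person", "mountain", "brook"]

def answer_from_scores(scores):
    """Pick the class label with the highest classifier score as the answer."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASS_LABELS[best]

answer_text = answer_from_scores([0.1, 0.2, 0.1, 0.6])
```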
For example, with reference to the input image of Fig. 3, when the question text corresponding to the input image is "What is in front of the house?", the answer text obtained by the above image processing method is "brook"; when the question text corresponding to the input image is "What is beside the brook?", the answer text obtained by the above image processing method is "dog".
In the above embodiment, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain attention weights, and the weighted image feature is determined according to the image feature and the attention weights. Classification processing is then performed on the weighted image feature to output the answer text corresponding to the question text. In this way, attention allocation processing can be performed on the image feature according to the text feature of the question text to obtain the weighted image feature, so that image processing can focus on the image features relevant to the question text. Performing classification processing on the weighted image feature can then greatly improve the accuracy of the answer text, that is, greatly improve the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, Fig. 6 shows a flow diagram of the image processing method. As shown in Fig. 6, the computer device may combine the first model, the second model, and the attention model to construct an image caption system for processing the input image to obtain its image description text. In the image caption system, the first model has a CNN model structure and the second model has an RNN model structure. In this way, a complete image caption system can process the input image and output the image understanding text corresponding to the input image.
As shown in Fig. 6, the input image (Image) can be fed into the image caption system. Multiple candidate regions (Region Proposal) are determined through a convolutional neural network model (CNN network structure), the image features (Feature map) of the corresponding candidate regions are extracted through the convolutional neural network model, and the class label text (Label) corresponding to each candidate region is determined through the convolutional neural network model. The attention model performs attention allocation processing on the class label text and the image feature to obtain the corresponding comprehensive feature. The comprehensive feature is input into a long short-term memory network model (LSTM network structure), which outputs the corresponding image description text (Image Caption).
As shown in Fig. 7, in a specific embodiment, the image processing method includes:
S702: obtain the input image.
S704: determine multiple mutually different candidate regions in the input image through the first model.
S706: extract the image feature of each candidate region through the first model.
S708: determine the class label text corresponding to the input image through the first model and according to the image features.
S710: determine the coded data corresponding to the class label text.
S712: perform attention allocation processing on the image features according to the coded data to obtain attention weights.
S714: compute the comprehensive features according to the attention weights and the image features.
S716: concatenate the comprehensive features corresponding to the candidate regions to obtain a concatenated feature.
S718: obtain the preliminary image description text corresponding to the input image.
S720: sequentially input the concatenated feature and each word vector of the preliminary image description text into the second model.
S722: process the sequentially input concatenated feature and word vectors through the second model, and output the image description text of the input image.
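The data flow of steps S702 to S722 can be sketched with stand-in components; every function below is a placeholder invented for illustration (the real first and second models are neural networks), so only the wiring between the steps is meaningful.

```python
def first_model_regions(image):
    return [image[0:2], image[2:4]]        # S704: candidate regions (stub)

def first_model_feature(region):
    return [float(x) for x in region]      # S706: image feature per region (stub)

def first_model_label(feature):
    return "dog" if sum(feature) > 1 else "grass"   # S708: class label (stub)

def fuse(feature, label):
    return feature + [float(len(label))]   # S710-S714: stand-in fusion

def second_model(spliced, prior_words):
    # S720-S722: stand-in decoder that just echoes the preliminary words;
    # a real second model would condition on the concatenated feature.
    return " ".join(prior_words)

image = [1, 2, 3, 4]
spliced = []
for region in first_model_regions(image):
    feat = first_model_feature(region)
    spliced += fuse(feat, first_model_label(feat))   # S716: concatenate
caption = second_model(spliced, ["a", "dog", "on", "grass"])
```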
In the above image processing method, the image feature of the input image is extracted through the first model and the class label text corresponding to the input image is determined, so the image feature and the corresponding class label text of the input image can be obtained quickly and accurately. The image feature and the corresponding class label text undergo cross-modal fusion to obtain a comprehensive feature, which is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, in combination with the class information to which the input image belongs. The features of the input image are thus used carefully and sufficiently, and the understanding of the image receives the dual guidance of the image feature and the class label text, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
Fig. 7 is a flow diagram of the image processing method in one embodiment. It should be understood that, although the steps in the flow diagram of Fig. 7 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
As shown in Fig. 8, in one embodiment, an image processing method is provided. This embodiment is mainly illustrated by applying the method to the computer device in Fig. 1, such as the terminal 110 or the server 120. With reference to Fig. 8, the image processing method specifically includes the following steps:
S802: obtain the input image and the question text corresponding to the input image.
Specifically, the computer device may obtain a local image and the corresponding text as the input image and the corresponding question text, or obtain the input image and the corresponding question text from another computer device through a communication connection such as a network connection or a USB interface connection.
In one embodiment, the terminal may capture an image in the current field of view of a camera and take the captured image as the input image. In one embodiment, the terminal may invoke a local voice collection device to collect voice data, then either recognize the voice data locally or send it to a server for recognition, to obtain the corresponding question text.
In one embodiment, the terminal may present an image display interface in which the user can perform a selection operation; the terminal takes the selected image as the input image. The images shown in the image display interface may be images stored locally on the terminal, or images obtained by the terminal from a server accessed through a network connection. The terminal may display preset question texts alongside the input image shown in the image display interface. The user can perform a selection operation in the interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, after obtaining the input image and the corresponding question text, the terminal may execute the image processing method locally. Alternatively, the terminal may send the input image and the corresponding question text to a server, so that the server obtains them and executes the image processing method.
S804: extract the image feature of the input image.
In one embodiment, the computer device may extract the image feature of the input image through a convolutional neural network, such as ResNet-80. The input image is fed into the convolutional neural network, whose convolutional layers perform convolution processing on the input image to extract its image feature. That is, after the convolutional neural network convolves the input image through its convolutional layers, it obtains the feature map of the input image; the feature map here is the image feature in this embodiment.
In one embodiment, the convolutional neural network is obtained by learning and training with the images and corresponding class labels in an image library (ImageNet) as training data. After obtaining the input image, the computer device feeds it into the convolutional neural network and extracts the image feature of the input image through the convolutional layer structure of the network.
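The convolution processing performed by a convolutional layer can be sketched as a minimal valid-padding 2D convolution; the kernel below is a simple edge detector chosen for illustration, not a trained ResNet filter, and real networks stack many such layers with learned kernels.

```python
def conv2d(image, kernel):
    """Valid-padding 2D convolution: slide the kernel over the image and
    accumulate elementwise products into a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1]]  # responds to vertical edges between dark and bright columns
fmap = conv2d(image, kernel)
```

The feature map lights up exactly where the dark-to-bright transition occurs, which is the sense in which a feature map encodes the image feature.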
S806: extract the text feature of the question text.
Specifically, the computer device may extract the text feature of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the question text at the character, word, or whole-sentence level.
S808: perform attention allocation processing on the image feature according to the text feature to obtain attention weights.
Specifically, the computer device may perform attention allocation processing on the image feature according to the text feature to obtain the attention weights.
In one embodiment, the computer device may map the text feature to a first standard feature and the image feature to a second standard feature, where the first and second standard features are features in the same mapping space. The first standard feature is added to the second standard feature, a nonlinear operation is then applied, and finally softmax processing is performed to obtain the attention weights.
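The additive scheme just described (add the aligned features, apply a nonlinearity, then softmax) can be sketched as follows; the features are assumed to be already mapped into the same space, the tanh nonlinearity and the sum reduction are illustrative choices, and all values are invented.

```python
import math

def additive_attention(text_feat, image_feats):
    """Add the text feature to each region feature, apply a tanh
    nonlinearity, reduce each region to one score, then softmax."""
    scores = []
    for region in image_feats:
        summed = [t + r for t, r in zip(text_feat, region)]
        activated = [math.tanh(x) for x in summed]  # nonlinear operation
        scores.append(sum(activated))               # reduce to one score
    # softmax processing over the region scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = additive_attention([0.5, 0.5], [[1.0, 1.0], [-1.0, -1.0]])
```

The region that reinforces the text feature ends up with the larger attention weight.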
In one embodiment, the computer device may map the text feature to a first standard feature and the image feature to a second standard feature, where the first and second standard features are features in the same mapping space. A dot product operation is then performed on the first standard feature and the second standard feature to obtain an intermediate feature, and pooling processing (for example, sum pooling) followed by regression processing (for example, softmax) is successively applied to the intermediate feature to obtain the attention weights.
S810: determine the weighted image feature according to the image feature and the attention weights.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted image feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding question text to obtain the weighted image feature. The image feature and the corresponding question text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the weighted image feature. In the resulting weighted image feature, the places more relevant to the question text carry larger weights.
S812: perform classification processing according to the weighted image feature to obtain the answer text corresponding to the question text.
Specifically, the computer device may perform classification processing on the weighted image feature through a machine learning classifier to obtain the class label text to which the weighted image feature belongs, and use that class label text as the answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image feature into a trained machine learning classifier for 3000-class classification, obtain the corresponding class label text, and use the class label text as the answer text corresponding to the question text.
In the above embodiment, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain attention weights, and the weighted image feature is determined according to the image feature and the attention weights. Classification processing is then performed on the weighted image feature to output the answer text corresponding to the question text. In this way, attention allocation processing can be performed on the image feature according to the text feature of the question text to obtain the weighted image feature, so that image processing can focus on the image features relevant to the question text. Performing classification processing on the weighted image feature can then greatly improve the accuracy of the answer text, that is, greatly improve the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, step S806, that is, the step of extracting the text feature of the question text, specifically includes:
S902: obtain the character sequence corresponding to the question text.
Specifically, the computer device may split the question text to obtain a character sequence composed of single characters.
S904: perform word segmentation processing on the question text to obtain the word sequence corresponding to the question text.
Specifically, the computer device may use a word segmentation method to perform word segmentation on the question text and obtain a word sequence composed of words. The computer device may segment the question text using a dictionary-based segmentation algorithm, a segmentation model, or the like. The dictionary-based segmentation algorithm may specifically be a dictionary-based forward maximum matching algorithm, a reverse maximum matching algorithm, a minimum segmentation algorithm, bidirectional maximum matching, or the like. The segmentation model may specifically be a hidden Markov model, a CRF (conditional random field) model, or the like.
In one embodiment, after the computer device segments the question text, it removes stop words from the segmented words to obtain the word sequence. Stop words (Stop Words) are certain characters or words that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve retrieval efficiency, such as some very widely used words, modal particles, polite expressions, prepositions, or conjunctions.
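Dictionary-based forward maximum matching followed by stop-word removal can be sketched as follows; the tiny dictionary, stop-word list, and unspaced input string are invented for the example (real segmenters, especially for Chinese, use far larger resources).

```python
DICTIONARY = {"house", "in", "front", "of", "the", "what", "is"}
STOP_WORDS = {"the", "of", "in", "is", "what"}
MAX_WORD_LEN = 5

def forward_max_match(text):
    """Greedily take the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY or size == 1:
                words.append(candidate)
                i += size
                break
    return words

tokens = forward_max_match("whatisinfrontofthehouse")
word_sequence = [w for w in tokens if w not in STOP_WORDS]  # drop stop words
```

Only the content-bearing words survive the stop-word filter, which is the word sequence that would be passed on for feature extraction.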
S906: extract the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively.
Specifically, the computer device may extract the text features of the character sequence, the word sequence, and the whole sentence of the question text through a recurrent neural network.
In the above embodiment, the text features of the character sequence, the word sequence, and the whole sentence of the question text are extracted separately, so multi-level feature extraction can be performed on the question text at the character level, word level, and sentence level, fully mining the text information of the question text.
In one embodiment, step S808, that is, the step of performing attention allocation processing on the image feature according to the text feature to obtain attention weights, includes: performing attention allocation processing on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. Step S810, that is, the step of determining the weighted image feature according to the image feature and the attention weights, includes: determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.
Specifically, the computer device may perform attention allocation processing on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain the first, second, and third attention weights, and then determine the weighted image feature according to the first, second, and third attention weights in combination with the image feature.
In one embodiment, the computer device may perform weighting processing on the image feature according to the first, second, and third attention weights respectively, to obtain the corresponding first intermediate image features. The first intermediate image features are fused to obtain a second intermediate image feature, which is taken directly as the weighted image feature.
In one embodiment, the computer device may fuse the first, second, and third attention weights, for example by weighted summation, to obtain a comprehensive attention weight. The second intermediate image feature is obtained according to the comprehensive attention weight and the image feature, and is taken directly as the weighted image feature.
In one embodiment, the computer device may perform weighting processing on the image feature according to the first, second, and third attention weights to obtain the corresponding first intermediate image features, and fuse the first intermediate image features to obtain a second intermediate image feature. Attention allocation processing is then performed on the second intermediate image feature according to the text feature of the whole sentence of the question text to obtain a fourth attention weight, and the weighted image feature is determined according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the computer device combines the first, second, and third attention weights with the image feature respectively, obtaining first intermediate image features corresponding to the character level, word level, and sentence level of the question text. The computer device may superimpose the first intermediate image feature corresponding to the character level with the first intermediate image feature corresponding to the word level, and then superimpose the result with the first intermediate image feature corresponding to the sentence level, to obtain the second intermediate image feature.
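The superposition of character-level, word-level, and sentence-level intermediate features can be sketched as follows; the region features, per-level weights, and dimensions are all invented for illustration.

```python
def weight_regions(weights, regions):
    """One level's first intermediate feature: weighted sum of regions."""
    dim = len(regions[0])
    out = [0.0] * dim
    for w, region in zip(weights, regions):
        for i in range(dim):
            out[i] += w * region[i]
    return out

regions = [[1.0, 0.0], [0.0, 1.0]]
# Illustrative attention weights from the character, word, and sentence levels.
char_w, word_w, sent_w = [0.5, 0.5], [0.8, 0.2], [0.2, 0.8]

intermediates = [weight_regions(w, regions) for w in (char_w, word_w, sent_w)]
# Superimpose the three first intermediate features elementwise.
second_intermediate = [sum(vals) for vals in zip(*intermediates)]
```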
In one embodiment, the computer device may perform attention allocation processing again on the second intermediate image feature according to the text feature of the whole sentence of the question text, obtaining the fourth attention weight, and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight. In the above embodiment, the second intermediate image feature is obtained after multi-level attention allocation processing between the question text and the image feature. Attention allocation processing is then applied to the second intermediate image feature according to the text feature of the whole sentence of the question text to obtain the weighted image feature, so that the emphasis of the weighted image feature is closer to the content of the question text, which in turn can improve the accuracy of the answer text subsequently obtained by classifying the weighted image feature.
In the above embodiment, attention allocation processing is performed on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, obtaining the first, second, and third attention weights, and the weighted image feature is determined according to the first, second, and third attention weights in combination with the image feature. In this way, the text information of the question text can be fully mined, so that the emphasis of the weighted image feature is closer to the content of the question text, which in turn can improve the accuracy of the answer text subsequently obtained by classifying the weighted image feature.
In one embodiment, Fig. 10 shows a flow chart of the image processing method. As shown in Fig. 10, the computer device may extract the image feature of the input image through a convolutional neural network and the text feature of the question text through a recurrent neural network. The weighted image feature is input into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text. In this embodiment, the computer device may combine a convolutional neural network, a recurrent neural network, and a machine learning classifier to construct a visual question answering system.
As shown in Figure 10, input picture (image) can be input in the vision question answering system, passes through convolutional neural networks
The characteristics of image (feature map) of model (CNN network structure) extraction input picture.Question text is input to the vision to ask
It answers in system, the text feature (question of question text is extracted by shot and long term memory network model (LSTM network structure)
feature).Automobile driving processing (Attention processing) is done to characteristics of image and text feature, then does recurrence processing
(softmax processing), the power that gains attention weight (Attention value).According to attention weight and characteristics of image, is obtained
Two intermediate image features (Attention map).By the second intermediate image feature (Attention map) and the whole sentence of question text
Automobile driving processing (Attention) is done, weighted image feature is obtained.Weighted image feature is input to machine learning classification
Classified in device (Classification) processing, obtain answer text (Answer) corresponding with question text.
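The pipeline of Figure 10 can be outlined in code roughly as below. This is a structural sketch only: the stub functions stand in for the CNN, the LSTM, and the classifier, whose real architectures and weights the patent does not fix.

```python
import numpy as np

rng = np.random.default_rng(1)

def cnn_extract(image):
    """Stub for the CNN: returns a (regions, dim) feature map."""
    return rng.standard_normal((49, 64))

def lstm_extract(question):
    """Stub for the LSTM: returns a (dim,) question feature."""
    return rng.standard_normal(64)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classifier(feat, n_answers=10):
    """Stub linear classifier over a fixed set of candidate answers."""
    logits = rng.standard_normal((n_answers, feat.shape[0])) @ feat
    return int(np.argmax(logits))

def vqa(image, question):
    feature_map = cnn_extract(image)                      # CNN image feature
    q_feat = lstm_extract(question)                       # LSTM text feature
    attn = softmax(feature_map @ q_feat)                  # attention + softmax
    attn_map = attn[:, None] * feature_map                # second intermediate feature
    attn2 = softmax(attn_map @ q_feat)                    # re-attend with whole sentence
    weighted = (attn2[:, None] * attn_map).sum(axis=0)    # weighted image feature
    return classifier(weighted)                           # answer index

answer = vqa(image=None, question="what color is the cat?")
print(answer)
```

In a real system the classifier's output index would be mapped back to an answer text from a fixed answer vocabulary.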
In one embodiment, the computer device may also perform attention allocation processing on the image feature and the text feature in a co-attention (coordinated attention allocation) manner. Co-attention processing mainly refers to performing attention allocation processing on the image feature according to the text feature, performing attention allocation processing on the text feature according to the image feature, and then combining the results of the two; details are not repeated here.
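A minimal illustration of the co-attention idea, under the assumption of simple dot-product affinity scoring and max-pooling (the patent leaves the exact form open):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
image_feats = rng.standard_normal((49, 64))   # image region features
text_feats = rng.standard_normal((12, 64))    # question token features

# Affinity between every image region and every question token.
affinity = image_feats @ text_feats.T         # (49, 12)

# Attend to image regions guided by the text, and to tokens guided by the image.
img_attn = softmax(affinity.max(axis=1))      # (49,)
txt_attn = softmax(affinity.max(axis=0))      # (12,)

attended_image = img_attn @ image_feats       # (64,)
attended_text = txt_attn @ text_feats         # (64,)

# Combine the two attended features (concatenation, as one choice).
combined = np.concatenate([attended_image, attended_text])
print(combined.shape)  # (128,)
```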
As shown in Figure 11, in one specific embodiment, the image processing method includes the following steps:

S1102: obtain an input image and a question text corresponding to the input image.

S1104: extract the image feature of the input image through a convolutional neural network.

S1106: obtain a character sequence corresponding to the question text.

S1108: perform word segmentation processing on the question text, to obtain a word sequence corresponding to the question text.

S1110: extract the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively through a recurrent neural network.

S1112: perform attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight.

S1114: perform weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features.

S1116: fuse the first intermediate image features, to obtain a second intermediate image feature.

S1118: perform attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight.

S1120: determine the weighted image feature according to the second intermediate image feature and the fourth attention weight.

S1122: input the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above image processing method, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
Figure 11 is a schematic flow chart of the image processing method in one embodiment. It should be understood that, although the steps in the flow chart of Figure 11 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and these steps may be performed in other orders. Moreover, at least some of the steps in Figure 11 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment, but may be performed at different times; their execution order is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In a specific application scenario, a user may input a new image into the above image processing system, and the image processing system executes the above image processing method to provide an understanding of the image. For example, the image processing system may output the image description text of the image. Alternatively, the user may pose several questions about a given image, and the image processing system, executing the above image processing method, may output the corresponding answer texts. In the education sector in particular, the above image processing method can help users quickly and effectively understand the semantic information in a picture and engage in question-and-answer interaction with users, which is especially helpful for children, the elderly, people with visual impairment, and people with language comprehension disorders.
As shown in Figure 12, in one embodiment, an image processing apparatus 1200 is provided, including: an obtaining module 1201, an extraction module 1202, a determining module 1203, a fusion module 1204, and an output module 1205.

The obtaining module 1201 is configured to obtain an input image.

The extraction module 1202 is configured to extract the image feature of the input image through a first model.

The determining module 1203 is configured to determine, through the first model and according to the image feature, the class label text corresponding to the input image.

The fusion module 1204 is configured to perform cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature.

The output module 1205 is configured to process the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, the extraction module 1202 is further configured to determine, through the first model, multiple mutually different candidate regions in the input image, and to extract, through the first model, the image feature of each candidate region respectively.

In one embodiment, the output module 1205 is further configured to concatenate the comprehensive features corresponding to the candidate regions to obtain a spliced feature, and to process the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, the fusion module 1204 is further configured to determine coded data corresponding to the class label text; perform attention allocation processing on the image feature according to the coded data, to obtain an attention weight; and calculate the comprehensive feature according to the attention weight and the image feature.
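The label-guided fusion performed by the fusion module can be sketched as follows. Dot-product attention over region features using an encoded label vector is an illustrative assumption; the patent does not fix the encoding or the scoring function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
image_feats = rng.standard_normal((49, 64))   # region-level image features

# Coded data for the class label text, e.g. an embedding of the label "dog".
label_code = rng.standard_normal(64)

# Attention allocation over image regions, guided by the label code.
attn_weight = softmax(image_feats @ label_code)

# Comprehensive feature: attention-weighted combination of the image features.
comprehensive = attn_weight @ image_feats
print(comprehensive.shape)  # (64,)
```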
In one embodiment, the extraction module 1202 is further configured to extract the text content in the input image through the first model. The fusion module 1204 is further configured to perform cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, to obtain the comprehensive feature.
In one embodiment, the output module 1205 is further configured to obtain a preliminary image description text corresponding to the input image; sequentially input the comprehensive feature and the word vectors of the preliminary image description text into the second model; and process the sequentially input comprehensive feature and word vectors through the second model, to output the image description text of the input image.
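Feeding the comprehensive feature first and then word vectors one by one into a sequential second model can be sketched as below. The plain recurrent cell, tiny vocabulary, and greedy decoding are illustrative simplifications, not the patent's specified architecture.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<start>", "a", "dog", "runs", "<end>"]
dim = 16
embed = rng.standard_normal((len(vocab), dim))   # word vectors
W_h = rng.standard_normal((dim, dim)) * 0.1      # recurrent weights
W_x = rng.standard_normal((dim, dim)) * 0.1      # input weights
W_o = rng.standard_normal((len(vocab), dim))     # output projection

def step(h, x):
    """One recurrent step: new hidden state from previous state and input."""
    return np.tanh(h @ W_h.T + x @ W_x.T)

def decode(comprehensive_feat, max_len=6):
    # First feed the comprehensive feature, then the word vectors one by one.
    h = step(np.zeros(dim), comprehensive_feat)
    token = "<start>"
    out = []
    for _ in range(max_len):
        h = step(h, embed[vocab.index(token)])
        token = vocab[int(np.argmax(W_o @ h))]
        if token == "<end>":
            break
        out.append(token)
    return " ".join(out)

caption = decode(rng.standard_normal(dim))
print(caption)
```

With trained weights, the emitted token sequence would be the image description text.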
As shown in Figure 13, in one embodiment, the image processing apparatus 1200 further includes an attention allocation processing module 1206.

The obtaining module 1201 is further configured to obtain a question text corresponding to the input image.

The extraction module 1202 is further configured to extract the text feature of the question text.

The attention allocation processing module 1206 is configured to perform attention allocation processing on the image feature according to the text feature, to obtain an attention weight.

The determining module 1203 is further configured to determine the weighted image feature according to the image feature and the attention weight.

The output module 1205 is further configured to perform classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
According to the above image processing apparatus, the image feature of the input image is extracted through the first model, and the class label text corresponding to the input image is determined, so that the image feature of the input image and the corresponding class label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image feature and the corresponding class label text to obtain the comprehensive feature, and the comprehensive feature is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, and can also draw on the classification information to which the input image belongs. By thus carefully and fully using the features of the input image, the dual guidance of the image feature and the class label text is obtained when understanding the image, which greatly improves the accuracy of the image understanding information and improves the ability of the computer device to understand images.
As shown in Figure 14, in one embodiment, an image processing apparatus 1400 is provided, including: an obtaining module 1401, an extraction module 1402, an attention allocation processing module 1403, a determining module 1404, and a classification module 1405.

The obtaining module 1401 is configured to obtain an input image and a question text corresponding to the input image.

The extraction module 1402 is configured to extract the image feature of the input image, and is further configured to extract the text feature of the question text.

The attention allocation processing module 1403 is configured to perform attention allocation processing on the image feature according to the text feature, to obtain an attention weight.

The determining module 1404 is configured to determine the weighted image feature according to the image feature and the attention weight.

The classification module 1405 is configured to perform classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
In one embodiment, the extraction module 1402 is further configured to obtain a character sequence corresponding to the question text; perform word segmentation processing on the question text, to obtain a word sequence corresponding to the question text; and extract the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.

In one embodiment, the attention allocation processing module 1403 is further configured to perform attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. The determining module 1404 is further configured to determine the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.

In one embodiment, the determining module 1404 is further configured to perform weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fuse the first intermediate image features, to obtain a second intermediate image feature; perform attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight; and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the attention allocation processing module 1403 is further configured to map the text feature into a first standard feature; map the image feature into a second standard feature; perform a dot-product operation on the first standard feature and the second standard feature, to obtain an intermediate feature; and sequentially perform pooling processing and regression processing on the intermediate feature, to obtain the attention weight.
In one embodiment, the extraction module 1402 is further configured to extract the image feature of the input image through a convolutional neural network, and to extract the text feature of the question text through a recurrent neural network. The classification module 1405 is further configured to input the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above image processing apparatus, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
Figure 15 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be the terminal 110 in Figure 1. As shown in Figure 15, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the image processing method. A computer program may also be stored in the internal memory, and when executed by the processor, causes the processor to perform the image processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, or a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
Figure 16 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be the server 120 in Figure 1. As shown in Figure 16, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the image processing method. A computer program may also be stored in the internal memory, and when executed by the processor, causes the processor to perform the image processing method.
Those skilled in the art will understand that the structures shown in Figure 15 and Figure 16 are merely block diagrams of partial structures related to the solution of this application, and do not constitute a limitation on the computer device to which the solution of this application is applied. A specific computer device may include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.
In one embodiment, the image processing apparatus provided in this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in Figure 15 or Figure 16. The memory of the computer device may store the program modules constituting the image processing apparatus, for example, the obtaining module, extraction module, determining module, fusion module, and output module shown in Figure 12, or the obtaining module, extraction module, attention allocation processing module, determining module, and classification module shown in Figure 14. The computer program constituted by the program modules causes the processor to perform the steps of the image processing method of each embodiment of this application described in this specification.
For example, the computer device shown in Figure 15 or Figure 16 may perform step S202 through the obtaining module in the image processing apparatus shown in Figure 12, perform step S204 through the extraction module, perform step S206 through the determining module, perform step S208 through the fusion module, and perform step S210 through the output module.

For example, the computer device shown in Figure 15 or Figure 16 may perform step S802 through the obtaining module in the image processing apparatus shown in Figure 14, perform steps S804 and S806 through the extraction module, perform step S808 through the attention allocation processing module, perform step S810 through the determining module, and perform step S812 through the classification module.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining an input image; extracting the image feature of the input image through a first model; determining, through the first model and according to the image feature, the class label text corresponding to the input image; performing cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature; and processing the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, when executing the step of extracting the image feature of the input image through the first model, the computer program causes the processor to specifically perform the following steps: determining, through the first model, multiple mutually different candidate regions in the input image; and extracting, through the first model, the image feature of each candidate region respectively.

In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: concatenating the comprehensive features corresponding to the candidate regions, to obtain a spliced feature; and processing the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, when executing the step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature, the computer program causes the processor to specifically perform the following steps: determining coded data corresponding to the class label text; performing attention allocation processing on the image feature according to the coded data, to obtain an attention weight; and calculating the comprehensive feature according to the attention weight and the image feature.

In one embodiment, the computer program further causes the processor to perform the following step: extracting the text content in the input image through the first model. When executing the step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature, the computer program causes the processor to specifically perform the following step: performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, to obtain the comprehensive feature.
In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: obtaining a preliminary image description text corresponding to the input image; sequentially inputting the comprehensive feature and the word vectors of the preliminary image description text into the second model; and processing the sequentially input comprehensive feature and word vectors through the second model, to output the image description text of the input image.

In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a question text corresponding to the input image; extracting the text feature of the question text; performing attention allocation processing on the image feature according to the text feature, to obtain an attention weight; determining the weighted image feature according to the image feature and the attention weight; and performing classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
According to the above computer device, the image feature of the input image is extracted through the first model, and the class label text corresponding to the input image is determined, so that the image feature of the input image and the corresponding class label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image feature and the corresponding class label text to obtain the comprehensive feature, and the comprehensive feature is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, and can also draw on the classification information to which the input image belongs. By thus carefully and fully using the features of the input image, the dual guidance of the image feature and the class label text is obtained when understanding the image, which greatly improves the accuracy of the image understanding information and improves the ability of the computer device to understand images.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining an input image and a question text corresponding to the input image; extracting the image feature of the input image; extracting the text feature of the question text; performing attention allocation processing on the image feature according to the text feature, to obtain an attention weight; determining the weighted image feature according to the image feature and the attention weight; and performing classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
In one embodiment, when executing the step of extracting the text feature of the question text, the computer program causes the processor to specifically perform the following steps: obtaining a character sequence corresponding to the question text; performing word segmentation processing on the question text, to obtain a word sequence corresponding to the question text; and extracting the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.

In one embodiment, when executing the step of performing attention allocation processing on the image feature according to the text feature to obtain the attention weight, the computer program causes the processor to specifically perform the following step: performing attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. When executing the step of determining the weighted image feature according to the image feature and the attention weight, the computer program causes the processor to specifically perform the following step: determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.
In one embodiment, when executing the step of determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight in combination with the image feature, the computer program causes the processor to specifically perform the following steps: performing weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features, to obtain a second intermediate image feature; performing attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight; and determining the weighted image feature according to the second intermediate image feature and the fourth attention weight.
In one embodiment, when executing the step of performing attention allocation processing on the image feature according to the text feature to obtain the attention weight, the computer program causes the processor to specifically perform the following steps: mapping the text feature into a first standard feature; mapping the image feature into a second standard feature; performing a dot-product operation on the first standard feature and the second standard feature, to obtain an intermediate feature; and sequentially performing pooling processing and regression processing on the intermediate feature, to obtain the attention weight.
In one embodiment, when executing the step of extracting the image feature of the input image, the computer program causes the processor to specifically perform the following step: extracting the image feature of the input image through a convolutional neural network. When executing the step of extracting the text feature of the question text, the computer program causes the processor to specifically perform the following step: extracting the text feature of the question text through a recurrent neural network. When executing the step of performing classification processing according to the weighted image feature to obtain the answer text corresponding to the question text, the computer program causes the processor to specifically perform the following step: inputting the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above computer device, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps: obtaining an input image; extracting the image feature of the input image through a first model; determining, through the first model and according to the image feature, the class label text corresponding to the input image; performing cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature; and processing the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, when executing the step of extracting the image feature of the input image through the first model, the computer program causes the processor to specifically perform the following steps: determining, through the first model, multiple mutually different candidate regions in the input image; and extracting, through the first model, the image feature of each candidate region respectively.

In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: concatenating the comprehensive features corresponding to the candidate regions, to obtain a spliced feature; and processing the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, when causing the processor to perform the step of performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features, the computer program specifically causes the processor to perform the following steps: determining encoded data corresponding to the class label text; performing attention allocation processing on the image features according to the encoded data to obtain attention weights; and computing comprehensive features according to the attention weights and the image features.
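As a rough sketch of the cross-modal fusion step just described, the encoded class label text can score each region's image features, the scores can be normalized into attention weights, and the comprehensive feature computed as the weighted sum. The dot-product scoring, softmax normalization, and all dimensions below are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_modal_fusion(region_features, label_encoding):
    """Attention-allocate over region features using the encoded class
    label text, then pool them into one comprehensive feature.

    region_features: (num_regions, dim) image features
    label_encoding:  (dim,) encoding of the class label text
    """
    scores = region_features @ label_encoding   # relevance of each region
    attention = softmax(scores)                 # attention weights, sum to 1
    return attention @ region_features          # weighted sum -> (dim,)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
label = rng.normal(size=8)
fused = cross_modal_fusion(feats, label)
```

A trained system would learn the label encoding and likely a more elaborate scoring function; the shape of the computation is what matters here.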
In one embodiment, the computer program further causes the processor to perform the following step: extracting text content in the input image through the first model. When causing the processor to perform the step of performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features, the computer program specifically causes the processor to perform the following step: performing cross-modal fusion on the image features, the text content corresponding to the image features, and the class label text corresponding to the image features to obtain comprehensive features.
In one embodiment, when causing the processor to perform the step of processing the comprehensive features through the second model to output the image description text of the input image, the computer program specifically causes the processor to perform the following steps: obtaining a preliminary image description text corresponding to the input image; sequentially inputting the comprehensive features and each word vector of the preliminary image description text to the second model; and processing the sequentially input comprehensive features and word vectors through the second model to output the image description text of the input image.
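The sequential input just described (comprehensive features first, then each word vector of the preliminary description) could be sketched with a minimal recurrent cell. The tanh cell, the shared input weights for both feature types, and all sizes are assumptions for illustration only, not the patent's second model:

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One step of a minimal tanh recurrent cell."""
    return np.tanh(h @ W_h + x @ W_x)

def decode(comprehensive, word_vectors, W_h, W_x):
    """Feed the comprehensive feature first, then each word vector of the
    preliminary description, mirroring the sequential input above."""
    h = np.zeros(W_h.shape[0])
    h = rnn_step(h, comprehensive, W_h, W_x)   # step 0: fused feature
    states = []
    for wv in word_vectors:                    # steps 1..T: word vectors
        h = rnn_step(h, wv, W_h, W_x)
        states.append(h)
    # A real model would project each state onto a vocabulary distribution
    # to emit the next description word; we return the states themselves.
    return np.array(states)

rng = np.random.default_rng(0)
W_h = rng.normal(size=(16, 16)) * 0.1
W_x = rng.normal(size=(8, 16)) * 0.1
comprehensive = rng.normal(size=8)
word_vecs = rng.normal(size=(3, 8))
states = decode(comprehensive, word_vecs, W_h, W_x)
```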
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a question text corresponding to the input image; extracting text features of the question text; performing attention allocation processing on the image features according to the text features to obtain attention weights; determining weighted image features according to the image features and the attention weights; and performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
With the above computer-readable storage medium, the image features of the input image are extracted through the first model and the class label text corresponding to the input image is determined, so the image features and the corresponding class label text of the input image can be obtained quickly and accurately. Cross-modal fusion is performed on the image features and the corresponding class label text to obtain comprehensive features, which are then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself while also drawing on the class information to which the input image belongs. The features of the input image are thus used carefully and thoroughly, and the understanding of the image is guided by both the image features and the class label text, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
A computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the following steps: obtaining an input image and a question text corresponding to the input image; extracting image features of the input image; extracting text features of the question text; performing attention allocation processing on the image features according to the text features to obtain attention weights; determining weighted image features according to the image features and the attention weights; and performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
In one embodiment, when causing the processor to perform the step of extracting the text features of the question text, the computer program specifically causes the processor to perform the following steps: obtaining a character sequence corresponding to the question text; performing word segmentation on the question text to obtain a word sequence corresponding to the question text; and extracting text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.
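To illustrate the distinction between the character sequence and the segmented word sequence of a question, consider the toy example below. The sample question and its hand-written segmentation are hypothetical; a real system would use a trained segmenter (e.g., a tool such as jieba) rather than a fixed list:

```python
# Character sequence vs. word sequence for a Chinese question.
question = "红色的汽车在哪里"  # "Where is the red car"

# Character sequence: one entry per character, no segmentation needed.
chars = list(question)

# Word sequence: produced by word segmentation. The split below is
# hand-written purely for illustration.
words = ["红色", "的", "汽车", "在", "哪里"]
```

Extracting features at both granularities (plus the whole sentence) lets the attention mechanism later in the pipeline weigh image regions against the question at multiple levels of detail.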
In one embodiment, when causing the processor to perform the step of performing attention allocation processing on the image features according to the text features to obtain attention weights, the computer program specifically causes the processor to perform the following step: performing attention allocation processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. When causing the processor to perform the step of determining weighted image features according to the image features and the attention weights, the computer program specifically causes the processor to perform the following step: determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image features.
In one embodiment, when causing the processor to perform the step of determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features, the computer program specifically causes the processor to perform the following steps: weighting the image features according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features to obtain a second intermediate image feature; performing attention allocation processing on the second intermediate image feature according to the text features of the whole sentence of the question text, to obtain a fourth attention weight; and determining weighted image features according to the second intermediate image feature and the fourth attention weight.
In one embodiment, when causing the processor to perform the step of performing attention allocation processing on the image features according to the text features to obtain attention weights, the computer program specifically causes the processor to perform the following steps: mapping the text features to a first standard feature; mapping the image features to a second standard feature; performing element-wise (dot-product) multiplication on the first standard feature and the second standard feature to obtain an intermediate feature; and successively performing pooling processing and regression processing on the intermediate feature to obtain attention weights.
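A minimal sketch of this attention computation, assuming learned linear maps into the shared "standard" space, sum-pooling over the feature axis, and softmax as the regression step (all of these specifics are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(text_feat, region_feats, W_t, W_v):
    """Map both modalities to a shared 'standard' space, multiply
    element-wise, pool, and regress to per-region attention weights."""
    t = text_feat @ W_t            # first standard feature,  shape (k,)
    v = region_feats @ W_v         # second standard feature, shape (n, k)
    inter = v * t                  # element-wise (dot) product per region
    pooled = inter.sum(axis=1)     # pooling over the feature axis -> (n,)
    return softmax(pooled)         # regression: weights sum to 1

rng = np.random.default_rng(0)
text = rng.normal(size=6)
regions = rng.normal(size=(5, 10))
W_t = rng.normal(size=(6, 16))
W_v = rng.normal(size=(10, 16))
w = attention_weights(text, regions, W_t, W_v)
```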
In one embodiment, when causing the processor to perform the step of extracting the image features of the input image, the computer program specifically causes the processor to perform the following step: extracting the image features of the input image through a convolutional neural network. When causing the processor to perform the step of extracting the text features of the question text, the computer program specifically causes the processor to perform the following step: extracting the text features of the question text through a recurrent neural network. When causing the processor to perform the step of performing classification according to the weighted image features to obtain the answer text corresponding to the question text, the computer program specifically causes the processor to perform the following step: inputting the weighted image features into a machine-learning classifier for classification to obtain the answer text corresponding to the question text.
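Putting these pieces together, a toy version of the classification stage could look like this. The linear classifier over a fixed answer vocabulary and the random weights are purely illustrative stand-ins for the trained CNN, RNN, and machine-learning classifier:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer(region_feats, text_feat, W_cls, answers):
    """Question-guided attention over region features, then a linear
    classifier over a fixed answer vocabulary (assumed setup)."""
    weights = softmax(region_feats @ text_feat)   # attention allocation
    weighted = weights @ region_feats             # weighted image feature
    logits = weighted @ W_cls                     # classifier scores
    return answers[int(np.argmax(logits))]        # highest-scoring answer

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))    # e.g. CNN features of 4 regions
tfeat = rng.normal(size=8)           # e.g. RNN feature of the question
W_cls = rng.normal(size=(8, 3))
answers = ["yes", "no", "two"]
pred = answer(regions, tfeat, W_cls, answers)
```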
With the above computer-readable storage medium, the image features of the input image are extracted, the text features of the question text corresponding to the input image are extracted, attention allocation processing is performed on the image features according to the text features to obtain attention weights, and weighted image features are determined according to the image features and the attention weights. Classification is then performed according to the weighted image features to output the answer text corresponding to the question text. In this way, attention allocation can be applied to the image features according to the text features of the question text to obtain weighted image features, so that processing focuses on the image features relevant to the question text; classifying the weighted image features then greatly improves the accuracy of the answer text, that is, it greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (17)
1. An image processing method, comprising:
obtaining an input image;
extracting image features of the input image through a first model;
determining, through the first model and according to the image features, class label text corresponding to the input image;
performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features; and
processing the comprehensive features through a second model to output image description text of the input image.
2. The method according to claim 1, wherein extracting the image features of the input image through the first model comprises:
determining, through the first model, a plurality of mutually different candidate regions in the input image; and
extracting, through the first model, the image features of each candidate region respectively.
3. The method according to claim 2, wherein processing the comprehensive features through the second model to output the image description text of the input image comprises:
concatenating the comprehensive features corresponding to each candidate region to obtain a concatenated feature; and
processing the concatenated feature through the second model to output the image description text of the input image.
4. The method according to claim 1, wherein performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features comprises:
determining encoded data corresponding to the class label text;
performing attention allocation processing on the image features according to the encoded data to obtain attention weights; and
computing comprehensive features according to the attention weights and the image features.
5. The method according to claim 1, further comprising:
extracting text content in the input image through the first model;
wherein performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features comprises:
performing cross-modal fusion on the image features, the text content corresponding to the image features, and the class label text corresponding to the image features to obtain comprehensive features.
6. The method according to claim 1, wherein processing the comprehensive features through the second model to output the image description text of the input image comprises:
obtaining a preliminary image description text corresponding to the input image;
sequentially inputting the comprehensive features and each word vector of the preliminary image description text to the second model; and
processing the sequentially input comprehensive features and word vectors through the second model to output the image description text of the input image.
7. The method according to any one of claims 1 to 6, further comprising:
obtaining a question text corresponding to the input image;
extracting text features of the question text;
performing attention allocation processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
8. An image processing method, comprising:
obtaining an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting text features of the question text;
performing attention allocation processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
9. The method according to claim 8, wherein extracting the text features of the question text comprises:
obtaining a character sequence corresponding to the question text;
performing word segmentation on the question text to obtain a word sequence corresponding to the question text; and
extracting text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.
10. The method according to claim 9, wherein performing attention allocation processing on the image features according to the text features to obtain attention weights comprises:
performing attention allocation processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight;
and wherein determining weighted image features according to the image features and the attention weights comprises:
determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image features.
11. The method according to claim 10, wherein determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features comprises:
weighting the image features according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features;
fusing the first intermediate image features to obtain a second intermediate image feature;
performing attention allocation processing on the second intermediate image feature according to the text features of the whole sentence of the question text, to obtain a fourth attention weight; and
determining weighted image features according to the second intermediate image feature and the fourth attention weight.
12. The method according to claim 8, wherein performing attention allocation processing on the image features according to the text features to obtain attention weights comprises:
mapping the text features to a first standard feature;
mapping the image features to a second standard feature;
performing element-wise multiplication on the first standard feature and the second standard feature to obtain an intermediate feature; and
successively performing pooling processing and regression processing on the intermediate feature to obtain attention weights.
13. The method according to any one of claims 8 to 12, wherein extracting the image features of the input image comprises:
extracting the image features of the input image through a convolutional neural network;
wherein extracting the text features of the question text comprises:
extracting the text features of the question text through a recurrent neural network;
and wherein performing classification according to the weighted image features to obtain the answer text corresponding to the question text comprises:
inputting the weighted image features into a machine-learning classifier for classification to obtain the answer text corresponding to the question text.
14. An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image;
an extraction module, configured to extract image features of the input image through a first model;
a determining module, configured to determine, through the first model and according to the image features, class label text corresponding to the input image;
a fusion module, configured to perform cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features; and
an output module, configured to process the comprehensive features through a second model to output image description text of the input image.
15. An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image and a question text corresponding to the input image;
an extraction module, configured to extract image features of the input image, the extraction module being further configured to extract text features of the question text;
an attention allocation processing module, configured to perform attention allocation processing on the image features according to the text features to obtain attention weights;
a determining module, configured to determine weighted image features according to the image features and the attention weights; and
a classification module, configured to perform classification according to the weighted image features to obtain an answer text corresponding to the question text.
16. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 13.
17. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810758796.5A CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810758796.5A CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109002852A true CN109002852A (en) | 2018-12-14 |
CN109002852B CN109002852B (en) | 2023-05-23 |
Family
ID=64598961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810758796.5A Active CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002852B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635150A (en) * | 2018-12-19 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Document creation method, device and storage medium |
CN109740515A (en) * | 2018-12-29 | 2019-05-10 | 科大讯飞股份有限公司 | One kind reading and appraising method and device |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
CN109858499A (en) * | 2019-01-23 | 2019-06-07 | 哈尔滨理工大学 | A kind of tank armor object detection method based on Faster R-CNN |
CN109886309A (en) * | 2019-01-25 | 2019-06-14 | 成都浩天联讯信息技术有限公司 | A method of digital picture identity is forged in identification |
CN109947977A (en) * | 2019-03-13 | 2019-06-28 | 广东小天才科技有限公司 | A kind of intension recognizing method and device, terminal device of combination image |
CN110110772A (en) * | 2019-04-25 | 2019-08-09 | 北京小米智能科技有限公司 | Determine the method, apparatus and computer readable storage medium of image tag accuracy |
CN110135441A (en) * | 2019-05-17 | 2019-08-16 | 北京邮电大学 | A kind of text of image describes method and device |
CN110689052A (en) * | 2019-09-06 | 2020-01-14 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110717514A (en) * | 2019-09-06 | 2020-01-21 | 平安国际智慧城市科技股份有限公司 | Session intention identification method and device, computer equipment and storage medium |
CN111209961A (en) * | 2020-01-03 | 2020-05-29 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111563551A (en) * | 2020-04-30 | 2020-08-21 | 支付宝(杭州)信息技术有限公司 | Multi-mode information fusion method and device and electronic equipment |
CN111611420A (en) * | 2020-05-26 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating image description information |
WO2020173329A1 (en) * | 2019-02-26 | 2020-09-03 | 腾讯科技(深圳)有限公司 | Image fusion method, model training method, and related device |
CN111669587A (en) * | 2020-04-17 | 2020-09-15 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
WO2020182112A1 (en) * | 2019-03-13 | 2020-09-17 | 腾讯科技(深圳)有限公司 | Image region positioning method, model training method, and related apparatus |
CN111767727A (en) * | 2020-06-24 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106777185A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of across media Chinese herbal medicine image search methods based on deep learning |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107766349A (en) * | 2016-08-16 | 2018-03-06 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus, equipment and client for generating text |
CN107979764A (en) * | 2017-12-06 | 2018-05-01 | 中国石油大学(华东) | Video caption generation method based on semantic segmentation and multilayer notice frame |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107766349A (en) * | 2016-08-16 | 2018-03-06 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus, equipment and client for generating text |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106777185A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of across media Chinese herbal medicine image search methods based on deep learning |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107979764A (en) * | 2017-12-06 | 2018-05-01 | 中国石油大学(华东) | Video caption generation method based on semantic segmentation and multilayer notice frame |
Non-Patent Citations (4)
Title |
---|
ANDREJ KARPATHY ET AL.: "Deep Visual-Semantic Alignments for Generating Image Descriptions", 《IEEE》 *
CAO LIUBIN ET AL.: "Image description method based on continuous Skip-gram and deep learning", 《Journal of Test and Measurement Technology》 *
XIE JINBAO ET AL.: "Multi-feature fusion Chinese text classification based on a semantic-understanding attention neural network", 《Journal of Electronics & Information Technology》 *
MA LONGLONG ET AL.: "A survey of text description methods for images", 《Journal of Chinese Information Processing》 *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635150A (en) * | 2018-12-19 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Document creation method, device and storage medium |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
CN109740515A (en) * | 2018-12-29 | 2019-05-10 | 科大讯飞股份有限公司 | One kind reading and appraising method and device |
CN109858499A (en) * | 2019-01-23 | 2019-06-07 | 哈尔滨理工大学 | A kind of tank armor object detection method based on Faster R-CNN |
CN109886309A (en) * | 2019-01-25 | 2019-06-14 | 成都浩天联讯信息技术有限公司 | A method of digital picture identity is forged in identification |
TWI725746B (en) * | 2019-02-26 | 2021-04-21 | 大陸商騰訊科技(深圳)有限公司 | Image fusion method, model training method, and related device |
WO2020173329A1 (en) * | 2019-02-26 | 2020-09-03 | 腾讯科技(深圳)有限公司 | Image fusion method, model training method, and related device |
US11776097B2 (en) | 2019-02-26 | 2023-10-03 | Tencent Technology (Shenzhen) Company Limited | Image fusion method, model training method, and related apparatuses |
CN109947977A (en) * | 2019-03-13 | 2019-06-28 | 广东小天才科技有限公司 | A kind of intension recognizing method and device, terminal device of combination image |
WO2020182112A1 (en) * | 2019-03-13 | 2020-09-17 | 腾讯科技(深圳)有限公司 | Image region positioning method, model training method, and related apparatus |
CN110110772A (en) * | 2019-04-25 | 2019-08-09 | 北京小米智能科技有限公司 | Determine the method, apparatus and computer readable storage medium of image tag accuracy |
CN110135441B (en) * | 2019-05-17 | 2020-03-03 | 北京邮电大学 | Text description method and device for image |
CN110135441A (en) * | 2019-05-17 | 2019-08-16 | 北京邮电大学 | A kind of text of image describes method and device |
CN110689052B (en) * | 2019-09-06 | 2022-03-11 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110689052A (en) * | 2019-09-06 | 2020-01-14 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110717514A (en) * | 2019-09-06 | 2020-01-21 | 平安国际智慧城市科技股份有限公司 | Session intention identification method and device, computer equipment and storage medium |
CN111209961B (en) * | 2020-01-03 | 2020-10-09 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111209961A (en) * | 2020-01-03 | 2020-05-29 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111967487B (en) * | 2020-03-23 | 2022-09-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111669587A (en) * | 2020-04-17 | 2020-09-15 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
CN111669587B (en) * | 2020-04-17 | 2021-07-20 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
CN111563551A (en) * | 2020-04-30 | 2020-08-21 | 支付宝(杭州)信息技术有限公司 | Multi-mode information fusion method and device and electronic equipment |
CN111611420A (en) * | 2020-05-26 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating image description information |
CN111611420B (en) * | 2020-05-26 | 2024-01-23 | 北京字节跳动网络技术有限公司 | Method and device for generating image description information |
CN111767727A (en) * | 2020-06-24 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN111767727B (en) * | 2020-06-24 | 2024-02-06 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109002852B (en) | 2023-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109002852A (en) | Image processing method, device, computer readable storage medium and computer equipment | |
Anderson et al. | Bottom-up and top-down attention for image captioning and visual question answering | |
Chen et al. | Spatial memory for context reasoning in object detection | |
Arevalo et al. | Gated multimodal networks | |
CN111859912B (en) | PCNN model-based remote supervision relationship extraction method with entity perception | |
CN112084331A (en) | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium | |
Zhou et al. | A real-time global inference network for one-stage referring expression comprehension | |
CN110866140A (en) | Image feature extraction model training method, image searching method and computer equipment | |
Ding et al. | Deep interactive image matting with feature propagation | |
CN114926835A (en) | Text generation method and device, and model training method and device | |
CN115858847B (en) | Combined query image retrieval method based on cross-modal attention reservation | |
CN115408517A (en) | Knowledge injection-based multi-modal irony recognition method of double-attention network | |
CN111915618A (en) | Example segmentation algorithm and computing device based on peak response enhancement | |
Połap | Hybrid image analysis model for hashtag recommendation through the use of deep learning methods | |
Thangavel et al. | A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models | |
CN113989405A (en) | Image generation method based on small sample continuous learning | |
Kumar et al. | Region driven remote sensing image captioning | |
CN111563161B (en) | Statement identification method, statement identification device and intelligent equipment | |
CN113159053A (en) | Image recognition method and device and computing equipment | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis | |
Cui et al. | Multi-scale interpretation model for convolutional neural networks: Building trust based on hierarchical interpretation | |
CN114443916B (en) | Supply and demand matching method and system for test data | |
CN112287159B (en) | Retrieval method, electronic device and computer readable medium | |
CN113779244B (en) | Document emotion classification method and device, storage medium and electronic equipment | |
CN115098646A (en) | Multilevel relation analysis and mining method for image-text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||