CN109002852A - Image processing method, device, computer readable storage medium and computer equipment - Google Patents
- Publication number
- CN109002852A CN109002852A CN201810758796.5A CN201810758796A CN109002852A CN 109002852 A CN109002852 A CN 109002852A CN 201810758796 A CN201810758796 A CN 201810758796A CN 109002852 A CN109002852 A CN 109002852A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- feature
- input image
- attention weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
This application relates to an image processing method, apparatus, computer-readable storage medium, and computer device. The method comprises: obtaining an input image; extracting image features of the input image with a first model; determining, by the first model and according to the image features, a class label text corresponding to the input image; performing cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and processing the composite feature with a second model to output an image description text for the input image. The scheme provided by this application can improve the accuracy of image understanding information.
Description
Technical field
This application relates to the field of computer technology, and in particular to an image processing method, apparatus, computer-readable storage medium, and computer device.
Background technique
With the development of computer technology, using computer equipment to handle various tasks or to interact with people has become increasingly common. For example, computer equipment can help users understand images, which is especially valuable for children, the elderly, the visually impaired, and people with language comprehension difficulties.
Traditional image understanding methods usually extract the image features of an image and feed the image features, together with preset text, into an encoder, then decode with a decoder to obtain image understanding information. However, in such encoder-decoder structures, the guidance provided by the image features gradually fades as processing time increases, so the resulting image understanding is not accurate enough.
Summary of the invention
Based on this, in view of the technical problem that traditional image understanding schemes are not accurate enough, it is necessary to provide an image processing method, apparatus, computer-readable storage medium, and computer device.
An image processing method, comprising:
obtaining an input image;
extracting image features of the input image with a first model;
determining, by the first model and according to the image features, a class label text corresponding to the input image;
performing cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and
processing the composite feature with a second model to output an image description text for the input image.
An image processing apparatus, the apparatus comprising:
an obtaining module, configured to obtain an input image;
an extraction module, configured to extract image features of the input image with a first model;
a determining module, configured to determine, by the first model and according to the image features, a class label text corresponding to the input image;
a fusion module, configured to perform cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature; and
an output module, configured to process the composite feature with a second model and output an image description text for the input image.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
With the above image processing method, apparatus, computer-readable storage medium, and computer device, the first model extracts the image features of the input image and determines the class label text corresponding to the input image, so the image features and corresponding class label text can be obtained quickly and accurately. Cross-modal fusion of the image features and the corresponding class label text yields a composite feature, which the second model then processes to obtain an image description text. In this way, during processing the second model can make full use of the image features of the input image itself while also incorporating the class information of the input image. Because the features of the input image are used so thoroughly, image understanding is guided jointly by the image features and the class label text, which substantially improves the accuracy of image understanding information and improves the computer device's ability to understand images.
An image processing method, comprising:
obtaining an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting text features of the question text;
performing attention distribution processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification processing according to the weighted image features to obtain an answer text corresponding to the question text.
An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image and a question text corresponding to the input image;
an extraction module, configured to extract image features of the input image;
the extraction module being further configured to extract text features of the question text;
an attention distribution module, configured to perform attention distribution processing on the image features according to the text features to obtain attention weights;
a determining module, configured to determine weighted image features according to the image features and the attention weights; and
a classification module, configured to perform classification processing according to the weighted image features to obtain an answer text corresponding to the question text.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
With the above image processing method, apparatus, computer-readable storage medium, and computer device, image features of the input image are extracted, text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features to obtain attention weights, and weighted image features are determined according to the image features and the attention weights. Classification processing is then performed according to the weighted image features to output the answer text corresponding to the question text. In this way, attention distribution processing guided by the text features of the question focuses processing on the image features relevant to the question text, and classifying the weighted image features significantly improves the accuracy of the answer text. That is, the accuracy of image understanding information is substantially improved, and the computer device's ability to understand images is improved.
Detailed description of the invention
Fig. 1 is a diagram of the application environment of the image processing method in one embodiment;
Fig. 2 is the flow diagram of image processing method in one embodiment;
Fig. 3 is the schematic diagram of input picture in one embodiment;
Fig. 4 is a flow diagram of the step of performing cross-modal fusion of image features and corresponding class label text to obtain a composite feature in one embodiment;
Fig. 5 is a flow diagram of the steps of image question answering in one embodiment;
Fig. 6 is the flow diagram of image processing method in another embodiment;
Fig. 7 is the flow diagram of image processing method in another embodiment;
Fig. 8 is the flow diagram of image processing method in one embodiment;
Fig. 9 is a flow diagram of the step of extracting text features of the question text in one embodiment;
Figure 10 is the flow diagram of image processing method in another embodiment;
Figure 11 is the flow diagram of image processing method in another embodiment;
Figure 12 is the structural block diagram of image processing apparatus in one embodiment;
Figure 13 is the structural block diagram of image processing apparatus in another embodiment;
Figure 14 is the structural block diagram of image processing apparatus in one embodiment;
Figure 15 is the structural block diagram of computer equipment in one embodiment;
Figure 16 is the structural block diagram of computer equipment in another embodiment.
Specific embodiment
To make the objects, technical solutions, and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the application and are not intended to limit it.
Fig. 1 is a diagram of the application environment of the image processing method in one embodiment. Referring to Fig. 1, the image processing method is applied in an image processing system. The image processing system includes a terminal 110 and a server 120. The image processing method can be completed in the terminal 110 or in the server 120: the terminal 110 may directly acquire the input image and execute the method on the terminal side, or, after obtaining the input image, the terminal 110 may send it to the server so that the server obtains the input image and executes the method. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a laptop, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.
As shown in Fig. 2, in one embodiment, an image processing method is provided. This embodiment is mainly illustrated by applying the method to a computer device in Fig. 1 above, such as the terminal 110 or the server 120. Referring to Fig. 2, the image processing method specifically comprises the following steps:
S202: obtain an input image.
Specifically, the computer device can use a local image as the input image, or obtain the input image from another computer device through a communication channel such as a network connection or a USB (Universal Serial Bus) interface connection.
In one embodiment, the terminal can capture an image in the current field of view of a camera and use the captured image as the input image. Alternatively, the terminal can present an image display interface to the user; the user can make a selection in that interface, and the terminal uses the selected image as the input image. The images shown in the display interface may be images stored locally on the terminal, or images the terminal obtains from a server over a network connection.
In one embodiment, after obtaining the input image, the terminal can execute the image processing method locally. Alternatively, the terminal can send the input image to a server so that the server obtains the input image and executes the image processing method.
S204: extract image features of the input image with the first model.
Here, a model is a model composed of an artificial neural network. Artificial neural networks (ANNs), also called neural networks (NNs) or connection models, abstract the neuronal network of the human brain from an information-processing perspective to build a model, and form different networks with different connection patterns. In engineering and academia they are often simply called neural networks.
Neural network models include, for example, CNN (Convolutional Neural Network) models, DNN (Deep Neural Network) models, and RNN (Recurrent Neural Network) models.
A convolutional neural network includes convolutional layers and pooling layers. There are many convolutional neural network models, such as the VGG (Visual Geometry Group) network model, the GoogLeNet model, and ResNet (residual network) models. A deep neural network includes an input layer, hidden layers, and an output layer, with full connections between layers. A recurrent neural network is a neural network for modeling sequence data: the current output of a sequence also depends on the outputs that came before. Concretely, the network memorizes earlier information and applies it to the computation of the current output; the nodes within a hidden layer are no longer unconnected, and the input of a hidden layer includes not only the output of the input layer but also the hidden layer's own output from the previous time step. Recurrent neural network models include, for example, the LSTM (Long Short-Term Memory) model.
Image features are features representing properties of an image such as color, texture, shape, or spatial relationships. In this embodiment, image features can specifically be data that the computer device extracts from the input image to represent its color, texture, shape, or spatial relationships, yielding a "non-image" representation or description of the image, such as numerical values, vectors, or symbols.
In this embodiment, the first model can specifically be a convolutional neural network model, such as ResNet-80. The computer device can input the input image into the first model and extract the image features of the input image with the first model. For example, the computer device can input the input image into a convolutional neural network model, perform convolution processing on the input image with the convolutional layers of the network, and extract the image features of the input image. That is, after the convolutional layers perform convolution processing on the input image, the convolutional neural network obtains the feature map of the input image; the feature map here is the image feature in this embodiment.
In one embodiment, the first model is a model for classifying the input image, obtained by learning and training with the images and corresponding class labels in an image library (such as ImageNet) as training data. After obtaining the input image, the computer device inputs it into the first model, extracts the image features of the input image with the convolutional-layer structure of the first model, and determines the class label text corresponding to the input image with the pooling-layer structure and/or fully-connected-layer structure of the first model.
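The patent does not fix an implementation for the convolution-and-pooling step; as a minimal sketch, assuming NumPy arrays stand in for the input image and a single hypothetical learned filter stands in for a convolutional layer, the "convolve then pool to get a feature map" flow of S204 can be illustrated as follows (in practice a pretrained CNN such as ResNet would be used, and all names here are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)    # toy "input image"
edge_kernel = np.array([[1., -1.], [1., -1.]])      # hypothetical learned filter
feature_map = max_pool(conv2d(image, edge_kernel))  # conv -> pool, as in S204
```

The resulting `feature_map` plays the role of the image feature passed on to classification and fusion.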
S206: determine, by the first model and according to the image features, the class label text corresponding to the input image.
Here, the class label text is the label text of the class to which the input image belongs. Specifically, the computer device can extract image features with the first model, then perform subsequent classification processing on the extracted features to obtain the class of the input image, and thereby determine the class label text corresponding to the input image.
In one embodiment, the first model can specifically be a convolutional neural network model. The computer device can input the input image into the convolutional neural network model to extract its image features, then process the image features through the pooling layers and fully connected layers to obtain probability values for the classes to which the input image may belong, and take the class label corresponding to the maximum probability value as the class label corresponding to the input image.
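As a minimal sketch of the "fully connected layer, probabilities, maximum" step, assuming a pooled feature vector, a hypothetical label set, and illustrative fully-connected-layer weights (none of which are specified by the patent):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

labels = ["dog", "person", "house"]          # hypothetical label set
pooled = np.array([0.9, 0.1, 0.4])           # globally pooled image feature
W = np.array([[2.0, 0.1, 0.3],               # illustrative FC-layer weights
              [0.2, 1.5, 0.1],
              [0.1, 0.2, 1.8]])
b = np.zeros(3)

probs = softmax(W @ pooled + b)              # probability per class
class_label_text = labels[int(np.argmax(probs))]  # label of the max probability
```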
In one embodiment, the computer device can process the input image with a multi-task convolutional neural network to obtain multiple class label texts corresponding to the input image. A multi-task convolutional neural network is a convolutional neural network capable of multi-task learning. Its structure differs slightly from that of a single-task convolutional neural network: a single-task network, i.e., an independent neural network, implements a single input-to-output function, whereas a multi-task network can produce multiple outputs for one input, each output corresponding to one task. It can be understood that these outputs can connect to all neurons of the hidden layers they share; in these shared hidden layers, features useful for one task can also be exploited by other tasks, prompting multiple tasks to learn jointly, so that features learned by one network can help the learning of another.
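The shared-trunk, multiple-heads structure described above can be sketched as follows, assuming one shared hidden layer and two task-specific output heads with purely illustrative shapes and random weights (the patent specifies neither):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# One shared hidden layer, two task-specific heads (shapes are illustrative).
W_shared = rng.standard_normal((8, 16))
W_task_a = rng.standard_normal((16, 3))   # head 1: e.g. object class labels
W_task_b = rng.standard_normal((16, 5))   # head 2: e.g. scene class labels

x = rng.standard_normal(8)                # toy image feature vector
hidden = relu(x @ W_shared)               # representation shared by both tasks
out_a, out_b = hidden @ W_task_a, hidden @ W_task_b  # one output per task
```

Both heads read the same `hidden` activations, which is where features learned for one task become available to the other.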
S208: perform cross-modal fusion of the image features and the corresponding class label text to obtain a composite feature.
Here, cross-modal fusion is the fusion of data of different modalities. In this embodiment, the data of different modalities specifically refer to the image features corresponding to the input image and the text data corresponding to the class label text. Specifically, the computer device can map the extracted image features and the corresponding class label text into the same space, and then fuse the mapped data to obtain the composite feature.
In one embodiment, the image features of the input image are extracted with the first model, and the computer device can extract text features of the class label text with a recurrent neural network. Both image features and text features can be represented as vectors. Before fusing them, the computer device can convert the image features and text features into a canonical form so that both feature vectors fall in the same range; for example, the image features and the text features can each be normalized. Common normalization algorithms include function methods and probability-density methods. Function methods include, for example, the max-min function, the mean-variance function (which normalizes features to a consistent interval, for example with mean 0 and variance 1), and the hyperbolic sigmoid (S-shaped growth curve) function.
Further, the computer device can perform a fusion operation on the normalized image features and the text features corresponding to the class label text to obtain the composite feature. The algorithm for fusing image features and text features can specifically be an algorithm based on Bayesian decision theory, on sparse representation theory, or on deep learning theory. Alternatively, the computer device can fuse the image features and text features by computing a weighted sum of the two normalized vectors to obtain the composite feature.
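The simplest variant above, mean-variance normalization followed by a weighted sum, can be sketched as follows; the feature values and the fusion weight `alpha` are illustrative assumptions, not values from the patent:

```python
import numpy as np

def mean_variance_normalize(v, eps=1e-8):
    """Normalize to zero mean and unit variance (the mean-variance function)."""
    return (v - v.mean()) / (v.std() + eps)

image_feature = np.array([10.0, 22.0, 5.0, 13.0])   # toy image feature vector
text_feature = np.array([0.2, 0.9, 0.4, 0.1])       # toy class-label text feature

img_n = mean_variance_normalize(image_feature)
txt_n = mean_variance_normalize(text_feature)

alpha = 0.6                                          # hypothetical fusion weight
composite = alpha * img_n + (1 - alpha) * txt_n      # weighted-sum fusion
```

Normalizing first keeps one modality from dominating the sum simply because its raw values are larger.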
In one embodiment, the computer device can extract text features of the class label text with a recurrent neural network, perform attention distribution processing, i.e., attention processing, on the image features and text features to obtain attention distribution weights, i.e., attention values, and then combine the attention values with the features to obtain the composite feature.
Attention processing can be understood as selectively filtering a small amount of important information out of a large amount of information, focusing on that important information, and ignoring the mostly unimportant rest. The focusing is embodied in the computation of the attention distribution weights: the larger a weight, the more its corresponding image feature is focused on.
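A minimal sketch of this attention distribution, assuming per-region image feature vectors and a text feature of matching width (all values illustrative): relevance scores come from dot products, a softmax turns them into weights, and the weighted combination focuses on the region most related to the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One image feature vector per region, plus a text feature of the same width.
region_features = np.array([[1.0, 0.0, 0.2],
                            [0.1, 0.9, 0.3],
                            [0.4, 0.4, 0.4]])
text_feature = np.array([0.0, 1.0, 0.5])

scores = region_features @ text_feature        # one relevance score per region
attention_weights = softmax(scores)            # attention distribution weights
attended = attention_weights @ region_features # attention-weighted image feature
```

The second region, which aligns best with the text feature, receives the largest weight and so dominates `attended`.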
S210: process the composite feature with the second model to output the image description text of the input image.
Here, the image description text is text describing the input image, such as text identifying the objects in the input image or expressing the relationships between objects; the image description text can specifically be a word, a complete sentence, or a paragraph. The second model can specifically be a recurrent neural network model, for example an LSTM (Long Short-Term Memory) model.
Specifically, the computer device can input the composite feature into the second model, which processes the composite feature to output the image description text of the input image.
In one embodiment, step S210 can specifically include the following steps: obtaining a pre-description text corresponding to the input image; sequentially inputting the composite feature and each word vector of the pre-description text into the second model; and processing the sequentially input composite feature and word vectors with the second model to output the image description text of the input image.
Here, the pre-description text is text that describes the input image in advance; it can specifically be an initial, rougher description obtained after a first understanding of the input image. The pre-description text can be in the same language family as the image description text, or in a different one. For example, the pre-description text can describe the input image in Chinese while the image description text describes it in English.
In one embodiment, the computer device can obtain the pre-description text corresponding to the input image and obtain each word vector of the pre-description text. Using an encoder-decoder scheme, the computer device can use the composite feature as the input at the first time step and each word vector as the input at subsequent time steps, process the sequentially input composite feature and word vectors with the second model, and output the image description text. In this way, the second model can combine the composite feature with the pre-description text, so that the output image description text fits the input image more closely, which substantially improves the accuracy of image understanding information.
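The "composite feature first, word vectors after" decoding order can be sketched with a toy recurrent cell; a real implementation would use an LSTM, a trained vocabulary, and learned embeddings, so every weight, dimension, and embedding below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, vocab = 6, 4
W_h = rng.standard_normal((dim, dim)) * 0.1
W_x = rng.standard_normal((dim, dim)) * 0.1
W_out = rng.standard_normal((dim, vocab)) * 0.1
word_vectors = rng.standard_normal((vocab, dim))   # hypothetical embeddings

def step(h, x):
    """One toy recurrent step (an LSTM cell would be used in practice)."""
    return np.tanh(h @ W_h + x @ W_x)

composite_feature = rng.standard_normal(dim)
h = np.zeros(dim)
h = step(h, composite_feature)        # first time step: the fused feature
generated = []
x = word_vectors[0]                   # first word of the pre-description text
for _ in range(3):                    # subsequent steps: word vectors
    h = step(h, x)
    word_id = int(np.argmax(h @ W_out))
    generated.append(word_id)
    x = word_vectors[word_id]
```

Seeding the hidden state with the composite feature is what lets every generated word stay conditioned on the image and its class label text.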
With the above image processing method, the first model extracts the image features of the input image and determines the class label text corresponding to the input image, so the image features and corresponding class label text can be obtained quickly and accurately. Cross-modal fusion of the image features and the corresponding class label text yields a composite feature, which the second model processes to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself while incorporating the class information of the input image. Because the features of the input image are used so thoroughly, image understanding is guided jointly by the image features and the class label text, which substantially improves the accuracy of image understanding information and improves the computer device's ability to understand images.
In one embodiment, the step of extracting image features of the input image with the first model includes: determining, with the first model, multiple mutually different candidate regions in the input image; and extracting, with the first model, the image features of each candidate region separately.
Specifically, the computer device can process the input image with the first model to determine multiple targets in the input image, and determine multiple mutually different candidate regions (region proposals) in the input image according to the corresponding targets. The candidate regions differ from one another; they may partly overlap or not overlap at all, where overlap between candidate regions means that different candidate regions contain the same pixels. The computer device can extract the image features of each candidate region separately with the first model.
There are many algorithms for dividing the input image into candidate regions, for example the sliding-window method, Selective Search for Object Recognition, or the SSD (Single Shot MultiBox Detector) algorithm.
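Of the options listed, the sliding-window method is the simplest to sketch; assuming square windows and a fixed stride (both illustrative, and far cruder than Selective Search or SSD), candidate regions can be enumerated as follows:

```python
def sliding_window_regions(width, height, win=64, stride=32):
    """Enumerate (x, y, w, h) candidate regions; neighbours partly overlap."""
    regions = []
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            regions.append((x, y, win, win))
    return regions

candidates = sliding_window_regions(128, 128)  # 3 x 3 grid of windows
```

Because the stride is half the window size, adjacent candidates share pixels, matching the "may partly overlap" case described above.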
In one embodiment, the computer device can determine, by the first model and according to the image features corresponding to each candidate region, the class label text corresponding to each candidate region. For example, referring to Fig. 3, Fig. 3 shows a schematic diagram of an input image in one embodiment. As shown in Fig. 3, the input image includes a house, a brook, a dog, and a person; the brook is in front of the house, and the dog is by the brook, to the left of the person in front of the house. When the input image is input into the first model, the first model can determine multiple candidate regions, such as the regions A-D enclosed by dotted boxes in Fig. 3. Correspondingly, the first model can extract the image features of each candidate region separately and determine the class label text corresponding to each candidate region: for example, the class label text corresponding to candidate region A is "house", that corresponding to candidate region B is "person", that corresponding to candidate region C is "brook", and that corresponding to candidate region D is "dog".
In the above embodiment, multiple mutually different candidate regions in the input image are determined with the first model, and the image features of each candidate region are extracted separately, so as to determine multiple class label texts corresponding to the input image.
In one embodiment, step S210, i.e., the step of processing the composite feature with the second model and outputting the image description text of the input image, specifically includes: splicing the composite features corresponding to the candidate regions to obtain a splice feature; and processing the splice feature with the second model to output the image description text of the input image.
Specifically, the computer device can perform cross-modal fusion of the image features and class label text corresponding to each candidate region to obtain the composite feature of each candidate region. The computer device can then splice the composite features of the candidate regions to obtain the splice feature, process the splice feature with the second model, and output the image description text of the input image.
In one embodiment, after the computer device determines the mutually different candidate regions in the input image, it can select the candidate regions that satisfy a preset condition as target candidate regions, extract the image features of the target candidate regions, determine the class label text corresponding to each target candidate region, and perform cross-modal fusion of the image features and class label text corresponding to each target candidate region to obtain multiple composite features.
The preset condition can be, for example, that the ratio of the area of a candidate region to the area of the input image meets a preset ratio, or that the region is among those with the largest such ratios, such as the top three. The preset condition can also be, for example, the targets found to be most popular through network-model learning over big data, selecting a preset number of candidate regions containing the corresponding targets.
In the above embodiment, the composite features corresponding to the candidate regions are spliced to obtain the splice feature, and the image description text is output according to the splice feature, so the image information is used more fully and the image features and class label text are effectively combined, which substantially improves the accuracy of image understanding information.
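The splice operation itself is a plain concatenation of per-region composite features into one vector for the second model; the region count and feature length below are illustrative:

```python
import numpy as np

# One composite (fused) feature per candidate region (lengths illustrative).
composite_features = [np.array([0.1, 0.2]),
                      np.array([0.3, 0.4]),
                      np.array([0.5, 0.6])]

splice_feature = np.concatenate(composite_features)  # input to the second model
```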
In one embodiment, step S208, i.e., the step of performing cross-modal fusion of the image features and the corresponding class label text to obtain the composite feature, specifically includes the following steps:
S402: determine the coded data corresponding to the class label text.
Here, the coded data is the data obtained by encoding the class label text; the coded data can represent the class label text in this embodiment. Common encoding schemes include unipolar codes, polar codes, bipolar codes, return-to-zero codes, biphase codes, non-return-to-zero codes, Manchester encoding, differential Manchester encoding, and multilevel encoding.
In one embodiment, a mapping between class label texts and coded data may be preset in the computer device, and the coded data corresponding to a class label text is determined according to this mapping. For example, the class label text "dog" may be preset to correspond to the coded data "0001", "person" to "0002", "mountain" to "0003", "house" to "0101", and so on. When the computer device determines that the class label corresponding to an image feature is "dog", it can then determine the corresponding coded data "0001".
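The preset mapping can be sketched as a simple lookup table; the label-to-code pairs below mirror the examples in the text, while the function name is an invented placeholder.

```python
# Preset mapping between class label texts and coded data, following the
# example pairs given in the text.
LABEL_TO_CODE = {
    "dog": "0001",
    "person": "0002",
    "mountain": "0003",
    "house": "0101",
}

def encode_label(label_text):
    """Look up the coded data for a class label text."""
    return LABEL_TO_CODE[label_text]
```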
In one embodiment, the computer device may extract the text feature of the class label text through a recurrent neural network, and use the text feature as the coded data corresponding to the class label text.
S404: perform attention allocation processing on the image feature according to the coded data to obtain attention weights.
In one embodiment, the computer device may perform attention allocation processing on the image feature according to the coded data to obtain the attention weights.
In one embodiment, the computer device may map the coded data and the image feature into standard vectors in the same space according to a preset standardization rule, perform a dot product operation on the standard vectors corresponding to the coded data and the image feature to obtain an intermediate result, and then successively apply pooling processing (for example, sum pooling) and regression processing (for example, softmax) to the intermediate result to obtain the attention weights.
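The dot product, sum pooling, and softmax sequence described above can be sketched in plain Python; the vectors below are illustrative, and the coded data is assumed to have already been mapped into the same space as the per-region image features by the standardization rule.

```python
import math

def attention_weights(code_vec, image_vecs):
    """Dot-product the code vector with each region's feature vector,
    sum-pool the products, then softmax the pooled scores."""
    scores = []
    for region in image_vecs:
        products = [c * r for c, r in zip(code_vec, region)]
        scores.append(sum(products))  # sum pooling over the products
    # softmax regression over the pooled scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

code = [1.0, 0.0]                      # standardized coded-data vector
regions = [[2.0, 1.0], [0.0, 3.0]]     # standardized per-region features
weights = attention_weights(code, regions)
```

The region whose feature aligns best with the coded data receives the larger weight.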
S406: compute the comprehensive feature according to the attention weights and the image feature.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted comprehensive feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding class label text to obtain the comprehensive feature. The image feature and the corresponding class label text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the comprehensive feature. In the resulting comprehensive feature, the places the attention model focuses on more carry larger weights.
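Combining the attention weights with the image features can be sketched as a weighted sum over region features; the dimensions and weights below are illustrative.

```python
def weighted_fusion(weights, image_vecs):
    """Combine attention weights with region features: the comprehensive
    feature is the weight-scaled sum of the region feature vectors."""
    dim = len(image_vecs[0])
    fused = [0.0] * dim
    for w, region in zip(weights, image_vecs):
        for i in range(dim):
            fused[i] += w * region[i]
    return fused

regions = [[2.0, 0.0], [0.0, 2.0]]
comprehensive = weighted_fusion([0.75, 0.25], regions)
```

Elements from the more heavily weighted region dominate the fused result, which matches the statement that more important elements carry larger weights.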
In the above embodiment, attention allocation processing is performed on the image feature according to the corresponding coded data to obtain attention weights, and the attention weights are combined with the image feature to obtain the comprehensive feature, so that more important elements in the comprehensive feature carry larger weights. Image processing can thus focus on the target elements, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, the image processing method further includes: extracting the text content in the input image through the first model. The step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature then specifically includes: performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature to obtain the comprehensive feature.
Specifically, the input image contains text content. The computer device may use a multi-instance learning (Multiple Instance Learning) method to extract semantically meaningful text content from the input image, and perform cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature to obtain the comprehensive feature.
In one embodiment, the computer device determines multiple mutually different candidate regions in the input image through the first model. When the computer device extracts semantically meaningful text content from the input image, the text content can be associated with the corresponding candidate region. Accordingly, the computer device may perform cross-modal fusion on the image feature, text content, and class label text corresponding to each candidate region to obtain the comprehensive features.
In the above embodiment, by extracting the text content in the input image and performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, the features of the input image can be mined more fully and finely, making the image description text more accurate, further improving the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, the image processing method further includes steps for image question answering, which specifically include:
S502: obtain the question text corresponding to the input image.
The question text is text describing a question about the input image. For example, for the input image in Fig. 3, the corresponding question text may be "What is in front of the house?", "What is to the left of the house?", or "What is beside the brook?", and so on.
Specifically, the computer device may obtain a local text corresponding to the input image as the question text, or obtain the question text from another computer device through a communication connection such as a network connection or a USB (Universal Serial Bus) interface connection.
In one embodiment, the terminal may present an image display interface to the user, who can perform a selection operation in the interface; the terminal then takes the selected image as the input image. The terminal may display preset question texts alongside the input image shown in the image display interface. The user can perform a selection operation in the interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, the terminal may invoke a local voice collection device to collect voice data, then either recognize the voice data locally or send it to a server for recognition, to obtain the corresponding question text.
In one embodiment, after obtaining the input image and the corresponding question text, the terminal may execute the image processing method locally. Alternatively, the terminal may send the input image and the corresponding question text to a server, so that the server obtains them and executes the image processing method.
S504: extract the text feature of the question text.
Specifically, the computer device may extract the text feature of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the question text at the character, word, or whole-sentence level.
S506: perform attention allocation processing on the image feature according to the text feature to obtain attention weights.
In one embodiment, the computer device may perform attention allocation processing on the image feature according to the text feature to obtain the attention weights.
In one embodiment, the computer device may map the text feature and the image feature into standard vectors in the same space according to a preset standardization rule, perform a dot product operation on the standard vectors corresponding to the text feature and the image feature to obtain an intermediate result, and then successively apply pooling processing (for example, sum pooling) and regression processing (for example, softmax) to the intermediate result to obtain the attention weights.
S508: determine the weighted image feature according to the image feature and the attention weights.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted image feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding question text to obtain the weighted image feature. The image feature and the corresponding question text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the weighted image feature. In the resulting weighted image feature, the places more relevant to the question text carry larger weights.
S510: perform classification processing according to the weighted image feature to obtain the answer text corresponding to the question text.
Specifically, the computer device may perform classification processing on the weighted image feature through a machine learning classifier to obtain the class label text to which the weighted image feature belongs, and use that class label text as the answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image feature into a trained machine learning classifier for 3000-class classification, obtain the corresponding class label text, and use the class label text as the answer text corresponding to the question text.
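The final classification step can be sketched as follows; a trained classifier would score all 3000 classes, while the stand-in below scores only four invented labels so that the mapping from the highest score to an answer text is visible.

```python
# Hypothetical label set for illustration; the real classifier covers
# thousands of class label texts.
CLASS_LABELS = ["dog", "person", "mountain", "brook"]

def answer_from_scores(scores):
    """Pick the class label with the highest classifier score as the answer."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASS_LABELS[best]

answer_text = answer_from_scores([0.1, 0.2, 0.1, 0.6])
```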
For example, with reference to the input image of Fig. 3, when the question text corresponding to the input image is "What is in front of the house?", the answer text obtained by the above image processing method is "brook"; when the question text corresponding to the input image is "What is beside the brook?", the answer text obtained by the above image processing method is "dog".
In the above embodiment, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain attention weights, and the weighted image feature is determined according to the image feature and the attention weights. Classification processing is then performed on the weighted image feature to output the answer text corresponding to the question text. In this way, attention allocation processing can be performed on the image feature according to the text feature of the question text to obtain the weighted image feature, so that image processing can focus on the image features relevant to the question text. Performing classification processing on the weighted image feature can then greatly improve the accuracy of the answer text, that is, greatly improve the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, Fig. 6 shows a flow diagram of the image processing method. As shown in Fig. 6, the computer device may combine the first model, the second model, and the attention model to construct an image caption system for processing the input image to obtain its image description text. In the image caption system, the first model has a CNN model structure and the second model has an RNN model structure. In this way, a complete image caption system can process the input image and output the image understanding text corresponding to the input image.
As shown in Fig. 6, the input image (Image) can be fed into the image caption system. Multiple candidate regions (Region Proposal) are determined through a convolutional neural network model (CNN network structure), the image features (Feature map) of the corresponding candidate regions are extracted through the convolutional neural network model, and the class label text (Label) corresponding to each candidate region is determined through the convolutional neural network model. The attention model performs attention allocation processing on the class label text and the image feature to obtain the corresponding comprehensive feature. The comprehensive feature is input into a long short-term memory network model (LSTM network structure), which outputs the corresponding image description text (Image Caption).
As shown in Fig. 7, in a specific embodiment, the image processing method includes:
S702: obtain the input image.
S704: determine multiple mutually different candidate regions in the input image through the first model.
S706: extract the image feature of each candidate region through the first model.
S708: determine the class label text corresponding to the input image through the first model and according to the image features.
S710: determine the coded data corresponding to the class label text.
S712: perform attention allocation processing on the image features according to the coded data to obtain attention weights.
S714: compute the comprehensive features according to the attention weights and the image features.
S716: concatenate the comprehensive features corresponding to the candidate regions to obtain a concatenated feature.
S718: obtain the preliminary image description text corresponding to the input image.
S720: sequentially input the concatenated feature and each word vector of the preliminary image description text into the second model.
S722: process the sequentially input concatenated feature and word vectors through the second model, and output the image description text of the input image.
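The data flow of steps S702 to S722 can be sketched with stand-in components; every function below is a placeholder invented for illustration (the real first and second models are neural networks), so only the wiring between the steps is meaningful.

```python
def first_model_regions(image):
    return [image[0:2], image[2:4]]        # S704: candidate regions (stub)

def first_model_feature(region):
    return [float(x) for x in region]      # S706: image feature per region (stub)

def first_model_label(feature):
    return "dog" if sum(feature) > 1 else "grass"   # S708: class label (stub)

def fuse(feature, label):
    return feature + [float(len(label))]   # S710-S714: stand-in fusion

def second_model(spliced, prior_words):
    # S720-S722: stand-in decoder that just echoes the preliminary words;
    # a real second model would condition on the concatenated feature.
    return " ".join(prior_words)

image = [1, 2, 3, 4]
spliced = []
for region in first_model_regions(image):
    feat = first_model_feature(region)
    spliced += fuse(feat, first_model_label(feat))   # S716: concatenate
caption = second_model(spliced, ["a", "dog", "on", "grass"])
```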
In the above image processing method, the image feature of the input image is extracted through the first model and the class label text corresponding to the input image is determined, so the image feature and the corresponding class label text of the input image can be obtained quickly and accurately. The image feature and the corresponding class label text undergo cross-modal fusion to obtain a comprehensive feature, which is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, in combination with the class information to which the input image belongs. The features of the input image are thus used carefully and sufficiently, and the understanding of the image receives the dual guidance of the image feature and the class label text, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
Fig. 7 is a flow diagram of the image processing method in one embodiment. It should be understood that, although the steps in the flow diagram of Fig. 7 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
As shown in Fig. 8, in one embodiment, an image processing method is provided. This embodiment is mainly illustrated by applying the method to the computer device in Fig. 1, such as the terminal 110 or the server 120. With reference to Fig. 8, the image processing method specifically includes the following steps:
S802: obtain the input image and the question text corresponding to the input image.
Specifically, the computer device may obtain a local image and the corresponding text as the input image and the corresponding question text, or obtain the input image and the corresponding question text from another computer device through a communication connection such as a network connection or a USB interface connection.
In one embodiment, the terminal may capture an image in the current field of view of a camera and take the captured image as the input image. In one embodiment, the terminal may invoke a local voice collection device to collect voice data, then either recognize the voice data locally or send it to a server for recognition, to obtain the corresponding question text.
In one embodiment, the terminal may present an image display interface in which the user can perform a selection operation; the terminal takes the selected image as the input image. The images shown in the image display interface may be images stored locally on the terminal, or images obtained by the terminal from a server accessed through a network connection. The terminal may display preset question texts alongside the input image shown in the image display interface. The user can perform a selection operation in the interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, after obtaining the input image and the corresponding question text, the terminal may execute the image processing method locally. Alternatively, the terminal may send the input image and the corresponding question text to a server, so that the server obtains them and executes the image processing method.
S804: extract the image feature of the input image.
In one embodiment, the computer device may extract the image feature of the input image through a convolutional neural network, such as ResNet-80. The input image is fed into the convolutional neural network, whose convolutional layers perform convolution processing on the input image to extract its image feature. That is, after the convolutional neural network convolves the input image through its convolutional layers, it obtains the feature map of the input image; the feature map here is the image feature in this embodiment.
In one embodiment, the convolutional neural network is obtained by learning and training with the images and corresponding class labels in an image library (ImageNet) as training data. After obtaining the input image, the computer device feeds it into the convolutional neural network and extracts the image feature of the input image through the convolutional layer structure of the network.
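The convolution processing performed by a convolutional layer can be sketched as a minimal valid-padding 2D convolution; the kernel below is a simple edge detector chosen for illustration, not a trained ResNet filter, and real networks stack many such layers with learned kernels.

```python
def conv2d(image, kernel):
    """Valid-padding 2D convolution: slide the kernel over the image and
    accumulate elementwise products into a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1]]  # responds to vertical edges between dark and bright columns
fmap = conv2d(image, kernel)
```

The feature map lights up exactly where the dark-to-bright transition occurs, which is the sense in which a feature map encodes the image feature.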
S806: extract the text feature of the question text.
Specifically, the computer device may extract the text feature of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the question text at the character, word, or whole-sentence level.
S808: perform attention allocation processing on the image feature according to the text feature to obtain attention weights.
Specifically, the computer device may perform attention allocation processing on the image feature according to the text feature to obtain the attention weights.
In one embodiment, the computer device may map the text feature to a first standard feature and the image feature to a second standard feature, where the first and second standard features are features in the same mapping space. The first standard feature is added to the second standard feature, a nonlinear operation is then applied, and finally softmax processing is performed to obtain the attention weights.
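The additive scheme just described (add the aligned features, apply a nonlinearity, then softmax) can be sketched as follows; the features are assumed to be already mapped into the same space, the tanh nonlinearity and the sum reduction are illustrative choices, and all values are invented.

```python
import math

def additive_attention(text_feat, image_feats):
    """Add the text feature to each region feature, apply a tanh
    nonlinearity, reduce each region to one score, then softmax."""
    scores = []
    for region in image_feats:
        summed = [t + r for t, r in zip(text_feat, region)]
        activated = [math.tanh(x) for x in summed]  # nonlinear operation
        scores.append(sum(activated))               # reduce to one score
    # softmax processing over the region scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = additive_attention([0.5, 0.5], [[1.0, 1.0], [-1.0, -1.0]])
```

The region that reinforces the text feature ends up with the larger attention weight.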
In one embodiment, the computer device may map the text feature to a first standard feature and the image feature to a second standard feature, where the first and second standard features are features in the same mapping space. A dot product operation is then performed on the first standard feature and the second standard feature to obtain an intermediate feature, and pooling processing (for example, sum pooling) followed by regression processing (for example, softmax) is successively applied to the intermediate feature to obtain the attention weights.
S810: determine the weighted image feature according to the image feature and the attention weights.
Specifically, the computer device may combine the attention weights with the corresponding features to obtain the weighted image feature. In one embodiment, the computer device may use an attention model to perform the step of cross-modal fusion of the image feature and the corresponding question text to obtain the weighted image feature. The image feature and the corresponding question text are input into the attention model, which learns the weights automatically through its network structure to obtain the attention weights. The attention weights are then combined with the image feature to obtain the weighted image feature. In the resulting weighted image feature, the places more relevant to the question text carry larger weights.
S812: perform classification processing according to the weighted image feature to obtain the answer text corresponding to the question text.
Specifically, the computer device may perform classification processing on the weighted image feature through a machine learning classifier to obtain the class label text to which the weighted image feature belongs, and use that class label text as the answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image feature into a trained machine learning classifier for 3000-class classification, obtain the corresponding class label text, and use the class label text as the answer text corresponding to the question text.
In the above embodiment, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain attention weights, and the weighted image feature is determined according to the image feature and the attention weights. Classification processing is then performed on the weighted image feature to output the answer text corresponding to the question text. In this way, attention allocation processing can be performed on the image feature according to the text feature of the question text to obtain the weighted image feature, so that image processing can focus on the image features relevant to the question text. Performing classification processing on the weighted image feature can then greatly improve the accuracy of the answer text, that is, greatly improve the accuracy of image understanding information and the computer device's ability to understand images.
In one embodiment, step S806, that is, the step of extracting the text feature of the question text, specifically includes:
S902: obtain the character sequence corresponding to the question text.
Specifically, the computer device may split the question text to obtain a character sequence composed of single characters.
S904: perform word segmentation processing on the question text to obtain the word sequence corresponding to the question text.
Specifically, the computer device may use a word segmentation method to perform word segmentation on the question text and obtain a word sequence composed of words. The computer device may segment the question text using a dictionary-based segmentation algorithm, a segmentation model, or the like. The dictionary-based segmentation algorithm may specifically be a dictionary-based forward maximum matching algorithm, a reverse maximum matching algorithm, a minimum segmentation algorithm, bidirectional maximum matching, or the like. The segmentation model may specifically be a hidden Markov model, a CRF (conditional random field) model, or the like.
In one embodiment, after the computer device segments the question text, it removes stop words from the segmented words to obtain the word sequence. Stop words (Stop Words) are certain characters or words that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve retrieval efficiency, such as some very widely used words, modal particles, polite expressions, prepositions, or conjunctions.
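Dictionary-based forward maximum matching followed by stop-word removal can be sketched as follows; the tiny dictionary, stop-word list, and unspaced input string are invented for the example (real segmenters, especially for Chinese, use far larger resources).

```python
DICTIONARY = {"house", "in", "front", "of", "the", "what", "is"}
STOP_WORDS = {"the", "of", "in", "is", "what"}
MAX_WORD_LEN = 5

def forward_max_match(text):
    """Greedily take the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY or size == 1:
                words.append(candidate)
                i += size
                break
    return words

tokens = forward_max_match("whatisinfrontofthehouse")
word_sequence = [w for w in tokens if w not in STOP_WORDS]  # drop stop words
```

Only the content-bearing words survive the stop-word filter, which is the word sequence that would be passed on for feature extraction.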
S906: extract the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively.
Specifically, the computer device may extract the text features of the character sequence, the word sequence, and the whole sentence of the question text through a recurrent neural network.
In the above embodiment, the text features of the character sequence, the word sequence, and the whole sentence of the question text are extracted separately, so multi-level feature extraction can be performed on the question text at the character level, word level, and sentence level, fully mining the text information of the question text.
In one embodiment, step S808, that is, the step of performing attention allocation processing on the image feature according to the text feature to obtain attention weights, includes: performing attention allocation processing on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. Step S810, that is, the step of determining the weighted image feature according to the image feature and the attention weights, includes: determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.
Specifically, the computer device may perform attention allocation processing on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain the first, second, and third attention weights, and then determine the weighted image feature according to the first, second, and third attention weights in combination with the image feature.
In one embodiment, the computer device may perform weighting processing on the image feature according to the first, second, and third attention weights respectively, to obtain the corresponding first intermediate image features. The first intermediate image features are fused to obtain a second intermediate image feature, which is taken directly as the weighted image feature.
In one embodiment, the computer device may fuse the first, second, and third attention weights, for example by weighted summation, to obtain a comprehensive attention weight. The second intermediate image feature is obtained according to the comprehensive attention weight and the image feature, and is taken directly as the weighted image feature.
In one embodiment, the computer device may perform weighting processing on the image feature according to the first, second, and third attention weights to obtain the corresponding first intermediate image features, and fuse the first intermediate image features to obtain a second intermediate image feature. Attention allocation processing is then performed on the second intermediate image feature according to the text feature of the whole sentence of the question text to obtain a fourth attention weight, and the weighted image feature is determined according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the computer device combines the first, second, and third attention weights with the image feature respectively, obtaining first intermediate image features corresponding to the character level, word level, and sentence level of the question text. The computer device may superimpose the first intermediate image feature corresponding to the character level with the first intermediate image feature corresponding to the word level, and then superimpose the result with the first intermediate image feature corresponding to the sentence level, to obtain the second intermediate image feature.
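The superposition of character-level, word-level, and sentence-level intermediate features can be sketched as follows; the region features, per-level weights, and dimensions are all invented for illustration.

```python
def weight_regions(weights, regions):
    """One level's first intermediate feature: weighted sum of regions."""
    dim = len(regions[0])
    out = [0.0] * dim
    for w, region in zip(weights, regions):
        for i in range(dim):
            out[i] += w * region[i]
    return out

regions = [[1.0, 0.0], [0.0, 1.0]]
# Illustrative attention weights from the character, word, and sentence levels.
char_w, word_w, sent_w = [0.5, 0.5], [0.8, 0.2], [0.2, 0.8]

intermediates = [weight_regions(w, regions) for w in (char_w, word_w, sent_w)]
# Superimpose the three first intermediate features elementwise.
second_intermediate = [sum(vals) for vals in zip(*intermediates)]
```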
In one embodiment, the computer device may perform attention allocation processing again on the second intermediate image feature according to the text feature of the whole sentence of the question text, obtaining the fourth attention weight, and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight. In the above embodiment, the second intermediate image feature is obtained after multi-level attention allocation processing between the question text and the image feature. Attention allocation processing is then applied to the second intermediate image feature according to the text feature of the whole sentence of the question text to obtain the weighted image feature, so that the emphasis of the weighted image feature is closer to the content of the question text, which in turn can improve the accuracy of the answer text subsequently obtained by classifying the weighted image feature.
In the above embodiment, attention allocation processing is performed on the image feature according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, obtaining the first, second, and third attention weights, and the weighted image feature is determined according to the first, second, and third attention weights in combination with the image feature. In this way, the text information of the question text can be fully mined, so that the emphasis of the weighted image feature is closer to the content of the question text, which in turn can improve the accuracy of the answer text subsequently obtained by classifying the weighted image feature.
In one embodiment, Fig. 10 shows a flow chart of the image processing method. As shown in Fig. 10, the computer device may extract the image feature of the input image through a convolutional neural network and the text feature of the question text through a recurrent neural network. The weighted image feature is input into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text. In this embodiment, the computer device may combine a convolutional neural network, a recurrent neural network, and a machine learning classifier to construct a visual question answering system.
As shown in Figure 10, input picture (image) can be input in the vision question answering system, passes through convolutional neural networks
The characteristics of image (feature map) of model (CNN network structure) extraction input picture.Question text is input to the vision to ask
It answers in system, the text feature (question of question text is extracted by shot and long term memory network model (LSTM network structure)
feature).Automobile driving processing (Attention processing) is done to characteristics of image and text feature, then does recurrence processing
(softmax processing), the power that gains attention weight (Attention value).According to attention weight and characteristics of image, is obtained
Two intermediate image features (Attention map).By the second intermediate image feature (Attention map) and the whole sentence of question text
Automobile driving processing (Attention) is done, weighted image feature is obtained.Weighted image feature is input to machine learning classification
Classified in device (Classification) processing, obtain answer text (Answer) corresponding with question text.
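The pipeline of Figure 10 can be outlined in code roughly as below. This is a structural sketch only: the stub functions stand in for the CNN, the LSTM, and the classifier, whose real architectures and weights the patent does not fix.

```python
import numpy as np

rng = np.random.default_rng(1)

def cnn_extract(image):
    """Stub for the CNN: returns a (regions, dim) feature map."""
    return rng.standard_normal((49, 64))

def lstm_extract(question):
    """Stub for the LSTM: returns a (dim,) question feature."""
    return rng.standard_normal(64)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classifier(feat, n_answers=10):
    """Stub linear classifier over a fixed set of candidate answers."""
    logits = rng.standard_normal((n_answers, feat.shape[0])) @ feat
    return int(np.argmax(logits))

def vqa(image, question):
    feature_map = cnn_extract(image)                      # CNN image feature
    q_feat = lstm_extract(question)                       # LSTM text feature
    attn = softmax(feature_map @ q_feat)                  # attention + softmax
    attn_map = attn[:, None] * feature_map                # second intermediate feature
    attn2 = softmax(attn_map @ q_feat)                    # re-attend with whole sentence
    weighted = (attn2[:, None] * attn_map).sum(axis=0)    # weighted image feature
    return classifier(weighted)                           # answer index

answer = vqa(image=None, question="what color is the cat?")
print(answer)
```

In a real system the classifier's output index would be mapped back to an answer text from a fixed answer vocabulary.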
In one embodiment, the computer device may also perform attention allocation processing on the image feature and the text feature in a co-attention (coordinated attention allocation) manner. Co-attention processing mainly refers to performing attention allocation processing on the image feature according to the text feature, performing attention allocation processing on the text feature according to the image feature, and then combining the results of the two; details are not repeated here.
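A minimal illustration of the co-attention idea, under the assumption of simple dot-product affinity scoring and max-pooling (the patent leaves the exact form open):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
image_feats = rng.standard_normal((49, 64))   # image region features
text_feats = rng.standard_normal((12, 64))    # question token features

# Affinity between every image region and every question token.
affinity = image_feats @ text_feats.T         # (49, 12)

# Attend to image regions guided by the text, and to tokens guided by the image.
img_attn = softmax(affinity.max(axis=1))      # (49,)
txt_attn = softmax(affinity.max(axis=0))      # (12,)

attended_image = img_attn @ image_feats       # (64,)
attended_text = txt_attn @ text_feats         # (64,)

# Combine the two attended features (concatenation, as one choice).
combined = np.concatenate([attended_image, attended_text])
print(combined.shape)  # (128,)
```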
As shown in Figure 11, in one specific embodiment, the image processing method includes the following steps:

S1102: obtain an input image and a question text corresponding to the input image.

S1104: extract the image feature of the input image through a convolutional neural network.

S1106: obtain a character sequence corresponding to the question text.

S1108: perform word segmentation processing on the question text, to obtain a word sequence corresponding to the question text.

S1110: extract the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively through a recurrent neural network.

S1112: perform attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight.

S1114: perform weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features.

S1116: fuse the first intermediate image features, to obtain a second intermediate image feature.

S1118: perform attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight.

S1120: determine the weighted image feature according to the second intermediate image feature and the fourth attention weight.

S1122: input the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above image processing method, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
Figure 11 is a schematic flow chart of the image processing method in one embodiment. It should be understood that, although the steps in the flow chart of Figure 11 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and these steps may be performed in other orders. Moreover, at least some of the steps in Figure 11 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment, but may be performed at different times; their execution order is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In a specific application scenario, a user may input a new image into the above image processing system, and the image processing system executes the above image processing method to provide an understanding of the image. For example, the image processing system may output the image description text of the image. Alternatively, the user may pose several questions about a given image, and the image processing system, executing the above image processing method, may output the corresponding answer texts. In the education sector in particular, the above image processing method can help users quickly and effectively understand the semantic information in a picture and engage in question-and-answer interaction with users, which is especially helpful for children, the elderly, people with visual impairment, and people with language comprehension disorders.
As shown in Figure 12, in one embodiment, an image processing apparatus 1200 is provided, including: an obtaining module 1201, an extraction module 1202, a determining module 1203, a fusion module 1204, and an output module 1205.

The obtaining module 1201 is configured to obtain an input image.

The extraction module 1202 is configured to extract the image feature of the input image through a first model.

The determining module 1203 is configured to determine, through the first model and according to the image feature, the class label text corresponding to the input image.

The fusion module 1204 is configured to perform cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature.

The output module 1205 is configured to process the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, the extraction module 1202 is further configured to determine, through the first model, multiple mutually different candidate regions in the input image, and to extract, through the first model, the image feature of each candidate region respectively.

In one embodiment, the output module 1205 is further configured to concatenate the comprehensive features corresponding to the candidate regions to obtain a spliced feature, and to process the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, the fusion module 1204 is further configured to determine coded data corresponding to the class label text; perform attention allocation processing on the image feature according to the coded data, to obtain an attention weight; and calculate the comprehensive feature according to the attention weight and the image feature.
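The label-guided fusion performed by the fusion module can be sketched as follows. Dot-product attention over region features using an encoded label vector is an illustrative assumption; the patent does not fix the encoding or the scoring function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
image_feats = rng.standard_normal((49, 64))   # region-level image features

# Coded data for the class label text, e.g. an embedding of the label "dog".
label_code = rng.standard_normal(64)

# Attention allocation over image regions, guided by the label code.
attn_weight = softmax(image_feats @ label_code)

# Comprehensive feature: attention-weighted combination of the image features.
comprehensive = attn_weight @ image_feats
print(comprehensive.shape)  # (64,)
```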
In one embodiment, the extraction module 1202 is further configured to extract the text content in the input image through the first model. The fusion module 1204 is further configured to perform cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, to obtain the comprehensive feature.
In one embodiment, the output module 1205 is further configured to obtain a preliminary image description text corresponding to the input image; sequentially input the comprehensive feature and the word vectors of the preliminary image description text into the second model; and process the sequentially input comprehensive feature and word vectors through the second model, to output the image description text of the input image.
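Feeding the comprehensive feature first and then word vectors one by one into a sequential second model can be sketched as below. The plain recurrent cell, tiny vocabulary, and greedy decoding are illustrative simplifications, not the patent's specified architecture.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<start>", "a", "dog", "runs", "<end>"]
dim = 16
embed = rng.standard_normal((len(vocab), dim))   # word vectors
W_h = rng.standard_normal((dim, dim)) * 0.1      # recurrent weights
W_x = rng.standard_normal((dim, dim)) * 0.1      # input weights
W_o = rng.standard_normal((len(vocab), dim))     # output projection

def step(h, x):
    """One recurrent step: new hidden state from previous state and input."""
    return np.tanh(h @ W_h.T + x @ W_x.T)

def decode(comprehensive_feat, max_len=6):
    # First feed the comprehensive feature, then the word vectors one by one.
    h = step(np.zeros(dim), comprehensive_feat)
    token = "<start>"
    out = []
    for _ in range(max_len):
        h = step(h, embed[vocab.index(token)])
        token = vocab[int(np.argmax(W_o @ h))]
        if token == "<end>":
            break
        out.append(token)
    return " ".join(out)

caption = decode(rng.standard_normal(dim))
print(caption)
```

With trained weights, the emitted token sequence would be the image description text.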
As shown in Figure 13, in one embodiment, the image processing apparatus 1200 further includes an attention allocation processing module 1206.

The obtaining module 1201 is further configured to obtain a question text corresponding to the input image.

The extraction module 1202 is further configured to extract the text feature of the question text.

The attention allocation processing module 1206 is configured to perform attention allocation processing on the image feature according to the text feature, to obtain an attention weight.

The determining module 1203 is further configured to determine the weighted image feature according to the image feature and the attention weight.

The output module 1205 is further configured to perform classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
According to the above image processing apparatus, the image feature of the input image is extracted through the first model, and the class label text corresponding to the input image is determined, so that the image feature of the input image and the corresponding class label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image feature and the corresponding class label text to obtain the comprehensive feature, and the comprehensive feature is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, and can also draw on the classification information to which the input image belongs. By thus carefully and fully using the features of the input image, the dual guidance of the image feature and the class label text is obtained when understanding the image, which greatly improves the accuracy of the image understanding information and improves the ability of the computer device to understand images.
As shown in Figure 14, in one embodiment, an image processing apparatus 1400 is provided, including: an obtaining module 1401, an extraction module 1402, an attention allocation processing module 1403, a determining module 1404, and a classification module 1405.

The obtaining module 1401 is configured to obtain an input image and a question text corresponding to the input image.

The extraction module 1402 is configured to extract the image feature of the input image, and is further configured to extract the text feature of the question text.

The attention allocation processing module 1403 is configured to perform attention allocation processing on the image feature according to the text feature, to obtain an attention weight.

The determining module 1404 is configured to determine the weighted image feature according to the image feature and the attention weight.

The classification module 1405 is configured to perform classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
In one embodiment, the extraction module 1402 is further configured to obtain a character sequence corresponding to the question text; perform word segmentation processing on the question text, to obtain a word sequence corresponding to the question text; and extract the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.

In one embodiment, the attention allocation processing module 1403 is further configured to perform attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. The determining module 1404 is further configured to determine the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.

In one embodiment, the determining module 1404 is further configured to perform weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fuse the first intermediate image features, to obtain a second intermediate image feature; perform attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight; and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the attention allocation processing module 1403 is further configured to map the text feature into a first standard feature; map the image feature into a second standard feature; perform a dot-product operation on the first standard feature and the second standard feature, to obtain an intermediate feature; and sequentially perform pooling processing and regression processing on the intermediate feature, to obtain the attention weight.
In one embodiment, the extraction module 1402 is further configured to extract the image feature of the input image through a convolutional neural network, and to extract the text feature of the question text through a recurrent neural network. The classification module 1405 is further configured to input the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above image processing apparatus, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
Figure 15 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be the terminal 110 in Figure 1. As shown in Figure 15, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the image processing method. A computer program may also be stored in the internal memory, and when executed by the processor, causes the processor to perform the image processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, or a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
Figure 16 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be the server 120 in Figure 1. As shown in Figure 16, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the image processing method. A computer program may also be stored in the internal memory, and when executed by the processor, causes the processor to perform the image processing method.
Those skilled in the art will understand that the structures shown in Figure 15 and Figure 16 are merely block diagrams of partial structures related to the solution of this application, and do not constitute a limitation on the computer device to which the solution of this application is applied. A specific computer device may include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.
In one embodiment, the image processing apparatus provided in this application may be implemented in the form of a computer program, and the computer program may run on the computer device shown in Figure 15 or Figure 16. The memory of the computer device may store the program modules constituting the image processing apparatus, for example, the obtaining module, extraction module, determining module, fusion module, and output module shown in Figure 12, or the obtaining module, extraction module, attention allocation processing module, determining module, and classification module shown in Figure 14. The computer program constituted by the program modules causes the processor to perform the steps of the image processing method of each embodiment of this application described in this specification.
For example, the computer device shown in Figure 15 or Figure 16 may perform step S202 through the obtaining module in the image processing apparatus shown in Figure 12, perform step S204 through the extraction module, perform step S206 through the determining module, perform step S208 through the fusion module, and perform step S210 through the output module.

For example, the computer device shown in Figure 15 or Figure 16 may perform step S802 through the obtaining module in the image processing apparatus shown in Figure 14, perform steps S804 and S806 through the extraction module, perform step S808 through the attention allocation processing module, perform step S810 through the determining module, and perform step S812 through the classification module.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining an input image; extracting the image feature of the input image through a first model; determining, through the first model and according to the image feature, the class label text corresponding to the input image; performing cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature; and processing the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, when executing the step of extracting the image feature of the input image through the first model, the computer program causes the processor to specifically perform the following steps: determining, through the first model, multiple mutually different candidate regions in the input image; and extracting, through the first model, the image feature of each candidate region respectively.

In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: concatenating the comprehensive features corresponding to the candidate regions, to obtain a spliced feature; and processing the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, when executing the step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature, the computer program causes the processor to specifically perform the following steps: determining coded data corresponding to the class label text; performing attention allocation processing on the image feature according to the coded data, to obtain an attention weight; and calculating the comprehensive feature according to the attention weight and the image feature.

In one embodiment, the computer program further causes the processor to perform the following step: extracting the text content in the input image through the first model. When executing the step of performing cross-modal fusion on the image feature and the corresponding class label text to obtain the comprehensive feature, the computer program causes the processor to specifically perform the following step: performing cross-modal fusion on the image feature, the text content corresponding to the image feature, and the class label text corresponding to the image feature, to obtain the comprehensive feature.
In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: obtaining a preliminary image description text corresponding to the input image; sequentially inputting the comprehensive feature and the word vectors of the preliminary image description text into the second model; and processing the sequentially input comprehensive feature and word vectors through the second model, to output the image description text of the input image.

In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a question text corresponding to the input image; extracting the text feature of the question text; performing attention allocation processing on the image feature according to the text feature, to obtain an attention weight; determining the weighted image feature according to the image feature and the attention weight; and performing classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
According to the above computer device, the image feature of the input image is extracted through the first model, and the class label text corresponding to the input image is determined, so that the image feature of the input image and the corresponding class label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image feature and the corresponding class label text to obtain the comprehensive feature, and the comprehensive feature is then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself, and can also draw on the classification information to which the input image belongs. By thus carefully and fully using the features of the input image, the dual guidance of the image feature and the class label text is obtained when understanding the image, which greatly improves the accuracy of the image understanding information and improves the ability of the computer device to understand images.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: obtaining an input image and a question text corresponding to the input image; extracting the image feature of the input image; extracting the text feature of the question text; performing attention allocation processing on the image feature according to the text feature, to obtain an attention weight; determining the weighted image feature according to the image feature and the attention weight; and performing classification processing according to the weighted image feature, to obtain the answer text corresponding to the question text.
In one embodiment, when executing the step of extracting the text feature of the question text, the computer program causes the processor to specifically perform the following steps: obtaining a character sequence corresponding to the question text; performing word segmentation processing on the question text, to obtain a word sequence corresponding to the question text; and extracting the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.

In one embodiment, when executing the step of performing attention allocation processing on the image feature according to the text feature to obtain the attention weight, the computer program causes the processor to specifically perform the following step: performing attention allocation processing on the image feature according to the character sequence, the word sequence, and the text feature of the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. When executing the step of determining the weighted image feature according to the image feature and the attention weight, the computer program causes the processor to specifically perform the following step: determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image feature.
In one embodiment, when executing the step of determining the weighted image feature according to the first attention weight, the second attention weight, and the third attention weight in combination with the image feature, the computer program causes the processor to specifically perform the following steps: performing weighting processing on the image feature according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features, to obtain a second intermediate image feature; performing attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, to obtain a fourth attention weight; and determining the weighted image feature according to the second intermediate image feature and the fourth attention weight.
In one embodiment, when executing the step of performing attention allocation processing on the image feature according to the text feature to obtain the attention weight, the computer program causes the processor to specifically perform the following steps: mapping the text feature into a first standard feature; mapping the image feature into a second standard feature; performing a dot-product operation on the first standard feature and the second standard feature, to obtain an intermediate feature; and sequentially performing pooling processing and regression processing on the intermediate feature, to obtain the attention weight.
In one embodiment, when executing the step of extracting the image feature of the input image, the computer program causes the processor to specifically perform the following step: extracting the image feature of the input image through a convolutional neural network. When executing the step of extracting the text feature of the question text, the computer program causes the processor to specifically perform the following step: extracting the text feature of the question text through a recurrent neural network. When executing the step of performing classification processing according to the weighted image feature to obtain the answer text corresponding to the question text, the computer program causes the processor to specifically perform the following step: inputting the weighted image feature into a machine learning classifier for classification processing, to obtain the answer text corresponding to the question text.
According to the above computer device, the image feature of the input image is extracted, the text feature of the question text corresponding to the input image is extracted, attention allocation processing is performed on the image feature according to the text feature to obtain an attention weight, and the weighted image feature is determined according to the image feature and the attention weight. Classification processing is then performed according to the weighted image feature, and the answer text corresponding to the question text is output. In this way, attention allocation processing can be performed on the image feature according to the text feature corresponding to the question text to obtain the weighted image feature, so that during image processing attention is focused on the image features relevant to the question text; the accuracy of the answer text obtained by performing classification processing on the weighted image feature can thus be greatly improved, that is, the accuracy of the image understanding information is greatly improved, and the ability of the computer device to understand images is improved.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps: obtaining an input image; extracting the image feature of the input image through a first model; determining, through the first model and according to the image feature, the class label text corresponding to the input image; performing cross-modal fusion on the image feature and the corresponding class label text, to obtain a comprehensive feature; and processing the comprehensive feature through a second model, to output the image description text of the input image.
In one embodiment, when executing the step of extracting the image feature of the input image through the first model, the computer program causes the processor to specifically perform the following steps: determining, through the first model, multiple mutually different candidate regions in the input image; and extracting, through the first model, the image feature of each candidate region respectively.

In one embodiment, when executing the step of processing the comprehensive feature through the second model to output the image description text of the input image, the computer program causes the processor to specifically perform the following steps: concatenating the comprehensive features corresponding to the candidate regions, to obtain a spliced feature; and processing the spliced feature through the second model, to output the image description text of the input image.
In one embodiment, when causing the processor to perform the step of performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features, the computer program specifically causes the processor to perform the following steps: determining encoded data corresponding to the class label text; performing attention allocation processing on the image features according to the encoded data to obtain attention weights; and computing comprehensive features according to the attention weights and the image features.
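As a rough sketch of the cross-modal fusion step just described, the encoded class label text can score each region's image features, the scores can be normalized into attention weights, and the comprehensive feature computed as the weighted sum. The dot-product scoring, softmax normalization, and all dimensions below are illustrative assumptions, not the patent's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_modal_fusion(region_features, label_encoding):
    """Attention-allocate over region features using the encoded class
    label text, then pool them into one comprehensive feature.

    region_features: (num_regions, dim) image features
    label_encoding:  (dim,) encoding of the class label text
    """
    scores = region_features @ label_encoding   # relevance of each region
    attention = softmax(scores)                 # attention weights, sum to 1
    return attention @ region_features          # weighted sum -> (dim,)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
label = rng.normal(size=8)
fused = cross_modal_fusion(feats, label)
```

A trained system would learn the label encoding and likely a more elaborate scoring function; the shape of the computation is what matters here.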
In one embodiment, the computer program further causes the processor to perform the following step: extracting text content in the input image through the first model. When causing the processor to perform the step of performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features, the computer program specifically causes the processor to perform the following step: performing cross-modal fusion on the image features, the text content corresponding to the image features, and the class label text corresponding to the image features to obtain comprehensive features.
In one embodiment, when causing the processor to perform the step of processing the comprehensive features through the second model to output the image description text of the input image, the computer program specifically causes the processor to perform the following steps: obtaining a preliminary image description text corresponding to the input image; sequentially inputting the comprehensive features and each word vector of the preliminary image description text to the second model; and processing the sequentially input comprehensive features and word vectors through the second model to output the image description text of the input image.
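The sequential input just described (comprehensive features first, then each word vector of the preliminary description) could be sketched with a minimal recurrent cell. The tanh cell, the shared input weights for both feature types, and all sizes are assumptions for illustration only, not the patent's second model:

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One step of a minimal tanh recurrent cell."""
    return np.tanh(h @ W_h + x @ W_x)

def decode(comprehensive, word_vectors, W_h, W_x):
    """Feed the comprehensive feature first, then each word vector of the
    preliminary description, mirroring the sequential input above."""
    h = np.zeros(W_h.shape[0])
    h = rnn_step(h, comprehensive, W_h, W_x)   # step 0: fused feature
    states = []
    for wv in word_vectors:                    # steps 1..T: word vectors
        h = rnn_step(h, wv, W_h, W_x)
        states.append(h)
    # A real model would project each state onto a vocabulary distribution
    # to emit the next description word; we return the states themselves.
    return np.array(states)

rng = np.random.default_rng(0)
W_h = rng.normal(size=(16, 16)) * 0.1
W_x = rng.normal(size=(8, 16)) * 0.1
comprehensive = rng.normal(size=8)
word_vecs = rng.normal(size=(3, 8))
states = decode(comprehensive, word_vecs, W_h, W_x)
```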
In one embodiment, the computer program further causes the processor to perform the following steps: obtaining a question text corresponding to the input image; extracting text features of the question text; performing attention allocation processing on the image features according to the text features to obtain attention weights; determining weighted image features according to the image features and the attention weights; and performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
With the above computer-readable storage medium, the image features of the input image are extracted through the first model and the class label text corresponding to the input image is determined, so the image features and the corresponding class label text of the input image can be obtained quickly and accurately. Cross-modal fusion is performed on the image features and the corresponding class label text to obtain comprehensive features, which are then processed through the second model to obtain the image description text. In this way, during processing the second model can make full use of the image features of the input image itself while also drawing on the class information to which the input image belongs. The features of the input image are thus used carefully and thoroughly, and the understanding of the image is guided by both the image features and the class label text, which greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
A computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the following steps: obtaining an input image and a question text corresponding to the input image; extracting image features of the input image; extracting text features of the question text; performing attention allocation processing on the image features according to the text features to obtain attention weights; determining weighted image features according to the image features and the attention weights; and performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
In one embodiment, when causing the processor to perform the step of extracting the text features of the question text, the computer program specifically causes the processor to perform the following steps: obtaining a character sequence corresponding to the question text; performing word segmentation on the question text to obtain a word sequence corresponding to the question text; and extracting text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.
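To illustrate the distinction between the character sequence and the segmented word sequence of a question, consider the toy example below. The sample question and its hand-written segmentation are hypothetical; a real system would use a trained segmenter (e.g., a tool such as jieba) rather than a fixed list:

```python
# Character sequence vs. word sequence for a Chinese question.
question = "红色的汽车在哪里"  # "Where is the red car"

# Character sequence: one entry per character, no segmentation needed.
chars = list(question)

# Word sequence: produced by word segmentation. The split below is
# hand-written purely for illustration.
words = ["红色", "的", "汽车", "在", "哪里"]
```

Extracting features at both granularities (plus the whole sentence) lets the attention mechanism later in the pipeline weigh image regions against the question at multiple levels of detail.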
In one embodiment, when causing the processor to perform the step of performing attention allocation processing on the image features according to the text features to obtain attention weights, the computer program specifically causes the processor to perform the following step: performing attention allocation processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. When causing the processor to perform the step of determining weighted image features according to the image features and the attention weights, the computer program specifically causes the processor to perform the following step: determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image features.
In one embodiment, when causing the processor to perform the step of determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features, the computer program specifically causes the processor to perform the following steps: weighting the image features according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features to obtain a second intermediate image feature; performing attention allocation processing on the second intermediate image feature according to the text features of the whole sentence of the question text, to obtain a fourth attention weight; and determining weighted image features according to the second intermediate image feature and the fourth attention weight.
In one embodiment, when causing the processor to perform the step of performing attention allocation processing on the image features according to the text features to obtain attention weights, the computer program specifically causes the processor to perform the following steps: mapping the text features to a first standard feature; mapping the image features to a second standard feature; performing element-wise (dot-product) multiplication on the first standard feature and the second standard feature to obtain an intermediate feature; and successively performing pooling processing and regression processing on the intermediate feature to obtain attention weights.
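A minimal sketch of this attention computation, assuming learned linear maps into the shared "standard" space, sum-pooling over the feature axis, and softmax as the regression step (all of these specifics are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(text_feat, region_feats, W_t, W_v):
    """Map both modalities to a shared 'standard' space, multiply
    element-wise, pool, and regress to per-region attention weights."""
    t = text_feat @ W_t            # first standard feature,  shape (k,)
    v = region_feats @ W_v         # second standard feature, shape (n, k)
    inter = v * t                  # element-wise (dot) product per region
    pooled = inter.sum(axis=1)     # pooling over the feature axis -> (n,)
    return softmax(pooled)         # regression: weights sum to 1

rng = np.random.default_rng(0)
text = rng.normal(size=6)
regions = rng.normal(size=(5, 10))
W_t = rng.normal(size=(6, 16))
W_v = rng.normal(size=(10, 16))
w = attention_weights(text, regions, W_t, W_v)
```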
In one embodiment, when causing the processor to perform the step of extracting the image features of the input image, the computer program specifically causes the processor to perform the following step: extracting the image features of the input image through a convolutional neural network. When causing the processor to perform the step of extracting the text features of the question text, the computer program specifically causes the processor to perform the following step: extracting the text features of the question text through a recurrent neural network. When causing the processor to perform the step of performing classification according to the weighted image features to obtain the answer text corresponding to the question text, the computer program specifically causes the processor to perform the following step: inputting the weighted image features into a machine-learning classifier for classification to obtain the answer text corresponding to the question text.
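Putting these pieces together, a toy version of the classification stage could look like this. The linear classifier over a fixed answer vocabulary and the random weights are purely illustrative stand-ins for the trained CNN, RNN, and machine-learning classifier:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer(region_feats, text_feat, W_cls, answers):
    """Question-guided attention over region features, then a linear
    classifier over a fixed answer vocabulary (assumed setup)."""
    weights = softmax(region_feats @ text_feat)   # attention allocation
    weighted = weights @ region_feats             # weighted image feature
    logits = weighted @ W_cls                     # classifier scores
    return answers[int(np.argmax(logits))]        # highest-scoring answer

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))    # e.g. CNN features of 4 regions
tfeat = rng.normal(size=8)           # e.g. RNN feature of the question
W_cls = rng.normal(size=(8, 3))
answers = ["yes", "no", "two"]
pred = answer(regions, tfeat, W_cls, answers)
```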
With the above computer-readable storage medium, the image features of the input image are extracted, the text features of the question text corresponding to the input image are extracted, attention allocation processing is performed on the image features according to the text features to obtain attention weights, and weighted image features are determined according to the image features and the attention weights. Classification is then performed according to the weighted image features to output the answer text corresponding to the question text. In this way, attention allocation can be applied to the image features according to the text features of the question text to obtain weighted image features, so that processing focuses on the image features relevant to the question text; classifying the weighted image features then greatly improves the accuracy of the answer text, that is, it greatly improves the accuracy of image understanding information and the computer device's ability to understand images.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (17)
1. An image processing method, comprising:
obtaining an input image;
extracting image features of the input image through a first model;
determining, through the first model and according to the image features, class label text corresponding to the input image;
performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features; and
processing the comprehensive features through a second model to output image description text of the input image.
2. The method according to claim 1, wherein extracting the image features of the input image through the first model comprises:
determining, through the first model, a plurality of mutually different candidate regions in the input image; and
extracting, through the first model, the image features of each candidate region respectively.
3. The method according to claim 2, wherein processing the comprehensive features through the second model to output the image description text of the input image comprises:
concatenating the comprehensive features corresponding to each candidate region to obtain a concatenated feature; and
processing the concatenated feature through the second model to output the image description text of the input image.
4. The method according to claim 1, wherein performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features comprises:
determining encoded data corresponding to the class label text;
performing attention allocation processing on the image features according to the encoded data to obtain attention weights; and
computing comprehensive features according to the attention weights and the image features.
5. The method according to claim 1, further comprising:
extracting text content in the input image through the first model;
wherein performing cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features comprises:
performing cross-modal fusion on the image features, the text content corresponding to the image features, and the class label text corresponding to the image features to obtain comprehensive features.
6. The method according to claim 1, wherein processing the comprehensive features through the second model to output the image description text of the input image comprises:
obtaining a preliminary image description text corresponding to the input image;
sequentially inputting the comprehensive features and each word vector of the preliminary image description text to the second model; and
processing the sequentially input comprehensive features and word vectors through the second model to output the image description text of the input image.
7. The method according to any one of claims 1 to 6, further comprising:
obtaining a question text corresponding to the input image;
extracting text features of the question text;
performing attention allocation processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
8. An image processing method, comprising:
obtaining an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting text features of the question text;
performing attention allocation processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights; and
performing classification according to the weighted image features to obtain an answer text corresponding to the question text.
9. The method according to claim 8, wherein extracting the text features of the question text comprises:
obtaining a character sequence corresponding to the question text;
performing word segmentation on the question text to obtain a word sequence corresponding to the question text; and
extracting text features of the character sequence, the word sequence, and the whole sentence of the question text respectively.
10. The method according to claim 9, wherein performing attention allocation processing on the image features according to the text features to obtain attention weights comprises:
performing attention allocation processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text respectively, to obtain a first attention weight, a second attention weight, and a third attention weight;
and wherein determining weighted image features according to the image features and the attention weights comprises:
determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight, in combination with the image features.
11. The method according to claim 10, wherein determining weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features comprises:
weighting the image features according to the first attention weight, the second attention weight, and the third attention weight respectively, to obtain corresponding first intermediate image features;
fusing the first intermediate image features to obtain a second intermediate image feature;
performing attention allocation processing on the second intermediate image feature according to the text features of the whole sentence of the question text, to obtain a fourth attention weight; and
determining weighted image features according to the second intermediate image feature and the fourth attention weight.
12. The method according to claim 8, wherein performing attention allocation processing on the image features according to the text features to obtain attention weights comprises:
mapping the text features to a first standard feature;
mapping the image features to a second standard feature;
performing element-wise multiplication on the first standard feature and the second standard feature to obtain an intermediate feature; and
successively performing pooling processing and regression processing on the intermediate feature to obtain attention weights.
13. The method according to any one of claims 8 to 12, wherein extracting the image features of the input image comprises:
extracting the image features of the input image through a convolutional neural network;
wherein extracting the text features of the question text comprises:
extracting the text features of the question text through a recurrent neural network;
and wherein performing classification according to the weighted image features to obtain the answer text corresponding to the question text comprises:
inputting the weighted image features into a machine-learning classifier for classification to obtain the answer text corresponding to the question text.
14. An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image;
an extraction module, configured to extract image features of the input image through a first model;
a determining module, configured to determine, through the first model and according to the image features, class label text corresponding to the input image;
a fusion module, configured to perform cross-modal fusion on the image features and the corresponding class label text to obtain comprehensive features; and
an output module, configured to process the comprehensive features through a second model to output image description text of the input image.
15. An image processing apparatus, comprising:
an obtaining module, configured to obtain an input image and a question text corresponding to the input image;
an extraction module, configured to extract image features of the input image, the extraction module being further configured to extract text features of the question text;
an attention allocation processing module, configured to perform attention allocation processing on the image features according to the text features to obtain attention weights;
a determining module, configured to determine weighted image features according to the image features and the attention weights; and
a classification module, configured to perform classification according to the weighted image features to obtain an answer text corresponding to the question text.
16. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 13.
17. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810758796.5A CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810758796.5A CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109002852A true CN109002852A (en) | 2018-12-14 |
CN109002852B CN109002852B (en) | 2023-05-23 |
Family
ID=64598961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810758796.5A Active CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002852B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635150A (en) * | 2018-12-19 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Document creation method, device and storage medium |
CN109740515A (en) * | 2018-12-29 | 2019-05-10 | 科大讯飞股份有限公司 | One kind reading and appraising method and device |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
CN109858499A (en) * | 2019-01-23 | 2019-06-07 | 哈尔滨理工大学 | A kind of tank armor object detection method based on Faster R-CNN |
CN109886309A (en) * | 2019-01-25 | 2019-06-14 | 成都浩天联讯信息技术有限公司 | A method of digital picture identity is forged in identification |
CN109947977A (en) * | 2019-03-13 | 2019-06-28 | 广东小天才科技有限公司 | A kind of intension recognizing method and device, terminal device of combination image |
CN110110772A (en) * | 2019-04-25 | 2019-08-09 | 北京小米智能科技有限公司 | Determine the method, apparatus and computer readable storage medium of image tag accuracy |
CN110135441A (en) * | 2019-05-17 | 2019-08-16 | 北京邮电大学 | A kind of text of image describes method and device |
CN110689052A (en) * | 2019-09-06 | 2020-01-14 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110717514A (en) * | 2019-09-06 | 2020-01-21 | 平安国际智慧城市科技股份有限公司 | Session intention identification method and device, computer equipment and storage medium |
CN111209961A (en) * | 2020-01-03 | 2020-05-29 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111563551A (en) * | 2020-04-30 | 2020-08-21 | 支付宝(杭州)信息技术有限公司 | Multi-mode information fusion method and device and electronic equipment |
CN111611420A (en) * | 2020-05-26 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating image description information |
WO2020173329A1 (en) * | 2019-02-26 | 2020-09-03 | 腾讯科技(深圳)有限公司 | Image fusion method, model training method, and related device |
CN111669587A (en) * | 2020-04-17 | 2020-09-15 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
WO2020182112A1 (en) * | 2019-03-13 | 2020-09-17 | 腾讯科技(深圳)有限公司 | Image region positioning method, model training method, and related apparatus |
CN111767727A (en) * | 2020-06-24 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106777185A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of across media Chinese herbal medicine image search methods based on deep learning |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107766349A (en) * | 2016-08-16 | 2018-03-06 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus, equipment and client for generating text |
CN107979764A (en) * | 2017-12-06 | 2018-05-01 | 中国石油大学(华东) | Video caption generation method based on semantic segmentation and multilayer notice frame |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107766349A (en) * | 2016-08-16 | 2018-03-06 | 阿里巴巴集团控股有限公司 | A kind of method, apparatus, equipment and client for generating text |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106777185A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of across media Chinese herbal medicine image search methods based on deep learning |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107979764A (en) * | 2017-12-06 | 2018-05-01 | 中国石油大学(华东) | Video caption generation method based on semantic segmentation and multilayer notice frame |
Non-Patent Citations (4)
Title |
---|
ANDREJ KARPATHY ET AL.: "Deep Visual-Semantic Alignments for Generating Image Descriptions", 《IEEE》 *
CAO LIUBIN ET AL.: "Image description method based on continuous Skip-gram and deep learning", 《Journal of Test and Measurement Technology》 *
XIE JINBAO ET AL.: "Multi-feature fusion Chinese text classification based on a semantic-understanding attention neural network", 《Journal of Electronics & Information Technology》 *
MA LONGLONG ET AL.: "A survey of text description methods for images", 《Journal of Chinese Information Processing》 *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635150A (en) * | 2018-12-19 | 2019-04-16 | 腾讯科技(深圳)有限公司 | Document creation method, device and storage medium |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
CN109740515A (en) * | 2018-12-29 | 2019-05-10 | 科大讯飞股份有限公司 | One kind reading and appraising method and device |
CN109858499A (en) * | 2019-01-23 | 2019-06-07 | 哈尔滨理工大学 | A kind of tank armor object detection method based on Faster R-CNN |
CN109886309A (en) * | 2019-01-25 | 2019-06-14 | 成都浩天联讯信息技术有限公司 | A method of digital picture identity is forged in identification |
TWI725746B (en) * | 2019-02-26 | 2021-04-21 | 大陸商騰訊科技(深圳)有限公司 | Image fusion method, model training method, and related device |
WO2020173329A1 (en) * | 2019-02-26 | 2020-09-03 | 腾讯科技(深圳)有限公司 | Image fusion method, model training method, and related device |
US11776097B2 (en) | 2019-02-26 | 2023-10-03 | Tencent Technology (Shenzhen) Company Limited | Image fusion method, model training method, and related apparatuses |
CN109947977A (en) * | 2019-03-13 | 2019-06-28 | 广东小天才科技有限公司 | A kind of intension recognizing method and device, terminal device of combination image |
WO2020182112A1 (en) * | 2019-03-13 | 2020-09-17 | 腾讯科技(深圳)有限公司 | Image region positioning method, model training method, and related apparatus |
CN110110772A (en) * | 2019-04-25 | 2019-08-09 | 北京小米智能科技有限公司 | Determine the method, apparatus and computer readable storage medium of image tag accuracy |
CN110135441B (en) * | 2019-05-17 | 2020-03-03 | 北京邮电大学 | Text description method and device for image |
CN110135441A (en) * | 2019-05-17 | 2019-08-16 | 北京邮电大学 | A kind of text of image describes method and device |
CN110689052B (en) * | 2019-09-06 | 2022-03-11 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110689052A (en) * | 2019-09-06 | 2020-01-14 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110717514A (en) * | 2019-09-06 | 2020-01-21 | 平安国际智慧城市科技股份有限公司 | Session intention identification method and device, computer equipment and storage medium |
CN111209961B (en) * | 2020-01-03 | 2020-10-09 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111209961A (en) * | 2020-01-03 | 2020-05-29 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111967487A (en) * | 2020-03-23 | 2020-11-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111967487B (en) * | 2020-03-23 | 2022-09-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111669587A (en) * | 2020-04-17 | 2020-09-15 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
CN111669587B (en) * | 2020-04-17 | 2021-07-20 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
CN111563551A (en) * | 2020-04-30 | 2020-08-21 | 支付宝(杭州)信息技术有限公司 | Multi-mode information fusion method and device and electronic equipment |
CN111611420A (en) * | 2020-05-26 | 2020-09-01 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating image description information |
CN111611420B (en) * | 2020-05-26 | 2024-01-23 | 北京字节跳动网络技术有限公司 | Method and device for generating image description information |
CN111767727A (en) * | 2020-06-24 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN111767727B (en) * | 2020-06-24 | 2024-02-06 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN112016493A (en) * | 2020-09-03 | 2020-12-01 | 科大讯飞股份有限公司 | Image description method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109002852B (en) | 2023-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109002852A (en) | Image processing method, device, computer readable storage medium and computer equipment | |
Anderson et al. | Bottom-up and top-down attention for image captioning and visual question answering | |
Chen et al. | Spatial memory for context reasoning in object detection | |
Arevalo et al. | Gated multimodal networks | |
CN111859912B (en) | PCNN model-based remote supervision relationship extraction method with entity perception | |
CN112084331A (en) | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium | |
Zhou et al. | A real-time global inference network for one-stage referring expression comprehension | |
CN110866140A (en) | Image feature extraction model training method, image searching method and computer equipment | |
Ding et al. | Deep interactive image matting with feature propagation | |
CN114926835A (en) | Text generation method and device, and model training method and device | |
CN115858847B (en) | Combined query image retrieval method based on cross-modal attention reservation | |
CN115408517A (en) | Knowledge injection-based multi-modal irony recognition method of double-attention network | |
CN111915618A (en) | Example segmentation algorithm and computing device based on peak response enhancement | |
Połap | Hybrid image analysis model for hashtag recommendation through the use of deep learning methods | |
Thangavel et al. | A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models | |
CN113989405A (en) | Image generation method based on small sample continuous learning | |
Kumar et al. | Region driven remote sensing image captioning | |
CN111563161B (en) | Statement identification method, statement identification device and intelligent equipment | |
CN113159053A (en) | Image recognition method and device and computing equipment | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis | |
Cui et al. | Multi-scale interpretation model for convolutional neural networks: Building trust based on hierarchical interpretation | |
CN114443916B (en) | Supply and demand matching method and system for test data | |
CN112287159B (en) | Retrieval method, electronic device and computer readable medium | |
CN113779244B (en) | Document emotion classification method and device, storage medium and electronic equipment | |
CN115098646A (en) | Multilevel relation analysis and mining method for image-text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||