CN107194407A - Method and apparatus for image understanding - Google Patents

Method and apparatus for image understanding

Info

Publication number
CN107194407A
CN107194407A (application CN201710353016.4A)
Authority
CN
China
Prior art keywords
network model
sample
neural network
keyword
output value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710353016.4A
Other languages
Chinese (zh)
Other versions
CN107194407B (en)
Inventor
祁斌川
潘照明
戴朝约
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN201710353016.4A priority Critical patent/CN107194407B/en
Publication of CN107194407A publication Critical patent/CN107194407A/en
Application granted granted Critical
Publication of CN107194407B publication Critical patent/CN107194407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • G06V10/424Syntactic representation, e.g. by using alphabets or grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

In an embodiment of the present invention, a method of image understanding is proposed, including: processing an image to be understood using a first convolutional neural network model to obtain a first output value; processing the first output value to obtain a second output value; performing a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword, and using the target keyword as a description of the image to be understood. In this scheme, the image to be understood is interpreted using convolutional neural network models rather than by a classification method, so the accuracy of image understanding can be improved and the consumption of human resources can be reduced.

Description

Method and apparatus for image understanding
Technical field
Embodiments of the present invention relate to the field of image understanding and, more specifically, to a method and apparatus for image understanding.
Background technology
This section is intended to provide background or context for the embodiments of the present invention recited in the claims. The description herein is not admitted to be prior art merely by its inclusion in this section.
At present, image understanding is typically implemented by training a classifier within a manually defined set of classes, in order to determine whether an image contains a specific object. Currently, image classification methods are concentrated on deep networks.
A typical example is ImageNet, which is at present the largest manually annotated image recognition database in the world. Many well-known deep learning networks, such as AlexNet, GoogleNet and VGGNet, are trained on the 1000 categories of this image database and, compared with traditional classifiers, have achieved breakthrough progress in classification accuracy.
However, the above is merely an image classification method, and classification only assigns an image to one of a limited set of categories. Image understanding is built on a sufficient knowledge base and describes the deeper meaning of an image; if an image classification method is used to understand an image, the accuracy is relatively low.
Meanwhile, existing image classification has the following defects, for example:
1. The quality of a machine learning result depends heavily on the quality and quantity of the data, which means that data preparation requires a large amount of manual annotation work. The quality of the annotation, even within a limited, predefined set of categories, varies from person to person and changes with different annotators. A typical example is face annotation: when different people judge whether the persons shown in two pictures are the same person, there are often certain differences;
2. Image understanding is converted into a classification problem over a specific, limited set of categories, so the trained model is only applicable to classification within that limited set. For example, a face recognition model can hardly recognize a gorilla's face, even though the two are alike. Therefore, different network models must be trained for different application scenarios, which consumes considerable human resources;
3. A network model trained on a limited data set is often only effective on data of that type. Although a certain amount of transfer learning can be performed, the training set essentially determines the scope of use of the network model.
The content of the invention
Therefore, the prior art has the defects of relatively low accuracy and large consumption of human resources, which makes the process very cumbersome.
An improved image understanding method is therefore highly desirable, so as to overcome the defects of relatively low accuracy and large consumption of human resources in the prior art.
In this context, embodiments of the present invention are expected to provide a new method and apparatus for image understanding.
In a first aspect of the embodiments of the present invention, a method of image understanding is provided, including:
processing an image to be understood using a first convolutional neural network model to obtain a first output value;
processing the first output value to obtain a second output value;
performing a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword, and using the target keyword as a description of the image to be understood.
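To make the flow of these three steps concrete, the following is a minimal Python sketch. The two small networks, the 256-dimensional feature space, the form of the "second output value" and the way the deconvolution is realized are illustrative assumptions, not the patented configuration.

```python
# Sketch of the claimed flow: image -> first CNN -> first output value ->
# second output value -> deconvolution against the second (word) model -> keyword.
import torch
import torch.nn as nn

first_cnn = nn.Sequential(                      # first convolutional neural network model
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 256))
word_fc = nn.Linear(1000, 256)                  # final layer of the second (word) model

def understand(image, keywords):
    first_output = first_cnn(image)             # step 1: process the image to be understood
    second_output = torch.tanh(first_output)    # step 2: further processing (assumed form)
    scores = second_output @ word_fc.weight     # step 3: run the word layer "backwards"
    return keywords[int(scores[0].argmax())]    # target keyword = description of the image
```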
In one embodiment, according to the method of the above embodiment of the present invention, before the image to be understood is processed using the first convolutional neural network model, the method further includes:
crawling sample images and sample text corresponding to each sample image from training web pages;
training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text.
In some embodiments, according to the method of any of the above embodiments of the present invention, training the first convolutional neural network model according to the crawled sample images and sample text includes:
processing the 2nd sample image using the first convolutional neural network model containing the weight values optimized once, to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 2nd sample image;
calculating the error value between the sample image output value and the target sample text output value using a loss function;
optimizing the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
returning to the step of processing the 3rd sample image, using the first convolutional neural network model containing the weight values optimized twice, to obtain a sample image output value, until the weight values have been optimized N times.
In some embodiments, according to the method of any of the above embodiments of the present invention, the weight values optimized once are obtained as follows:
processing the 1st sample image using the first convolutional neural network model containing preset initial weight values, to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 1st sample image;
calculating the error value between the sample image output value and the target sample text output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
In some embodiments, according to the method of any of the above embodiments of the present invention, training the second convolutional neural network model according to the crawled sample images and sample text includes:
processing the 2nd sample text using the second convolutional neural network model containing the weight values optimized once, to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 2nd sample text;
calculating the error value between the sample text output value and the target sample image output value using a loss function;
optimizing the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
returning to the step of processing the 3rd sample text, using the second convolutional neural network model containing the weight values optimized twice, to obtain a sample text output value, until the weight values have been optimized N times.
In some embodiments, according to the method of any of the above embodiments of the present invention, the weight values optimized once are obtained as follows:
processing the 1st sample text using the second convolutional neural network model containing preset initial weight values, to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 1st sample text;
calculating the error value between the sample text output value and the target sample image output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
In some embodiments, according to the method of any of the above embodiments of the present invention, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
performing a scale transformation on the multiple crawled sample images so that each sample image reaches the same dimensions.
In some embodiments, according to the method of any of the above embodiments of the present invention, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
extracting keywords and symbols from the crawled sample text, and encoding the extracted keywords and symbols.
In some embodiments, according to the method of any of the above embodiments of the present invention, extracting keywords from the crawled sample text includes:
extracting keywords from the crawled sample text using term frequency-inverse document frequency (TF-IDF);
filtering keywords out of the extracted keywords, and normalizing the weight values of the filtered keywords.
In some embodiments, according to the method of any of the above embodiments of the present invention, filtering keywords out of the extracted keywords includes:
determining the keyword weight values of the extracted keywords, sorting the keyword weight values in ascending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the filtered keywords; or
determining the keyword weight values of the extracted keywords, sorting the keyword weight values in descending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the filtered keywords.
In some embodiments, according to the method of any of the above embodiments of the present invention, training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text includes:
training the first convolutional neural network model and the second convolutional neural network model according to the scale-transformed sample images or the encoded sample text.
In some embodiments, according to the method of any of the above embodiments of the present invention, performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword includes:
performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
taking the keyword with the highest weight value among the multiple keywords as the target keyword.
In a second aspect of the embodiments of the present invention, an apparatus for image understanding is provided, including:
a processing unit, configured to process an image to be understood using a first convolutional neural network model to obtain a first output value;
the processing unit is further configured to process the first output value to obtain a second output value;
an operation unit, configured to perform a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword;
a determining unit, configured to use the target keyword as a description of the image to be understood.
In one embodiment, according to the apparatus of the above embodiment of the present invention, the apparatus further includes a crawling unit and a training unit, wherein:
the crawling unit is configured to crawl sample images and sample text corresponding to each sample image from training web pages;
the training unit is configured to train the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the training unit includes a first image output value determining unit, a first text output value determining unit, a first error value calculating unit and a first optimizing unit, wherein:
the first image output value determining unit is configured to process the 2nd sample image using the first convolutional neural network model containing the weight values optimized once, to obtain a sample image output value;
the first text output value determining unit is configured to determine the target sample text output value of the sample text corresponding to the 2nd sample image;
the first error value calculating unit is configured to calculate the error value between the sample image output value and the target sample text output value using a loss function;
the first optimizing unit is configured to optimize the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
the first image output value determining unit is further configured to return to the step of processing the 3rd sample image, using the first convolutional neural network model containing the weight values optimized twice, to obtain a sample image output value, until the weight values have been optimized N times.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the weight values optimized once are obtained as follows:
processing the 1st sample image using the first convolutional neural network model containing preset initial weight values, to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 1st sample image;
calculating the error value between the sample image output value and the target sample text output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the training unit includes a second text output value determining unit, a second image output value determining unit, a second error value calculating unit and a second optimizing unit, wherein:
the second text output value determining unit is configured to process the 2nd sample text using the second convolutional neural network model containing the weight values optimized once, to obtain a sample text output value;
the second image output value determining unit is configured to determine the target sample image output value of the sample image corresponding to the 2nd sample text;
the second error value calculating unit is configured to calculate the error value between the sample text output value and the target sample image output value using a loss function;
the second optimizing unit is configured to optimize the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
the second text output value determining unit is further configured to return to the step of processing the 3rd sample text, using the second convolutional neural network model containing the weight values optimized twice, to obtain a sample text output value, until the weight values have been optimized N times.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the weight values optimized once are obtained as follows:
processing the 1st sample text using the second convolutional neural network model containing preset initial weight values, to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 1st sample text;
calculating the error value between the sample text output value and the target sample image output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the apparatus further includes a scale transformation unit, configured to perform a scale transformation on the multiple crawled sample images so that each sample image reaches the same dimensions.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the apparatus further includes an extraction unit and an encoding unit, wherein:
the extraction unit is configured to extract keywords and symbols from the crawled sample text;
the encoding unit is configured to encode the extracted keywords and symbols.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when extracting keywords from the crawled sample text, the extraction unit is specifically configured to:
extract keywords from the crawled sample text using term frequency-inverse document frequency (TF-IDF);
filter keywords out of the extracted keywords, and normalize the weight values of the filtered keywords.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when filtering keywords out of the extracted keywords, the extraction unit is specifically configured to:
determine the keyword weight values of the extracted keywords, sort the keyword weight values in ascending order, and take the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the filtered keywords; or
determine the keyword weight values of the extracted keywords, sort the keyword weight values in descending order, and take the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the filtered keywords.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the training unit is specifically configured to:
train the first convolutional neural network model and the second convolutional neural network model according to the scale-transformed sample images or the encoded sample text.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword, the operation unit is specifically configured to:
perform a deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
take the keyword with the highest weight value among the multiple keywords as the target keyword.
In a third aspect of the embodiments of the present invention, a method of image understanding is provided, including:
crawling sample images and sample text corresponding to each sample image from training web pages;
training a first convolutional neural network model and a second convolutional neural network model according to the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and the second convolutional neural network model;
processing an image to be understood according to the trained first convolutional neural network model and second convolutional neural network model, to obtain a target keyword;
using the target keyword as a description of the image to be understood.
In one embodiment, according to the method of the above embodiment of the present invention, training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text includes: calculating, using a loss function, the error value between the sample image output value of the first convolutional neural network model and the sample text output value of the second convolutional neural network model, and minimizing the error value so as to optimize the weight values of the first convolutional neural network model and the second convolutional neural network model.
In some embodiments, according to the method of any of the above embodiments of the present invention, training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text includes:
processing the 2nd sample image using the first convolutional neural network model containing the weight values optimized once, to obtain a sample image output value;
processing the 2nd sample text using the second convolutional neural network model containing the weight values optimized once, to obtain a sample text output value;
calculating the error value between the sample image output value and the sample text output value using a loss function;
optimizing the once-optimized weight values of the first convolutional neural network model and of the second convolutional neural network model again according to the error value, to obtain the twice-optimized weight values of the first convolutional neural network model and of the second convolutional neural network model;
repeating the above steps until the weight values of the first convolutional neural network model and of the second convolutional neural network model have been optimized N times.
In some embodiments, according to the method of any of the above embodiments of the present invention, the once-optimized weight values of the first convolutional neural network model are obtained as follows:
processing the 1st sample image using the first convolutional neural network model containing preset initial weight values to obtain a sample image output value, and processing the 1st sample text using the second convolutional neural network model containing preset initial weight values to obtain a sample text output value;
calculating the error value between the sample image output value and the sample text output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the once-optimized weight values of the first convolutional neural network model.
In some embodiments, according to the method of any of the above embodiments of the present invention, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
performing a scale transformation on the multiple crawled sample images so that each sample image reaches the same dimensions.
In some embodiments, according to the method of any of the above embodiments of the present invention, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
extracting keywords and symbols from the crawled sample text, and encoding the extracted keywords and symbols.
In some embodiments, according to the method of any of the above embodiments of the present invention, extracting keywords from the crawled sample text includes:
extracting keywords from the crawled sample text using term frequency-inverse document frequency (TF-IDF);
filtering keywords out of the extracted keywords, and normalizing the weight values of the filtered keywords.
In some embodiments, according to the method of any of the above embodiments of the present invention, filtering keywords out of the extracted keywords includes:
determining the keyword weight values of the extracted keywords, sorting the keyword weight values in ascending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the filtered keywords; or
determining the keyword weight values of the extracted keywords, sorting the keyword weight values in descending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the filtered keywords.
In some embodiments, according to the method of any of the above embodiments of the present invention, training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text includes:
training the first convolutional neural network model and the second convolutional neural network model according to the scale-transformed sample images or the encoded sample text.
In some embodiments, according to the method of any of the above embodiments of the present invention, processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain a target keyword includes:
processing the image to be understood using the first convolutional neural network model to obtain a first output value;
processing the first output value to obtain a second output value;
performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword.
In some embodiments, according to the method of any of the above embodiments of the present invention, performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword includes:
performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
taking the keyword with the highest weight value among the multiple keywords as the target keyword.
In a fourth aspect of the embodiments of the present invention, an apparatus for image understanding is provided, including:
a crawling unit, configured to crawl sample images and sample text corresponding to each sample image from training web pages;
a training unit, configured to train a first convolutional neural network model and a second convolutional neural network model according to the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and the second convolutional neural network model;
a processing unit, configured to process an image to be understood according to the trained first convolutional neural network model and second convolutional neural network model, to obtain a target keyword;
a determining unit, configured to use the target keyword as a description of the image to be understood.
In one embodiment, according to the apparatus of the above embodiment of the present invention, when training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the training unit is specifically configured to:
calculate, using a loss function, the error value between the sample image output value of the first convolutional neural network model and the sample text output value of the second convolutional neural network model, and minimize the error value so as to optimize the weight values of the first convolutional neural network model and the second convolutional neural network model.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the training unit includes a first image output value determining unit, a first text output value determining unit, a first error value calculating unit and a first optimizing unit, wherein:
the first image output value determining unit is configured to process the 2nd sample image using the first convolutional neural network model containing the weight values optimized once, to obtain a sample image output value;
the first text output value determining unit is configured to process the 2nd sample text using the second convolutional neural network model containing the weight values optimized once, to obtain a sample text output value;
the first error value calculating unit is configured to calculate the error value between the sample image output value and the sample text output value using a loss function;
the first optimizing unit is configured to optimize the once-optimized weight values of the first convolutional neural network model and of the second convolutional neural network model again according to the error value, to obtain the twice-optimized weight values of the first convolutional neural network model and of the second convolutional neural network model, and to repeat the above steps until the weight values of the first convolutional neural network model and of the second convolutional neural network model have been optimized N times.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the once-optimized weight values of the first convolutional neural network model are obtained as follows:
processing the 1st sample image using the first convolutional neural network model containing preset initial weight values, to obtain a sample image output value;
processing the 1st sample text using the second convolutional neural network model containing preset initial weight values, to obtain a sample text output value;
calculating the error value between the sample image output value and the sample text output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the once-optimized weight values of the first convolutional neural network model.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the apparatus further includes a scale transformation unit, configured to perform a scale transformation on the multiple crawled sample images so that each sample image reaches the same dimensions.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the apparatus further includes an extraction unit and an encoding unit, wherein:
the extraction unit is configured to extract keywords and symbols from the crawled sample text;
the encoding unit is configured to encode the extracted keywords and symbols.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when extracting keywords from the crawled sample text, the extraction unit is specifically configured to:
extract keywords from the crawled sample text using term frequency-inverse document frequency (TF-IDF);
filter keywords out of the extracted keywords, and normalize the weight values of the filtered keywords.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when filtering keywords out of the extracted keywords, the extraction unit is specifically configured to:
determine the keyword weight values of the extracted keywords, sort the keyword weight values in ascending order, and take the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the filtered keywords; or
determine the keyword weight values of the extracted keywords, sort the keyword weight values in descending order, and take the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the filtered keywords.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the training unit is specifically configured to:
train the first convolutional neural network model and the second convolutional neural network model according to the scale-transformed sample images or the encoded sample text.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain a target keyword, the processing unit is specifically configured to:
process the image to be understood using the first convolutional neural network model to obtain a first output value;
process the first output value to obtain a second output value;
perform a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, when performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain a target keyword, the processing unit is specifically configured to:
perform a deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
take the keyword with the highest weight value among the multiple keywords as the target keyword.
In a fifth aspect of the embodiments of the present invention, an apparatus for image understanding is provided, including:
one or more processors;
a memory;
and a program stored in the memory which, when executed by the one or more processors, causes the apparatus to perform the method according to the first aspect or any embodiment of the first aspect, or to perform the method according to the third aspect or any embodiment of the third aspect.
In a sixth aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a program which, when executed by an apparatus for image understanding, causes the apparatus to perform the method according to the first aspect or any embodiment of the first aspect, or to perform the method according to the third aspect or any embodiment of the third aspect.
In the embodiments of the present invention, a method of image understanding is proposed, including: processing an image to be understood using a first convolutional neural network model to obtain a first output value; processing the first output value to obtain a second output value; performing a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword, and using the target keyword as a description of the image to be understood. In this scheme, the image to be understood is interpreted using convolutional neural network models rather than by a classification method, which improves the accuracy of image understanding. In this scheme, the sample images used to train the neural network models come from Internet web pages, so no large amount of manual annotation work is needed; from this angle as well, the accuracy of image understanding can be improved and the consumption of human resources can be reduced. Because the neural network models trained by the present invention can be applied in a wide range of situations and are not limited to a set of predefined categories, the scheme also improves applicability.
Brief description of the drawings
By reading the following detailed description with reference to the accompanying drawings, the above and other objects, features and advantages of the exemplary embodiments of the present invention will become easy to understand. In the accompanying drawings, several embodiments of the present invention are shown by way of example and not by way of limitation, wherein:
Fig. 1 schematically shows a flowchart of image understanding according to an embodiment of the present invention;
Fig. 2 schematically shows another flowchart of image understanding according to an embodiment of the present invention;
Fig. 3 schematically shows a schematic diagram of an image understanding apparatus according to an embodiment of the present invention;
Fig. 4 schematically shows another schematic diagram of an image understanding apparatus according to an embodiment of the present invention;
Fig. 5 schematically shows another schematic diagram of an image understanding apparatus according to an embodiment of the present invention;
Fig. 6 schematically shows another schematic diagram of an image understanding apparatus according to an embodiment of the present invention;
In the accompanying drawings, identical or corresponding reference numerals indicate identical or corresponding parts.
Embodiment
The principle and spirit of the present invention are described below with reference to several exemplary embodiments. It should be understood that these embodiments are provided only to enable those skilled in the art to better understand and thereby implement the present invention, and not to limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that the embodiments of the present invention can be implemented as a system, an apparatus, a device, a method or a computer program product. Therefore, the present disclosure can be implemented in the following forms: complete hardware, complete software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
According to the embodiments of the present invention, a method and an apparatus for image understanding are proposed.
The technical terms involved herein are briefly described below.
Image understanding: Image understanding (IU) is the semantic understanding of images. It takes images as its object and knowledge as its core, and studies what targets are in an image, the relationships between the targets, what scene the image depicts, and how to apply the scene.
Convolutional neural network: A convolutional neural network is a kind of artificial neural network and has become a research hotspot in the fields of speech analysis and image recognition. Its weight-sharing network structure makes it more similar to a biological neural network, reduces the complexity of the network model, and reduces the number of weights.
Deconvolution neural network: Suppose A = B*C; this formula means that the convolution of B and C is A. That is, given B and C, the process of finding A is called convolution. If A and B are known and C is sought, or A and C are known and B is sought, that process is called deconvolution. A deconvolution network corresponds to a convolution network (CNN): in a CNN, a feature map is obtained by convolving the input image with a feature filter, whereas in a deconvolution network, the input image is obtained by convolving the feature map with the feature filter.
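As a concrete illustration of this convolution/deconvolution pairing, the following minimal Python sketch uses PyTorch's `Conv2d` and its transposed counterpart `ConvTranspose2d`; the channel counts and image size are arbitrary assumptions.

```python
# Convolution maps an image to a feature map; the transposed ("de")convolution
# maps the feature map back to image-sized space (illustrative only).
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)        # image -> feature map
deconv = nn.ConvTranspose2d(8, 3, kernel_size=3, stride=2,
                            padding=1, output_padding=1)          # feature map -> image size

image = torch.randn(1, 3, 32, 32)      # "input image"
feature_map = conv(image)              # shape (1, 8, 16, 16)
reconstructed = deconv(feature_map)    # back to shape (1, 3, 32, 32)
print(feature_map.shape, reconstructed.shape)
```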
One-Hot: One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states. Each state has its own independent register bit, and at any time only one of the bits is effective.
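A minimal sketch of one-hot encoding a small keyword vocabulary, as it might be used to encode the extracted keywords; the vocabulary and helper function are illustrative assumptions.

```python
# One-hot encode a keyword against a fixed vocabulary (illustrative only).
def one_hot(keyword, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(keyword)] = 1   # exactly one bit is effective
    return vec

vocab = ["cat", "dog", "tree", "car"]
print(one_hot("tree", vocab))       # [0, 0, 1, 0]
```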
In addition, any number of elements in the accompanying drawings is for illustration and not limitation, and any names are used only for distinction and carry no limiting meaning.
The present invention is explained in detail below with reference to the principles and spirit of several representative embodiments of the present invention.
Summary of the invention
The inventors have found that a first convolutional neural network model corresponding to images and a second convolutional neural network model corresponding to text can be trained using a large picture library (the image and text of each picture in the library are known), so that the value obtained after the image of a certain picture is processed by the first convolutional neural network model is almost identical to the value obtained after the text of that picture is processed by the second convolutional neural network model. Then, when a picture is taken, the text of the picture can be obtained according to the first convolutional neural network model and the second convolutional neural network model.
Having described the general principle of the present invention, various non-limiting embodiments of the present invention are specifically introduced below.
Application scenarios overview
For example, 10000 pictures are used to train the first convolutional neural network model and the second convolutional neural network model, so that the value obtained after the image in a certain picture is processed by the first convolutional neural network model is identical to the value obtained after the text in that picture is processed by the second convolutional neural network model, or so that the error value between them converges under a predefined function. For example, the value obtained for the image in picture 1 after the first convolutional neural network model is identical to the value obtained for the text in picture 1 after the second convolutional neural network model; the value obtained for the image in picture 2 after the first convolutional neural network model is identical to the value obtained for the text in picture 2 after the second convolutional neural network model; and, for each of the 10000 pictures, the value obtained for the image after the first convolutional neural network model is identical to the value obtained for the text in that picture after the second convolutional neural network model. In this way, when a picture is taken and only its image is known, the value obtained after the first convolutional neural network model can be used to perform a deconvolution calculation on the second convolutional neural network model, thereby obtaining the text of the picture.
Illustrative methods
With reference to the application scenario above, a method for image understanding according to an exemplary embodiment of the present invention is described with reference to Fig. 1. It should be noted that the above application scenario is shown only for the convenience of understanding the spirit and principle of the present invention, and the embodiments of the present invention are not limited in this respect. On the contrary, the embodiments of the present invention can be applied to any applicable scenario.
As shown in Fig. 1, in an embodiment of the present invention, a method 10 of image understanding is proposed, including:
Step 100: processing an image to be understood using a first convolutional neural network model to obtain a first output value;
Step 110: processing the first output value to obtain a second output value;
Step 120: performing a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword, and using the target keyword as a description of the image to be understood.
In order to improve the accuracy of the obtained keyword, further, before the image to be understood is processed using the first convolutional neural network model, the method further includes:
crawling sample images and sample text corresponding to each sample image from training web pages;
training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text.
In the embodiments of the present invention, when training the first convolutional neural network model according to the crawled sample images and sample text, the following manner can optionally be used:
processing the 2nd sample image using the first convolutional neural network model containing the weight values optimized once, to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 2nd sample image;
calculating the error value between the sample image output value and the target sample text output value using a loss function;
optimizing the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
returning to the step of processing the 3rd sample image, using the first convolutional neural network model containing the weight values optimized twice, to obtain a sample image output value, until the weight values have been optimized N times.
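A minimal sketch of the iterative weight optimization just described, assuming gradient descent as the "optimization" step and a mean-squared-error loss; the choice of optimizer, learning rate and loss are illustrative assumptions, not the patented training procedure.

```python
# Iteratively optimize the weights of the first (image-side) model so that its
# output approaches the target sample-text output value (illustrative sketch).
import torch
import torch.nn as nn

def train_first_model(model, samples, n_iterations, lr=1e-3):
    """samples: list of (sample_image, target_sample_text_output_value) tensor pairs."""
    loss_fn = nn.MSELoss()                                     # "loss function"
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for i in range(n_iterations):                              # weights optimized N times
        image, target = samples[i % len(samples)]
        output = model(image)                                  # sample image output value
        error = loss_fn(output, target)                        # error value
        optimizer.zero_grad()
        error.backward()
        optimizer.step()                                       # i-th optimization of the weights
    return model
```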
The above describes the process of optimizing the weight values of the first convolutional neural network model from the 2nd optimization to the Nth optimization, but that optimization is performed on the basis of the 1st optimization. Therefore, the embodiments of the present invention also propose the process of optimizing the weight values for the 1st time. Optionally, the weight values optimized once are obtained as follows:
processing the 1st sample image using the first convolutional neural network model containing preset initial weight values, to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 1st sample image;
calculating the error value between the sample image output value and the target sample text output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
The above describes the method of training the first convolutional neural network model. The embodiments of the present invention also propose a method of training the second convolutional neural network model. When training the second convolutional neural network model according to the crawled sample images and sample text, the following manner can optionally be used:
processing the 2nd sample text using the second convolutional neural network model containing the weight values optimized once, to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 2nd sample text;
calculating the error value between the sample text output value and the target sample image output value using a loss function;
optimizing the once-optimized weight values again according to the error value, to obtain weight values optimized twice;
returning to the step of processing the 3rd sample text, using the second convolutional neural network model containing the weight values optimized twice, to obtain a sample text output value, until the weight values have been optimized N times.
The above describes the process of optimizing the weight values of the second convolutional neural network model from the 2nd optimization to the Nth optimization, but that optimization is performed on the basis of the 1st optimization. Therefore, the embodiments of the present invention also propose the method of optimizing the second convolutional neural network model for the 1st time; the weight values optimized once are obtained as follows:
processing the 1st sample text using the second convolutional neural network model containing preset initial weight values, to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 1st sample text;
calculating the error value between the sample text output value and the target sample image output value using a loss function;
optimizing the preset initial weight values according to the error value, to obtain the weight values optimized once.
In order to improve the accuracy of the training result, further, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
performing a scale transformation on the multiple crawled sample images so that each sample image reaches the same dimensions. Optionally, the dimensions may refer to height, width and RGB format.
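A minimal sketch of such a scale transformation, assuming the Pillow library and a fixed target size; the 224×224 resolution and the RGB conversion are illustrative choices, not values given in this disclosure.

```python
# Bring every crawled sample image to the same height, width and RGB format
# before training (illustrative preprocessing step).
from PIL import Image

def to_same_dimensions(paths, size=(224, 224)):
    images = []
    for path in paths:
        img = Image.open(path).convert("RGB")   # unify the colour format
        images.append(img.resize(size))         # unify height and width
    return images
```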
In the embodiments of the present invention, in order to improve training efficiency, further, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model according to the crawled sample images and sample text, the method further includes:
extracting keywords and symbols from the crawled sample text, and encoding the extracted keywords and symbols.
When encoding, optionally, One-Hot encoding can be used.
Since the amount of text associated with a picture may be very large — for the use of text, assume the vocabulary of the text corresponding to each picture is on the order of 1000 — in order to compress the number of training samples, when extracting keywords from the crawled sample text, the following manner can optionally be used:
extracting keywords from the crawled sample text using TF-IDF (term frequency-inverse document frequency);
filtering keywords out of the extracted keywords, and normalizing the weight values of the filtered keywords.
Wherein, alternatively, keyword is filtered out in the keyword obtained from extraction, including:
It is determined that extracting the keyword weight value of obtained keyword, the keyword weight value is subjected to ascending sort, and It regard the corresponding keyword of continuous multiple keyword weight values since the first as the keyword filtered out;Or
It is determined that extracting the keyword weight value of obtained keyword, the keyword weight value is subjected to descending sequence sequence, And it regard the corresponding keyword of continuous multiple keyword weight values since last position as the keyword filtered out.
For example, weighted value is sorted, the weighted value of preceding 20 keywords after sequence is normalized.
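A rough sketch of this TF-IDF based selection using scikit-learn, assuming that library is available and that the "keep the top 20 by weight, then normalize" variant is the one intended; both the library choice and the normalization by sum are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def select_keywords(documents, doc_index, top_k=20):
    """Extract TF-IDF keywords for one document, keep the top_k by weight,
    and normalize the kept weights so that they sum to 1 (assumes at least
    one nonzero weight)."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)        # documents x vocabulary matrix
    weights = tfidf[doc_index].toarray().ravel()
    vocab = np.array(vectorizer.get_feature_names_out())

    order = np.argsort(weights)[::-1][:top_k]          # descending sort, keep first 20
    kept_words = vocab[order]
    kept_weights = weights[order]
    kept_weights = kept_weights / kept_weights.sum()   # normalize the selected weights
    return list(zip(kept_words, kept_weights))
```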
In the embodiment of the present invention, when training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, optionally, the following manner may be used:
Training the first convolutional neural network model and the second convolutional neural network model with the scale-converted sample images or the encoded sample text.
In the embodiment of the present invention, when performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword, optionally, the following manner may be used:
Performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
Taking the keyword with the highest weight value among the multiple keywords as the target keyword, as in the sketch below.
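A tiny sketch of that final choice, assuming the deconvolution step has already yielded a weight per candidate keyword (the deconvolution itself is not reproduced here):

```python
def pick_target_keyword(keyword_weights):
    """keyword_weights: dict mapping each candidate keyword to its weight value.
    Returns the keyword with the highest weight as the target keyword."""
    return max(keyword_weights, key=keyword_weights.get)

# Example: the weights below are invented purely for illustration
print(pick_target_keyword({"dog": 0.61, "beach": 0.27, "sunset": 0.12}))  # "dog"
```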
For example, the first convolutional neural network model contains 5 groups of convolutions, 2 fully connected image-feature layers and 1 fully connected classification-feature layer denoted image_fc, and the second convolutional neural network model uses 1 convolution and 1 fully connected classification-feature layer denoted word_fc. To ensure that the fully connected classification-feature output value of an image and that of its corresponding text are identical, the loss function may take the form shown below, where f denotes the network applied to the picture and to the text, respectively.
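A plausible form of that loss, assuming it simply penalizes the squared distance between the image_fc output and the word_fc output (the exact formula is not given in the text above, so this is a reconstruction, not the patent's stated expression):

```latex
L(x, t) = \bigl\| f_{\mathrm{image\_fc}}(x) - f_{\mathrm{word\_fc}}(t) \bigr\|_2^2
```

Here $x$ is a sample image, $t$ is its corresponding sample text, and $f_{\mathrm{image\_fc}}$, $f_{\mathrm{word\_fc}}$ denote the first and second convolutional neural network models evaluated at their image_fc and word_fc layers.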
Referring to Fig. 2, the embodiment of the present invention proposes a method 20 of image understanding, including:
Step 200: crawling sample images and the sample text corresponding to each sample image from training web pages;
Step 210: training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model;
Step 220: processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain the target keyword;
Step 230: taking the target keyword as the description of the image to be understood.
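As a pure orchestration sketch of steps 200 through 230, the helpers below (crawl_training_pages, train_models, describe_image) are illustrative stubs only; the real operations are the ones detailed throughout this description:

```python
from typing import List, Tuple

def crawl_training_pages(urls: List[str]) -> List[Tuple[str, str]]:
    """Step 200 (stub): return (sample_image_path, sample_text) pairs."""
    return [("img_%d.jpg" % i, "sample text for page %d" % i) for i, _ in enumerate(urls)]

def train_models(samples):
    """Step 210 (stub): jointly train the first and second convolutional models
    and return them with optimized weight values."""
    return "first_cnn(trained)", "second_cnn(trained)"

def describe_image(first_cnn, second_cnn, image_path: str) -> str:
    """Steps 220-230 (stub): run the image through the trained models and return
    the target keyword used as its description."""
    return "target_keyword"

def run_method_20(training_urls: List[str], image_to_understand: str) -> str:
    samples = crawl_training_pages(training_urls)                       # step 200
    first_cnn, second_cnn = train_models(samples)                       # step 210
    return describe_image(first_cnn, second_cnn, image_to_understand)   # steps 220-230

print(run_method_20(["http://example.com/page1"], "photo.jpg"))
```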
In the embodiment of the present invention, when training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, optionally, the following manner may be used:
Calculating, with a loss function, the error value between the sample image output value of the first convolutional neural network model and the sample text output value of the second convolutional neural network model, and minimizing that error value so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model.
In the embodiment of the present invention, when training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, optionally, the following manner may also be used:
Processing the 2nd sample image with the first convolutional neural network model containing the weight values after the 1st optimization to obtain a sample image output value;
Processing the 2nd sample text with the second convolutional neural network model containing the weight values after the 1st optimization to obtain a sample text output value;
Calculating, with the loss function, the error value between the sample image output value and the sample text output value;
And optimizing again, according to that error value, the weight values of the first convolutional neural network model after the 1st optimization and the weight values of the second convolutional neural network model, to obtain the weight values of the first convolutional neural network model and of the second convolutional neural network model after the 2nd optimization;
Repeating the above steps until the weight values of the first convolutional neural network model and the weight values of the second convolutional neural network model have been optimized N times; a sketch of this joint iteration is given below.
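A minimal sketch of this joint iterative optimization in PyTorch, under heavy simplifying assumptions: the two toy networks merely stand in for the first (image) and second (text) convolutional models, the data are random tensors, and the squared-error loss follows the reconstruction suggested earlier rather than a formula fixed by the patent.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small image branch and a small text branch whose final
# fully connected layers play the roles of image_fc and word_fc.
image_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 16))
word_net = nn.Sequential(nn.Linear(1000, 16))   # 1000-dim encoded keyword vector in

optimizer = torch.optim.SGD(list(image_net.parameters()) + list(word_net.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

images = torch.randn(4, 3, 224, 224)   # stand-in for scale-converted sample images
texts = torch.randn(4, 1000)           # stand-in for encoded sample text

N = 5  # number of optimization rounds
for step in range(N):
    image_out = image_net(images)        # sample image output value (image_fc)
    word_out = word_net(texts)           # sample text output value (word_fc)
    loss = loss_fn(image_out, word_out)  # error value between the two output values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # optimize both models' weight values again
```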
In the embodiment of the present invention, optionally, the weight values of the first convolutional neural network model after the 1st optimization are obtained as follows:
Processing the 1st sample image with the first convolutional neural network model containing the preset initial weight values to obtain a sample image output value;
Processing the 1st sample text with the second convolutional neural network model containing the preset initial weight values to obtain a sample text output value;
Calculating, with the loss function, the error value between the sample image output value and the sample text output value;
And optimizing the preset initial weight values according to that error value to obtain the weight values of the first convolutional neural network model after the 1st optimization.
To improve the accuracy of the obtained keyword, further, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, the method also includes:
Performing a scale conversion on the multiple crawled sample images so that every sample image reaches the same dimensions. Optionally, the dimensions refer to height, width and RGB format.
In the embodiment of the present invention, to improve training efficiency, further, after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, the method also includes:
Extracting keywords and symbols from the crawled sample text, and encoding the extracted keywords and symbols.
Optionally, the encoding may be performed in a One-Hot manner.
Because the amount of text associated with a single picture can be very large (assume, for example, that the vocabulary of the text corresponding to each picture is on the order of 1000 words), the keyword extraction from the crawled sample text may, in order to compress the training sample size, optionally be performed as follows:
Performing keyword extraction on the crawled sample text using TF-IDF;
Selecting keywords from the extracted keywords, and normalizing the weight values of the selected keywords.
Selecting keywords from the extracted keywords includes:
Determining the keyword weight values of the extracted keywords, sorting the keyword weight values in ascending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the selected keywords; or
Determining the keyword weight values of the extracted keywords, sorting the keyword weight values in descending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the selected keywords.
For example, the weight values are sorted and the weight values of the first 20 keywords after sorting are normalized.
In the embodiment of the present invention, when training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, optionally, the following manner may be used:
Training the first convolutional neural network model and the second convolutional neural network model with the scale-converted sample images or the encoded sample text.
In the embodiment of the present invention, when processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain the target keyword, optionally, the following manner may be used:
Processing the image to be understood with the first convolutional neural network model to obtain the first output value;
Processing the first output value to obtain the second output value;
Performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword. A sketch of this inference path is given below.
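An illustrative sketch of this three-stage inference path in PyTorch, with the same simplifying assumptions as before: the networks are toy stand-ins, the split into first and second output values is one possible reading, and the "deconvolution" that maps the second output value back into keyword space is approximated by a pseudo-inverse of the text branch's fully connected layer rather than the patent's exact operation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two trained models (see the training sketch above).
image_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 16))
word_fc = nn.Linear(1000, 16, bias=False)       # second model's fully connected layer
keywords = ["kw_%d" % i for i in range(1000)]   # invented keyword vocabulary

image = torch.randn(1, 3, 224, 224)             # stand-in for the image to be understood

with torch.no_grad():
    first_output = image_net[:-1](image)         # first output value: image features
    second_output = image_net[-1](first_output)  # second output value: image_fc features

    # "Deconvolution" through the second model, read here as mapping the image_fc
    # features back through the pseudo-inverse of word_fc into keyword weights.
    weight_pinv = torch.linalg.pinv(word_fc.weight)  # (16 x 1000) layer -> (1000 x 16) pinv
    keyword_weights = second_output @ weight_pinv.T  # (1 x 16) @ (16 x 1000) -> (1 x 1000)
    target_keyword = keywords[int(keyword_weights.argmax())]

print(target_keyword)
```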
In the embodiment of the present invention, optionally, performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword includes:
Performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
Taking the keyword with the highest weight value among the multiple keywords as the target keyword.
For example, the first convolutional neural network model contains 5 groups of convolutions, 2 fully connected image-feature layers and 1 fully connected classification-feature layer denoted image_fc, and the second convolutional neural network model uses 1 convolution and 1 fully connected classification-feature layer denoted word_fc. To ensure that the fully connected classification-feature output value of an image and that of its corresponding text are identical, the loss function given above may be used, where f denotes the network applied to the picture and to the text, respectively.
Example devices
Having described the method of the exemplary embodiment of the present invention, a device for image understanding according to an exemplary embodiment of the present invention is described next with reference to Fig. 3.
Referring to Fig. 3, a device 30 of image understanding is also proposed, including:
A processing unit 300, configured to process the image to be understood with the first convolutional neural network model to obtain the first output value;
The processing unit 300 being further configured to process the first output value to obtain the second output value;
An arithmetic unit 310, configured to perform a deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword;
A determining unit 320, configured to take the target keyword as the description of the image to be understood.
To improve the accuracy of the obtained keyword, further, the device 30 also includes a capture unit 330 and a training unit 340, wherein:
The capture unit 330 is configured to crawl sample images and the sample text corresponding to each sample image from training web pages;
The training unit 340 is configured to train the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text.
In the embodiment of the present invention, optionally, the training unit 340 includes a first image output value determining unit 340A1, a first text output value determining unit 340B1, a first error value calculating unit 340C1 and a first optimization unit 340D1, wherein:
The first image output value determining unit 340A1 is configured to process the 2nd sample image with the first convolutional neural network model containing the weight values after the 1st optimization, to obtain a sample image output value;
The first text output value determining unit 340B1 is configured to determine the target sample text output value of the sample text corresponding to the 2nd sample image;
The first error value calculating unit 340C1 is configured to calculate, with a loss function, the error value between the sample image output value and the target sample text output value;
The first optimization unit 340D1 is configured to optimize again, according to that error value, the weight values after the 1st optimization, to obtain the weight values after the 2nd optimization;
The first image output value determining unit 340A1 is further configured to return to the step of processing the 3rd sample image with the first convolutional neural network model containing the weight values after the 2nd optimization to obtain a sample image output value, until the weight values have been optimized N times.
Described above is the process of optimizing the weight values of the first convolutional neural network model from the 2nd optimization through the N-th optimization. Because those optimizations build on the 1st optimization, the embodiment of the present invention also gives a method for the 1st optimization of the weight values; optionally, the weight values after the 1st optimization are obtained as follows:
Processing the 1st sample image with the first convolutional neural network model containing the preset initial weight values to obtain a sample image output value, and determining the target sample text output value of the sample text corresponding to the 1st sample image;
Calculating, with the loss function, the error value between the sample image output value and the target sample text output value;
And optimizing the preset initial weight values according to that error value to obtain the weight values after the 1st optimization.
Described above is the method of training the first convolutional neural network model. The embodiment of the present invention also proposes a method of training the second convolutional neural network model, in which the training unit 340 includes a second text output value determining unit 340A2, a second image output value determining unit 340B2, a second error value calculating unit 340C2 and a second optimization unit 340D2, wherein:
The second text output value determining unit 340A2 is configured to process the 2nd sample text with the second convolutional neural network model containing the weight values after the 1st optimization, to obtain a sample text output value;
The second image output value determining unit 340B2 is configured to determine the target sample image output value of the sample image corresponding to the 2nd sample text;
The second error value calculating unit 340C2 is configured to calculate, with the loss function, the error value between the sample text output value and the target sample image output value;
The second optimization unit 340D2 is configured to optimize again, according to that error value, the weight values after the 1st optimization, to obtain the weight values after the 2nd optimization;
The second text output value determining unit 340A2 is further configured to return to the step of processing the 3rd sample text with the second convolutional neural network model containing the weight values after the 2nd optimization to obtain a sample text output value, until the weight values have been optimized N times.
Described above is the process of optimizing the weight values of the second convolutional neural network model from the 2nd optimization through the N-th optimization. Because those optimizations build on the 1st optimization, the embodiment of the present invention also proposes a method for the 1st optimization of the weight values of the second convolutional neural network model; optionally, the weight values after the 1st optimization are obtained as follows:
Processing the 1st sample text with the second convolutional neural network model containing the preset initial weight values to obtain a sample text output value, and determining the target sample image output value of the sample image corresponding to the 1st sample text;
Calculating, with the loss function, the error value between the sample text output value and the target sample image output value;
And optimizing the preset initial weight values according to that error value to obtain the weight values after the 1st optimization.
To improve the accuracy of the training results, further, the device also includes a scale conversion unit 350, configured to perform a scale conversion on the multiple crawled sample images so that every sample image reaches the same dimensions. Optionally, the dimensions refer to height, width and RGB format.
In the embodiment of the present invention, to improve training efficiency, further, the device also includes an extraction unit 360 and an encoding unit 370, wherein:
The extraction unit 360 is configured to extract keywords and symbols from the crawled sample text;
The encoding unit 370 is configured to encode the extracted keywords and symbols.
Optionally, the encoding may be performed in a One-Hot manner.
Because the amount of text associated with a single picture can be very large (assume, for example, that the vocabulary of the text corresponding to each picture is on the order of 1000 words), when the extraction unit 360 performs keyword extraction on the crawled sample text, in order to compress the training sample size, it specifically:
Performs keyword extraction on the crawled sample text using TF-IDF;
Selects keywords from the extracted keywords, and normalizes the weight values of the selected keywords.
Optionally, when the extraction unit 360 selects keywords from the extracted keywords, it specifically:
Determines the keyword weight values of the extracted keywords, sorts the keyword weight values in ascending order, and takes the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the selected keywords; or
Determines the keyword weight values of the extracted keywords, sorts the keyword weight values in descending order, and takes the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the selected keywords.
For example, the weight values are sorted and the weight values of the first 20 keywords after sorting are normalized.
In the embodiment of the present invention, optionally, when the training unit 340 trains the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, it specifically:
Trains the first convolutional neural network model and the second convolutional neural network model with the scale-converted sample images or the encoded sample text.
In the embodiment of the present invention, optionally, when the arithmetic unit 310 performs the deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword, it specifically:
Performs the deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
Takes the keyword with the highest weight value among the multiple keywords as the target keyword.
For example, the first convolutional neural network model contains 5 groups of convolutions, 2 fully connected image-feature layers and 1 fully connected classification-feature layer denoted image_fc, and the second convolutional neural network model uses 1 convolution and 1 fully connected classification-feature layer denoted word_fc. To ensure that the fully connected classification-feature output value of an image and that of its corresponding text are as close as possible to identical, the loss function given above may be used, where f denotes the network applied to the picture and to the text, respectively.
Referring to Fig. 4, in the embodiment of the present invention, a device 40 of image understanding is also proposed, including:
A capture unit 400, configured to crawl sample images and the sample text corresponding to each sample image from training web pages;
A training unit 410, configured to train the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model;
A processing unit 420, configured to process the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model, to obtain the target keyword;
A determining unit 430, configured to take the target keyword as the description of the image to be understood.
In the embodiment of the present invention, optionally, when the training unit 410 trains the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, it specifically:
Calculates, with a loss function, the error value between the sample image output value of the first convolutional neural network model and the sample text output value of the second convolutional neural network model, and minimizes that error value so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model.
In the embodiment of the present invention, optionally, the training unit 410 includes a first image output value determining unit 410A, a first text output value determining unit 410B, a first error value calculating unit 410C and a first optimization unit 410D, wherein:
The first image output value determining unit 410A is configured to process the 2nd sample image with the first convolutional neural network model containing the weight values after the 1st optimization, to obtain a sample image output value;
The first text output value determining unit 410B is configured to process the 2nd sample text with the second convolutional neural network model containing the weight values after the 1st optimization, to obtain a sample text output value;
The first error value calculating unit 410C is configured to calculate, with the loss function, the error value between the sample image output value and the sample text output value;
The first optimization unit 410D is configured to optimize again, according to that error value, the weight values of the first convolutional neural network model after the 1st optimization and the weight values of the second convolutional neural network model, to obtain the weight values of the first convolutional neural network model and of the second convolutional neural network model after the 2nd optimization; and to repeat the above steps until the weight values of the first convolutional neural network model and the weight values of the second convolutional neural network model have been optimized N times.
In the embodiment of the present invention, optionally, the weight values of the first convolutional neural network model after the 1st optimization are obtained as follows:
Processing the 1st sample image with the first convolutional neural network model containing the preset initial weight values to obtain a sample image output value;
Processing the 1st sample text with the second convolutional neural network model containing the preset initial weight values to obtain a sample text output value;
Calculating, with the loss function, the error value between the sample image output value and the sample text output value;
And optimizing the preset initial weight values according to that error value to obtain the weight values of the first convolutional neural network model after the 1st optimization.
To improve the accuracy of the obtained keyword, further, the device also includes a scale conversion unit 440, configured to perform a scale conversion on the multiple crawled sample images so that every sample image reaches the same dimensions. Optionally, the dimensions refer to height, width and RGB format.
In the embodiment of the present invention, to improve training efficiency, further, the device also includes an extraction unit 450 and an encoding unit 460, wherein:
The extraction unit 450 is configured to extract keywords and symbols from the crawled sample text;
The encoding unit 460 is configured to encode the extracted keywords and symbols.
Optionally, the encoding may be performed in a One-Hot manner.
Because the amount of text associated with a single picture can be very large (assume, for example, that the vocabulary of the text corresponding to each picture is on the order of 1000 words), when the extraction unit 450 performs keyword extraction on the crawled sample text, in order to compress the training sample size, it specifically:
Performs keyword extraction on the crawled sample text using term frequency-inverse document frequency (TF-IDF);
Selects keywords from the extracted keywords, and normalizes the weight values of the selected keywords.
When the extraction unit 450 selects keywords from the extracted keywords, it specifically:
Determines the keyword weight values of the extracted keywords, sorts the keyword weight values in ascending order, and takes the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the selected keywords; or
Determines the keyword weight values of the extracted keywords, sorts the keyword weight values in descending order, and takes the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the selected keywords.
For example, the weight values are sorted and the weight values of the first 20 keywords after sorting are normalized.
In the embodiment of the present invention, optionally, when the training unit 410 trains the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, it specifically:
Trains the first convolutional neural network model and the second convolutional neural network model with the scale-converted sample images or the encoded sample text.
In the embodiment of the present invention, optionally, when the processing unit 420 processes the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain the target keyword, it specifically:
Processes the image to be understood with the first convolutional neural network model to obtain the first output value;
Processes the first output value to obtain the second output value;
Performs a deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword.
In the embodiment of the present invention, optionally, when the processing unit 420 performs the deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword, it specifically:
Performs the deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
Takes the keyword with the highest weight value among the multiple keywords as the target keyword.
For example, the first convolutional neural network model contains 5 groups of convolutions, 2 fully connected image-feature layers and 1 fully connected classification-feature layer denoted image_fc, and the second convolutional neural network model uses 1 convolution and 1 fully connected classification-feature layer denoted word_fc. To ensure that the fully connected classification-feature output value of an image and that of its corresponding text are identical, the loss function given above may be used, where f denotes the network applied to the picture and to the text, respectively.
Example devices
Having described the method and device of the exemplary embodiments of the present invention, a device for image understanding according to another exemplary embodiment of the present invention is introduced next.
A person of ordinary skill in the art will understand that the various aspects of the present invention may be implemented as a system, a method or a program product. Therefore, the various aspects of the present invention may take the following forms: a complete hardware embodiment, a complete software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", a "module" or a "system".
In some possible embodiments, the device for image understanding according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the method of image understanding according to the various exemplary embodiments of the present invention described in the "Exemplary methods" part of this specification. For example, the processing unit may perform step 100 shown in Fig. 1: processing the image to be understood with the first convolutional neural network model to obtain the first output value; step 110: processing the first output value to obtain the second output value; step 120: performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword, and taking the target keyword as the description of the image to be understood.
As another example, the processing unit may perform step 200 shown in Fig. 2: crawling sample images and the sample text corresponding to each sample image from training web pages; step 210: training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model; step 220: processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain the target keyword; step 230: taking the target keyword as the description of the image to be understood.
The device for image understanding according to this embodiment of the present invention is described below with reference to Fig. 5. The device 50 for image understanding shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the device 50 for image understanding takes the form of a general-purpose computing device. The components of the device 50 for image understanding may include, but are not limited to: at least one processing unit 16, at least one storage unit 28, and a bus 18 connecting the different system components (including the storage unit 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
The storage unit 28 may include computer-readable storage media in the form of volatile memory, such as a random access memory (RAM) 31 and/or a cache memory 32, and may further include a read-only memory (ROM) 34.
The storage unit 28 may also include a program/utility 41 having a set of (at least one) program modules 42; such program modules 42 include, but are not limited to: an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment.
The device 50 for image understanding may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the device 50 for image understanding, and/or with any device (such as a router, a modem, etc.) that enables the device 50 for image understanding to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the device 50 for image understanding may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 21. As shown, the network adapter 21 communicates with the other modules of the device 50 for image understanding through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the device 50 for image understanding, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Exemplary program product
In some possible embodiments, the various aspects of the present invention may also be implemented in the form of a program product, which includes program code; when the program product runs on a device, the program code is used to cause the device to perform the steps of the method of image understanding according to the various exemplary embodiments of the present invention described in the "Exemplary methods" part of this specification. For example, the device may perform step 100 shown in Fig. 1: processing the image to be understood with the first convolutional neural network model to obtain the first output value; step 110: processing the first output value to obtain the second output value; step 120: performing a deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword, and taking the target keyword as the description of the image to be understood.
As another example, the device may perform step 200 shown in Fig. 2: crawling sample images and the sample text corresponding to each sample image from training web pages; step 210: training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, so as to optimize the weight values of the first convolutional neural network model and of the second convolutional neural network model; step 220: processing the image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain the target keyword; step 230: taking the target keyword as the description of the image to be understood.
The program product may employ any combination of one or more computer-readable media. A computer-readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
As shown in Fig. 6, a program product 60 for image understanding according to the embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device.
The program code contained on the readable medium may be transmitted by any appropriate medium, including, but not limited to, wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as an independent software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that, although several devices or sub-devices of the device for image understanding are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to the embodiments of the present invention, the features and functions of two or more devices described above may be embodied in one device; conversely, the features and functions of one device described above may be further divided and embodied by multiple devices.
In addition, although the operations of the method of the present invention are described in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution.
Although the spirit and principles of the present invention have been described with reference to several embodiments, it should be understood that the present invention is not limited to the disclosed embodiments, and the division into aspects does not mean that features in these aspects cannot be combined to advantage; such division is merely for convenience of presentation. The present invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (14)

1. A method of image understanding, comprising:
Processing an image to be understood with a first convolutional neural network model to obtain a first output value;
Processing the first output value to obtain a second output value;
Performing a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword, and taking the target keyword as a description of the image to be understood.
2. The method according to claim 1, wherein before processing the image to be understood with the first convolutional neural network model, the method further comprises:
Crawling sample images and sample text corresponding to each sample image from training web pages;
Training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text.
3. The method according to claim 2, wherein training the first convolutional neural network model with the crawled sample images and sample text comprises:
Processing a 2nd sample image with the first convolutional neural network model containing weight values after a 1st optimization to obtain a sample image output value, and determining a target sample text output value of the sample text corresponding to the 2nd sample image;
Calculating, with a loss function, an error value between the sample image output value and the target sample text output value;
Optimizing again, according to the error value, the weight values after the 1st optimization to obtain weight values after a 2nd optimization;
Returning to the step of processing a 3rd sample image with the first convolutional neural network model containing the weight values after the 2nd optimization to obtain a sample image output value, until the weight values have been optimized N times.
4. The method according to claim 3, wherein the weight values after the 1st optimization are obtained as follows:
Processing a 1st sample image with the first convolutional neural network model containing preset initial weight values to obtain a sample image output value, and determining a target sample text output value of the sample text corresponding to the 1st sample image;
Calculating, with the loss function, an error value between the sample image output value and the target sample text output value;
And optimizing the preset initial weight values according to the error value to obtain the weight values after the 1st optimization.
5. The method according to claim 2, wherein training the second convolutional neural network model with the crawled sample images and sample text comprises:
Processing a 2nd sample text with the second convolutional neural network model containing weight values after a 1st optimization to obtain a sample text output value, and determining a target sample image output value of the sample image corresponding to the 2nd sample text;
Calculating, with a loss function, an error value between the sample text output value and the target sample image output value;
Optimizing again, according to the error value, the weight values after the 1st optimization to obtain weight values after a 2nd optimization;
Returning to the step of processing a 3rd sample text with the second convolutional neural network model containing the weight values after the 2nd optimization to obtain a sample text output value, until the weight values have been optimized N times.
6. The method according to claim 5, wherein the weight values after the 1st optimization are obtained as follows:
Processing a 1st sample text with the second convolutional neural network model containing preset initial weight values to obtain a sample text output value, and determining a target sample image output value of the sample image corresponding to the 1st sample text;
Calculating, with the loss function, an error value between the sample text output value and the target sample image output value;
And optimizing the preset initial weight values according to the error value to obtain the weight values after the 1st optimization.
7. The method according to claim 2, wherein after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, the method further comprises:
Performing a scale conversion on the multiple crawled sample images so that every sample image reaches the same dimensions.
8. The method according to claim 2, wherein after crawling the sample images and the sample text corresponding to each sample image from the training web pages, and before training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text, the method further comprises:
Extracting keywords and symbols from the crawled sample text, and encoding the extracted keywords and symbols.
9. The method according to claim 7 or 8, wherein performing keyword extraction on the crawled sample text comprises:
Performing keyword extraction on the crawled sample text using term frequency-inverse document frequency (TF-IDF);
Selecting keywords from the extracted keywords, and normalizing weight values of the selected keywords.
10. The method according to claim 9, wherein selecting keywords from the extracted keywords comprises:
Determining keyword weight values of the extracted keywords, sorting the keyword weight values in ascending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the first position as the selected keywords; or
Determining keyword weight values of the extracted keywords, sorting the keyword weight values in descending order, and taking the keywords corresponding to a number of consecutive keyword weight values starting from the last position as the selected keywords.
11. The method according to any one of claims 7-10, wherein training the first convolutional neural network model and the second convolutional neural network model with the crawled sample images and sample text comprises:
Training the first convolutional neural network model and the second convolutional neural network model with the scale-converted sample images or the encoded sample text.
12. The method according to claim 1, wherein performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain the target keyword comprises:
Performing the deconvolution operation on the second convolutional neural network model according to the second output value to obtain multiple keywords;
Taking the keyword with the highest weight value among the multiple keywords as the target keyword.
13. A device of image understanding, comprising:
A processing unit, configured to process an image to be understood with a first convolutional neural network model to obtain a first output value;
The processing unit being further configured to process the first output value to obtain a second output value;
An arithmetic unit, configured to perform a deconvolution operation on a second convolutional neural network model according to the second output value to obtain a target keyword;
A determining unit, configured to take the target keyword as a description of the image to be understood.
14. A method of image understanding, comprising:
Crawling sample images and sample text corresponding to each sample image from training web pages;
Training a first convolutional neural network model and a second convolutional neural network model with the crawled sample images and sample text, so as to optimize weight values of the first convolutional neural network model and of the second convolutional neural network model;
Processing an image to be understood according to the trained first convolutional neural network model and second convolutional neural network model to obtain a target keyword;
Taking the target keyword as a description of the image to be understood.
CN201710353016.4A 2017-05-18 2017-05-18 Image understanding method and device Active CN107194407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710353016.4A CN107194407B (en) 2017-05-18 2017-05-18 Image understanding method and device


Publications (2)

Publication Number Publication Date
CN107194407A true CN107194407A (en) 2017-09-22
CN107194407B CN107194407B (en) 2020-04-07

Family

ID=59874188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710353016.4A Active CN107194407B (en) 2017-05-18 2017-05-18 Image understanding method and device

Country Status (1)

Country Link
CN (1) CN107194407B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268629A (en) * 2018-01-15 2018-07-10 北京市商汤科技开发有限公司 Image Description Methods and device, equipment, medium, program based on keyword
CN108520335A (en) * 2018-03-20 2018-09-11 顺丰科技有限公司 Inspect object prediction method, apparatus, equipment and its storage medium by random samples
CN110430357A (en) * 2019-03-26 2019-11-08 华为技术有限公司 A kind of image capturing method and electronic equipment
CN110502969A (en) * 2019-07-03 2019-11-26 国网江西省电力有限公司检修分公司 A kind of paper material key message extraction method
CN111368693A (en) * 2020-02-28 2020-07-03 中国建设银行股份有限公司 Identification method and device for identity card information
CN111368697A (en) * 2020-02-28 2020-07-03 中国建设银行股份有限公司 Information identification method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361363A (en) * 2014-11-25 2015-02-18 中国科学院自动化研究所 Deep deconvolution feature learning network, generating method thereof and image classifying method
CN105335754A (en) * 2015-10-29 2016-02-17 小米科技有限责任公司 Character recognition method and device
CN105809704A (en) * 2016-03-30 2016-07-27 北京小米移动软件有限公司 Method and device for identifying image definition
US20160307072A1 (en) * 2015-04-17 2016-10-20 Nec Laboratories America, Inc. Fine-grained Image Classification by Exploring Bipartite-Graph Labels
CN106372390A (en) * 2016-08-25 2017-02-01 姹ゅ钩 Deep convolutional neural network-based lung cancer preventing self-service health cloud service system
CN106485235A (en) * 2016-10-24 2017-03-08 厦门美图之家科技有限公司 A kind of convolutional neural networks generation method, age recognition methods and relevant apparatus
CN106557768A (en) * 2016-11-25 2017-04-05 北京小米移动软件有限公司 The method and device is identified by word in picture
CN106650690A (en) * 2016-12-30 2017-05-10 东华大学 Night vision image scene identification method based on deep convolution-deconvolution neural network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANDREAS P.STREICH 等: "Classification of Multi-labeled Data:A Generative Approach", 《ECML PKDD 2008》 *


Also Published As

Publication number Publication date
CN107194407B (en) 2020-04-07

Similar Documents

Publication Publication Date Title
CN107194407A (en) A kind of method and apparatus of image understanding
Becker et al. Interpreting and explaining deep neural networks for classification of audio signals
US20200159755A1 (en) Summary generating apparatus, summary generating method and computer program
Lu et al. Hierarchical question-image co-attention for visual question answering
CN107562918A (en) A kind of mathematical problem knowledge point discovery and batch label acquisition method
CN103824055B (en) A kind of face identification method based on cascade neural network
CN109743311B (en) WebShell detection method, device and storage medium
CN107797985A (en) Establish synonymous discriminating model and differentiate the method, apparatus of synonymous text
CN107704456B (en) Identification control method and identification control device
KR20200007900A (en) Generation of Points of Interest Text
CN112686056B (en) Emotion classification method
CN114676704B (en) Sentence emotion analysis method, device and equipment and storage medium
CN114942984B (en) Pre-training and image-text retrieval method and device for visual scene text fusion model
CN111028006B (en) Service delivery auxiliary method, service delivery method and related device
CN108121702A (en) Mathematics subjective item reads and appraises method and system
CN110377905A (en) Semantic expressiveness processing method and processing device, computer equipment and the readable medium of sentence
CN107315741A (en) Bilingual dictionary construction method and equipment
JP2018026098A (en) Identification control method and identification control device
CN108664512A (en) Text object sorting technique and device
CN105760363A (en) Text file word sense disambiguation method and device
CN110532372A (en) A kind of accurate method for pushing of text object excavating further feature based on neural collaborative filtering
CN115146621A (en) Training method, application method, device and equipment of text error correction model
CN116337448A (en) Method, device and storage medium for diagnosing faults of transfer learning bearing based on width multi-scale space-time attention
CN113779190B (en) Event causal relationship identification method, device, electronic equipment and storage medium
JP5626964B2 (en) Public information privacy protection device, public information privacy protection method and program

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant