CN108021897A - Picture question answering method and device - Google Patents

Picture question answering method and device

Info

Publication number
CN108021897A
CN108021897A (application CN201711308733.1A)
Authority
CN
China
Prior art keywords
picture
rnn
feature vector
neural network
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711308733.1A
Other languages
Chinese (zh)
Other versions
CN108021897B (en)
Inventor
杨松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN201711308733.1A priority Critical patent/CN108021897B/en
Publication of CN108021897A publication Critical patent/CN108021897A/en
Application granted granted Critical
Publication of CN108021897B publication Critical patent/CN108021897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a picture question answering method and device, including: receiving a picture and a question related to the picture; extracting an image feature vector of the picture through a CNN; extracting an encoding feature vector of the question through a first RNN; concatenating the image feature vector and the encoding feature vector to obtain a joint feature vector; and decoding the joint feature vector through a decoding neural network to generate an answer to the question; wherein the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network. According to the picture question answering method and device of the embodiments of the present disclosure, an answer to a question related to a picture can be generated, and since the CNN, the first RNN and the decoding neural network used in the respective steps are obtained by end-to-end training of the three as a whole, accumulated errors are eliminated and the accuracy of question answering is improved.

Description

Picture question answering method and device
Technical field
The present disclosure relates to the field of information processing technology, and in particular to a picture question answering method and device.
Background
With the development of information processing technology, intelligent question answering has received increasing attention. An intelligent question answering system can automatically answer questions posed by users in a question-and-answer form. For example, a question answering system for pictures is given a picture and a question about that picture, and the system answers the question automatically. Such a system can, for example, help visually impaired people understand the world. In the related art, the accuracy of the answers provided by intelligent question answering systems is relatively low.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a picture question answering method and device capable of providing accurate answers to questions about a picture.
According to a first aspect of the embodiments of the present disclosure, there is provided a picture question answering method, including: receiving a picture and a question related to the picture; extracting an image feature vector of the picture through a CNN; extracting an encoding feature vector of the question through a first RNN; concatenating the image feature vector and the encoding feature vector to obtain a joint feature vector; and decoding the joint feature vector through a decoding neural network to generate an answer to the question; wherein the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network.
In a possible implementation, extracting the image feature vector of the picture through the CNN includes: scaling the received picture to a fixed size to obtain a standard picture; inputting the standard picture into the CNN; and determining the output result of the CNN as the image feature vector of the picture.
In a possible implementation, extracting the encoding feature vector of the question through the first RNN includes: segmenting the question into words to obtain the word embedding sequence corresponding to the question; inputting the word embedding sequence into the first RNN; and determining the output result of the first RNN as the encoding feature vector of the question.
In a possible implementation, the decoding neural network is a multi-layer perceptron MLP, and decoding the joint feature vector through the decoding neural network to generate the answer to the question includes: inputting the joint feature vector into the MLP network to obtain a feature set; classifying the feature set using a classifier to obtain the class of the feature set; and determining the class of the feature set as the answer to the question.
In a possible implementation, the decoding neural network is an RNN network, and decoding the joint feature vector through the decoding neural network to generate the answer to the question includes: inputting the joint feature vector into a second RNN; and determining the output result of the second RNN as the answer to the question.
According to a second aspect of the embodiments of the present disclosure, there is provided a picture question answering device, including: a receiving module, configured to receive a picture and a question related to the picture; a first extraction module, configured to extract an image feature vector of the picture through a convolutional neural network CNN; a second extraction module, configured to extract an encoding feature vector of the question through a first recurrent neural network RNN; a concatenation module, configured to concatenate the image feature vector and the encoding feature vector to obtain a joint feature vector; and a generation module, configured to decode the joint feature vector through a decoding neural network to generate an answer to the question; wherein the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network.
In a possible implementation, the first extraction module includes: a scaling submodule, configured to scale the received picture to a fixed size to obtain a standard picture; an image input submodule, configured to input the standard picture into the CNN; and a first determination submodule, configured to determine the output result of the CNN as the image feature vector of the picture.
In a possible implementation, the second extraction module includes: a word segmentation submodule, configured to segment the question into words to obtain the word embedding sequence corresponding to the question; a sequence input submodule, configured to input the word embedding sequence into the first RNN; and a second determination submodule, configured to determine the output result of the first RNN as the encoding feature vector of the question.
In a possible implementation, the decoding neural network is an MLP, and the generation module includes: a first vector input submodule, configured to input the joint feature vector into the MLP network to obtain a feature set; a classification submodule, configured to classify the feature set using a classifier to obtain the class of the feature set; and a third determination submodule, configured to determine the class of the feature set as the answer to the question.
In a possible implementation, the decoding neural network is an RNN network, and the generation module includes: a second vector input submodule, configured to input the joint feature vector into a second RNN; and a fourth determination submodule, configured to determine the output result of the second RNN as the answer to the question.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects: in the embodiments of the present disclosure, the image feature vector of the picture extracted through the CNN is concatenated with the encoding feature vector of the question extracted through the first RNN to obtain a joint feature vector, and the joint feature vector is decoded through the decoding neural network to generate the answer to the question related to the picture. According to the picture question answering method and device of the embodiments of the present disclosure, an answer to the question related to the picture can be obtained, and since the CNN, the first RNN and the decoding neural network used in the respective steps are obtained by end-to-end training of the three as a whole, accumulated errors are eliminated and the accuracy of question answering is improved.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a picture question answering method according to an exemplary embodiment.
Fig. 2a is an example of a received picture according to an exemplary embodiment.
Fig. 2b is an example of a received picture according to an exemplary embodiment.
Fig. 2c is an example of the neural network connections according to an exemplary embodiment.
Fig. 3 is a flowchart of a picture question answering method according to an exemplary embodiment.
Fig. 4 is a flowchart of a picture question answering method according to an exemplary embodiment.
Fig. 5 is a flowchart of a picture question answering method according to an exemplary embodiment.
Fig. 6 is a flowchart of a picture question answering method according to an exemplary embodiment.
Fig. 7 is a block diagram of a picture question answering device according to an exemplary embodiment.
Fig. 8 is a block diagram of a picture question answering device according to an exemplary embodiment.
Fig. 9 is a block diagram of a picture question answering device according to an exemplary embodiment.
Fig. 10 is a block diagram of a picture question answering device according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments will be described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the disclosure as recited in the appended claims.
Fig. 1 is a flowchart of a picture question answering method according to an exemplary embodiment. As shown in Fig. 1, the picture question answering method may be used in a terminal device or a server and includes the following steps.
In step S11, a picture and a question related to the picture are received.
In step S12, an image feature vector of the picture is extracted through a CNN.
In step S13, an encoding feature vector of the question is extracted through a first RNN.
In step S14, the image feature vector and the encoding feature vector are concatenated to obtain a joint feature vector.
In step S15, the joint feature vector is decoded through a decoding neural network to generate an answer to the question.
Here, the CNN (Convolutional Neural Network), the first RNN (Recurrent Neural Network) and the decoding neural network are neural networks trained in an end-to-end manner with the three forming a single overall network.
By concatenating the image feature vector of the picture extracted by the CNN with the encoding feature vector of the question extracted by the first RNN to obtain a joint feature vector, and decoding the joint feature vector with the decoding neural network, an answer to the question related to the picture is generated. According to the picture question answering method of the embodiments of the present disclosure, an answer to the question related to the picture can be obtained, and since the CNN, the first RNN and the decoding neural network used in the respective steps are obtained by end-to-end training of the three as a whole, accumulated errors are eliminated and the accuracy of question answering is improved.
The picture question answering method of the embodiments of the present disclosure is described below taking a terminal device as an example. The picture received by the terminal device may be of arbitrary size; the present disclosure does not limit this. The question related to the picture may be any question about the picture, for example, the number of specified objects in the picture; the present disclosure does not limit this either. In one example, the terminal device receives the picture shown in Fig. 2a, and the question related to the picture may be "how many dogs are in the picture"; the terminal device receives the picture shown in Fig. 2b, and the question related to the picture may be "what are the people in the picture doing".
In a possible implementation, the terminal device may receive one picture and one question. The terminal device may generate an answer to the question for the picture through steps S12 to S15, thereby completing picture question answering.
In a possible implementation, the terminal device may receive at least two pictures and one question. The terminal device may generate an answer to the question for each picture through steps S12 to S15, so as to perform content-based picture retrieval. For example, if the question is "how many pictures contain a dog", the terminal device may obtain the number of dogs in each picture through steps S12 to S15, and thereby determine the number of pictures that contain a dog.
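As a concrete illustration of this retrieval use, the short Python sketch below runs one question against every picture in a set and counts the pictures whose answer is not "0". It is only a sketch: `answer_question(picture, question)` is a hypothetical wrapper around steps S12 to S15, not a function defined by the disclosure.

```python
# Hypothetical sketch of content-based picture retrieval with a single question.
# answer_question() stands for steps S12-S15 (CNN + first RNN + decoding network);
# it is an assumed wrapper, not an API defined by the disclosure.
def count_pictures_with_dogs(pictures, question="how many dogs are in the picture"):
    matching = 0
    for picture in pictures:
        answer = answer_question(picture, question)  # e.g. "0", "1", "2", ...
        if answer.strip() != "0":
            matching += 1
    return matching
```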
A CNN can be used for picture processing and for extracting features of a picture. The basic structure of a CNN includes two kinds of layers. The first is the feature extraction layer, in which the input of each neuron is connected to a local receptive field of the previous layer and the corresponding local feature is extracted; once a local feature is extracted, its positional relation to the other features is also determined. The second is the feature mapping layer, in which each computational layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons in a plane share the same weights. A user may define the structure of the CNN (for example, the number of layers, the types of layers, and the connectivity between layers) and then determine the parameters of each layer through training. In step S12, feature extraction of the picture can be performed by the trained CNN; the extracted features can be represented in a vector space to obtain the image feature vector of the picture, so that the picture is mapped into a low-dimensional vector space.
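A minimal sketch of step S12 in PyTorch, assuming a torchvision ResNet backbone whose classification head is replaced so that the network outputs a feature vector; the disclosure does not prescribe any particular CNN architecture, so the choice below is purely illustrative.

```python
import torch
import torchvision.models as models

# Sketch of step S12: use a CNN as an image feature extractor.
# The ResNet-18 backbone and 512-d feature size are assumptions.
cnn = models.resnet18(weights=None)
cnn.fc = torch.nn.Identity()           # drop the classification head, keep the features

picture = torch.randn(1, 3, 224, 224)  # a standard picture of fixed size
image_feature_vector = cnn(picture)    # shape: (1, 512)
```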
An RNN can be used for natural language processing and for extracting features of sequence data. When predicting the next word of a sentence, the preceding words usually need to be used; in an RNN the current output of a sequence is also related to the previous outputs, so the earlier information can be memorized and applied to the computation of the current output, and predictions are made using the associations between preceding and following words. An RNN includes input units, hidden units and output units. In an RNN, one unidirectional information flow goes from the input units to the hidden units, while another unidirectional information flow returns from the output units to the hidden units, and the input of the hidden layer also includes the state of the previous hidden layer. In one example, for a sentence containing 5 words, the RNN can be unrolled into a five-layer network, with each layer corresponding to one word. A user may define the structure of the RNN and then determine the parameters of each layer through training. In step S13, feature extraction of the question can be performed by the trained first RNN, and the extracted features can be represented in a vector space to obtain the encoding feature vector of the question.
The decoding neural network may be any neural network that can process sequence data and whose input dimension is consistent with the dimension of the joint feature vector, for example an RNN or an MLP (Multi-Layer Perceptron). A user may define the type and structure of the decoding neural network and then determine the parameters of each layer through training. In step S15, the joint feature vector can be decoded by the trained decoding neural network, so as to generate the answer to the question.
It should be noted that the decoding neural network may be a differentiable neural network, so that the neural networks used in each step of the picture question answering method of the present disclosure are differentiable, and the several steps can be trained together end to end, thereby removing the accumulated errors between steps and improving the accuracy of question answering.
Fig. 2c is an exemplary diagram of the neural network connections. As shown in Fig. 2c, the output of the CNN and the output of the first RNN are connected to the input of the decoding neural network. The picture is input at input A of the CNN, the question is input at input B of the first RNN, the output of the CNN and the output of the first RNN are concatenated together as the input of the decoding neural network, and the result obtained at output C of the decoding neural network is the answer to the question. When training the CNN, the first RNN and the decoding neural network, according to the picture question answering method of the embodiments of the present disclosure, the CNN, the first RNN and the decoding neural network may be regarded as one overall network, with A and B as the inputs of the overall network and C as its output, and the overall network is trained end to end from the inputs A and B to the output C. The CNN, the first RNN and the decoding neural network do not need to be trained separately, and there is no need to be concerned about the accuracy of any single network. The picture question answering method of the embodiments of the present disclosure trains the CNN, the first RNN and the decoding neural network in an end-to-end manner with the three as a single overall network, and the resulting CNN, first RNN and decoding neural network are used as the neural networks in steps S11 to S15. In this way, the accumulated errors between steps are removed, the powerful learning capability of deep learning is fully exploited, and the accuracy of question answering is effectively improved.
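To make the wiring of Fig. 2c concrete, the sketch below assembles a CNN branch, a first RNN branch, concatenation and an MLP decoder into one PyTorch module, so that a single backward pass updates all three parts end to end. All layer sizes, the ResNet/LSTM choices and the MLP decoder are illustrative assumptions, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PictureQA(nn.Module):
    """End-to-end sketch of Fig. 2c: CNN (input A) + first RNN (input B) -> decoder (output C).
    Sizes and the MLP decoder variant are assumptions for illustration only."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.cnn = models.resnet18(weights=None)      # image branch
        self.cnn.fc = nn.Identity()                   # 512-d image feature vector
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # first RNN
        self.decoder = nn.Sequential(                 # decoding network (MLP variant)
            nn.Linear(512 + hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, picture, question_tokens):
        img_feat = self.cnn(picture)                          # (N, 512)
        _, (h, _) = self.rnn(self.embed(question_tokens))     # h: (1, N, hidden_dim)
        q_feat = h[-1]                                        # encoding feature vector
        joint = torch.cat([img_feat, q_feat], dim=1)          # joint feature vector
        return self.decoder(joint)                            # answer-class logits

# Training the three parts as one overall network (end to end):
model = PictureQA(vocab_size=10000, num_answers=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 8)))
loss = criterion(logits, torch.tensor([3, 7]))
loss.backward()        # gradients flow through CNN, first RNN and decoder together
optimizer.step()
```

Because the loss is back-propagated through the concatenation into both branches, no separate training of the CNN or the first RNN is needed, which is the point of treating the three networks as one overall network.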
Fig. 3 is a flowchart of a picture question answering method according to an exemplary embodiment. As shown in Fig. 3, step S12 of extracting the image feature vector of the picture through the CNN may be implemented as follows.
In step S121, the received picture is scaled to a fixed size to obtain a standard picture.
In step S122, the standard picture is input into the CNN.
In step S123, the output result of the CNN is determined as the image feature vector of the picture.
A CNN is mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling and other forms of distortion. Since the feature extraction layers of the CNN learn from training data, explicit feature extraction is avoided when using the CNN and features are learned implicitly from the training data; furthermore, since the neurons on the same feature map share the same weights, the network can learn in parallel. The special structure of the CNN with locally shared weights has unique advantages in image processing: weight sharing reduces the complexity of the network, and in particular the fact that images with multi-dimensional input vectors can be fed directly into the network avoids the complexity of data reconstruction during feature extraction.
After the structure of the CNN and the parameters of each layer are determined, pictures of a fixed size can be processed. In a possible implementation, the received picture may be scaled to the fixed size using a spatial transformation matrix to obtain the standard picture.
The standard picture is input into the CNN to obtain the feature vector of the standard picture. Considering that the features of the standard picture and the original picture are similar, the feature vector of the standard picture can be used as the image feature vector of the picture.
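A short sketch of steps S121 to S123, assuming the Pillow and torchvision libraries; the fixed size of 224×224 and the file name are assumptions, and the scaling here uses a simple resize rather than the spatial transformation matrix mentioned above.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Steps S121-S123 as a sketch: scale the received picture to a fixed size,
# feed the standard picture to the CNN, take the output as the image feature vector.
cnn = models.resnet18(weights=None)
cnn.fc = torch.nn.Identity()

to_standard = T.Compose([
    T.Resize((224, 224)),   # the fixed size is an assumed value
    T.ToTensor(),
])

picture = Image.open("received_picture.jpg")            # hypothetical file name
standard_picture = to_standard(picture).unsqueeze(0)     # add a batch dimension
image_feature_vector = cnn(standard_picture)             # shape: (1, 512)
```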
Fig. 4 is a flowchart of a picture question answering method according to an exemplary embodiment. As shown in Fig. 4, step S13 of extracting the encoding feature vector of the question through the first RNN may be implemented as follows.
In step S131, the question is segmented into words to obtain the word embedding sequence corresponding to the question.
In step S132, the word embedding sequence is input into the first RNN.
In step S133, the output result of the first RNN is determined as the encoding feature vector of the question.
Since a computer is used to process natural language, the natural language needs to be converted into symbols that a machine can recognize and represented numerically. In a possible implementation, a vector of a specified length may be used to represent a word. For example, each of the words "how many", "dogs", "are", "in" and "picture" in the question "how many dogs are in the picture" can be represented by a vector, and together these vectors can represent the sentence.
Word embedding maps words to low-dimensional (100–500 dimensional) vectors, and similar words can be mapped to similar directions, so that the semantic similarity between different words is encoded. In one example, if the distributions of the neighbouring words of two words are similar, the semantics of the two words are likely to be close. For example, the neighbouring words of "cat" are "pet", "owner", "feed", "rub" and "meow", and the neighbouring words of "dog" are "pet", "owner", "feed", "bite" and "woof", so "cat" and "dog" can be considered semantically similar. For word embedding, a co-occurrence matrix can be built from a corpus, in which each entry can be a collocation measure between two items. Compared with the one-hot encoding of each word used in the related art, word embeddings have a low dimension, are convenient for machine learning models to process and for unsupervised learning, and naturally exhibit a clustering effect.
After the terminal device segments the question, a word sequence is obtained. Applying a word embedding method (for example, word2vec) to each word of the word sequence yields the feature vector of each word, and the feature vectors of all the words in the word sequence are combined to form the word embedding sequence corresponding to the question. After the word embedding sequence corresponding to the question is input into the first RNN, the output of the first RNN can be used as the encoding feature vector of the question.
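A sketch of steps S131 to S133 in PyTorch, assuming a toy whitespace tokenizer and a trainable embedding layer in place of a pre-trained word2vec model; the vocabulary, dimensions and LSTM choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Steps S131-S133 as a sketch: segment the question, look up word embeddings,
# run the first RNN, and use its final hidden state as the encoding feature vector.
vocab = {"<unk>": 0, "how": 1, "many": 2, "dogs": 3, "are": 4, "in": 5, "the": 6, "picture": 7}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)   # word embedding (word2vec vectors could be loaded instead)
first_rnn = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

question = "how many dogs are in the picture"
token_ids = torch.tensor([[vocab.get(w, 0) for w in question.split()]])  # word segmentation (toy tokenizer)
word_embedding_sequence = embed(token_ids)             # (1, seq_len, 300)
_, (h, _) = first_rnn(word_embedding_sequence)
encoding_feature_vector = h[-1]                        # (1, 512)
```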
In a possible implementation, the RNN may be an LSTM (Long Short-Term Memory) network or a GRU (Gated Recurrent Unit) network.
An LSTM can use different functions to compute the state of the hidden layer. In an LSTM, the cells can be regarded as black boxes that store the state preceding the current input; based on the previous state, the current memory and the current input, these cells can decide which cells to suppress and which to activate.
In a GRU, words at different positions in the sequence have different influences on the current hidden state: the earlier a word is, the smaller its influence, i.e., the influence of each earlier position on the current position is distance-weighted, with smaller weights for larger distances. When an error is produced, the error is likely to have been triggered by one or several words, and the weights of the corresponding words can be updated.
Both LSTM and GRU can retain the important features of the word embedding sequence and ensure that they are not lost during propagation over long sequences, so they work well for extracting the encoding feature vector of the question.
After the image feature vector of the picture and the encoding feature vector of the question are obtained, the terminal device can concatenate the image feature vector and the encoding feature vector to obtain the joint feature vector, thereby establishing the association between the two. In a possible implementation, the terminal device can directly append the encoding feature vector to the end of the image feature vector to obtain the joint feature vector.
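The concatenation of step S14 is a single tensor operation; in the PyTorch sketch below the encoding feature vector is appended directly after the image feature vector. The 512-dimensional sizes are assumptions carried over from the earlier sketches.

```python
import torch

# Step S14 as a sketch: append the encoding feature vector to the image feature vector.
image_feature_vector = torch.randn(1, 512)      # from the CNN (assumed size)
encoding_feature_vector = torch.randn(1, 512)   # from the first RNN (assumed size)
joint_feature_vector = torch.cat([image_feature_vector, encoding_feature_vector], dim=1)  # (1, 1024)
```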
Fig. 5 is a flowchart of a picture question answering method according to an exemplary embodiment, in which the decoding neural network is an MLP. As shown in Fig. 5, step S15 of decoding the joint feature vector through the decoding neural network to generate the answer to the question may be implemented as follows.
In step S151, the joint feature vector is input into the MLP network to obtain a feature set.
In step S152, the feature set is classified using a classifier to obtain the class of the feature set.
In step S153, the class of the feature set is determined as the answer to the question.
The terminal device can obtain a data set containing all possible answers, and each possible answer can be regarded as one class, for example "red", "two", "man", "hat", "sky", "animal", "dancing" and so on, amounting to hundreds or thousands of classes in total. The terminal device can use an MLP network to perform further feature abstraction on the joint feature vector to obtain a feature set (composed of a number of probability values), and then classify the feature set using a classifier (softmax); the class with the largest probability can be determined as the class of the feature set, so that the class of the feature set is determined as the answer to the question.
The answer obtained by decoding the joint feature vector through the MLP is a class, and its form is relatively simple. For example, for the picture shown in Fig. 2b and the question "what are the people in the picture doing", the generated answer is "dancing".
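A sketch of steps S151 to S153, assuming a two-layer MLP and a softmax classifier over a small toy answer set; the number of classes, the layer sizes and the toy answers are assumptions.

```python
import torch
import torch.nn as nn

# Steps S151-S153 as a sketch: the MLP abstracts the joint feature vector into a feature
# set, softmax classifies it, and the class with the largest probability is the answer.
answers = ["red", "two", "man", "hat", "sky", "animal", "dancing"]  # toy answer set
mlp = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, len(answers)),
)

joint_feature_vector = torch.randn(1, 1024)
feature_set = mlp(joint_feature_vector)                 # scores over the possible answers
probabilities = torch.softmax(feature_set, dim=1)       # softmax classifier
answer = answers[probabilities.argmax(dim=1).item()]    # class with the largest probability
```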
Fig. 6 is a flowchart of a picture question answering method according to an exemplary embodiment, in which the decoding neural network is an RNN. As shown in Fig. 6, step S15 of decoding the joint feature vector through the decoding neural network to generate the answer to the question may also be implemented as follows.
In step S154, the joint feature vector is input into a second RNN.
In step S155, the output result of the second RNN is determined as the answer to the question.
The first RNN and the second RNN are two different RNNs: the first RNN is consistent with the dimension of the image feature vector, and the second RNN is consistent with the dimension of the joint feature vector. The terminal device can input the joint feature vector into the second RNN, and the output result can be determined as the answer to the question. The second RNN can likewise be an LSTM or a GRU.
When the joint feature vector is decoded through the second RNN, the answer obtained is a sentence of variable length. For example, for the picture shown in Fig. 2b and the question "what are the people in the picture doing", the generated answer is "they are dancing".
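A sketch of steps S154 and S155, assuming a greedy, step-by-step LSTM decoder that conditions on the joint feature vector through its initial hidden state; the conditioning scheme, the begin/end-of-sentence convention and the toy vocabulary are assumptions, since the disclosure only specifies that a second RNN decodes the joint feature vector into a sentence.

```python
import torch
import torch.nn as nn

# Steps S154-S155 as a sketch: a second RNN decodes the joint feature vector into a
# variable-length answer sentence. Greedy decoding and <bos>/<eos> tokens are assumptions.
vocab = ["<bos>", "<eos>", "they", "are", "dancing"]
embed = nn.Embedding(len(vocab), 300)
second_rnn = nn.LSTM(input_size=300, hidden_size=1024, batch_first=True)  # matches the joint vector size
to_vocab = nn.Linear(1024, len(vocab))

joint_feature_vector = torch.randn(1, 1024)
h = joint_feature_vector.unsqueeze(0)          # use the joint vector as the initial hidden state
c = torch.zeros_like(h)
token = torch.tensor([[0]])                    # <bos>
answer_words = []
for _ in range(20):                            # cap the answer length
    out, (h, c) = second_rnn(embed(token), (h, c))
    token = to_vocab(out[:, -1]).argmax(dim=1, keepdim=True)
    word = vocab[token.item()]
    if word == "<eos>":
        break
    answer_words.append(word)
print(" ".join(answer_words))                  # a variable-length sentence, e.g. "they are dancing"
```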
Fig. 7 is a block diagram of a picture question answering device according to an exemplary embodiment. Referring to Fig. 7, the device includes a receiving module 71, a first extraction module 72, a second extraction module 73, a concatenation module 74 and a generation module 75.
The receiving module 71 is configured to receive a picture and a question related to the picture.
The first extraction module 72 is configured to extract an image feature vector of the picture through a convolutional neural network CNN.
The second extraction module 73 is configured to extract an encoding feature vector of the question through a first recurrent neural network RNN.
The concatenation module 74 is configured to concatenate the image feature vector and the encoding feature vector to obtain a joint feature vector.
The generation module 75 is configured to decode the joint feature vector through a decoding neural network to generate an answer to the question.
Here, the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network.
Fig. 8 is a block diagram of a picture question answering device according to an exemplary embodiment. Referring to Fig. 8, in a possible implementation, the first extraction module 72 includes a scaling submodule 721, an image input submodule 722 and a first determination submodule 723.
The scaling submodule 721 is configured to scale the received picture to a fixed size to obtain a standard picture.
The image input submodule 722 is configured to input the standard picture into the CNN.
The first determination submodule 723 is configured to determine the output result of the CNN as the image feature vector of the picture.
In a possible implementation, the second extraction module 73 includes a word segmentation submodule 731, a sequence input submodule 732 and a second determination submodule 733.
The word segmentation submodule 731 is configured to segment the question into words to obtain the word embedding sequence corresponding to the question.
The sequence input submodule 732 is configured to input the word embedding sequence into the first RNN.
The second determination submodule 733 is configured to determine the output result of the first RNN as the encoding feature vector of the question.
In a possible implementation, the decoding neural network is an MLP, and the generation module 75 includes a first vector input submodule 751, a classification submodule 752 and a third determination submodule 753.
The first vector input submodule 751 is configured to input the joint feature vector into the MLP network to obtain a feature set.
The classification submodule 752 is configured to classify the feature set using a classifier to obtain the class of the feature set.
The third determination submodule 753 is configured to determine the class of the feature set as the answer to the question.
In a possible implementation, the decoding neural network is an RNN network, and the generation module 75 further includes a second vector input submodule 754 and a fourth determination submodule 755.
The second vector input submodule 754 is configured to input the joint feature vector into the second RNN.
The fourth determination submodule 755 is configured to determine the output result of the second RNN as the answer to the question.
With regard to the device in the above embodiment, the specific manner in which each module performs its operation has been described in detail in the embodiment of the related method, and will not be elaborated here.
Fig. 9 is a block diagram of a device 800 for picture question answering according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so on.
Referring to Fig. 9, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 typically controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phonebook data, messages, pictures, video, etc. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 800.
The multimedia component 808 includes a screen providing an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor component 814 may detect the on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor component 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the above method.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, which can be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
Fig. 10 is a block diagram of a device 1900 for picture question answering according to an exemplary embodiment. For example, the device 1900 may be provided as a server. Referring to Fig. 10, the device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as applications. An application stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions so as to perform the above method.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like. In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 1932 including instructions, which can be executed by the processing component 1922 of the device 1900 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
Other embodiments of the disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or customary technical means in the art not disclosed herein. The specification and the embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It should be understood that the disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the disclosure is limited only by the appended claims.

Claims (12)

  1. A picture question answering method, characterized by comprising:
    receiving a picture and a question related to the picture;
    extracting an image feature vector of the picture through a convolutional neural network CNN;
    extracting an encoding feature vector of the question through a first recurrent neural network RNN;
    concatenating the image feature vector and the encoding feature vector to obtain a joint feature vector;
    decoding the joint feature vector through a decoding neural network to generate an answer to the question;
    wherein the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network.
  2. The picture question answering method according to claim 1, characterized in that extracting the image feature vector of the picture through the CNN comprises:
    scaling the received picture to a fixed size to obtain a standard picture;
    inputting the standard picture into the CNN;
    determining the output result of the CNN as the image feature vector of the picture.
  3. The picture question answering method according to claim 1, characterized in that extracting the encoding feature vector of the question through the first RNN comprises:
    segmenting the question into words to obtain the word embedding sequence corresponding to the question;
    inputting the word embedding sequence into the first RNN;
    determining the output result of the first RNN as the encoding feature vector of the question.
  4. The picture question answering method according to claim 1, characterized in that the decoding neural network is a multi-layer perceptron MLP, and decoding the joint feature vector through the decoding neural network to generate the answer to the question comprises:
    inputting the joint feature vector into the MLP network to obtain a feature set;
    classifying the feature set using a classifier to obtain a class of the feature set;
    determining the class of the feature set as the answer to the question.
  5. The picture question answering method according to claim 1, characterized in that the decoding neural network is an RNN network, and decoding the joint feature vector through the decoding neural network to generate the answer to the question comprises:
    inputting the joint feature vector into a second RNN;
    determining the output result of the second RNN as the answer to the question.
  6. A picture question answering device, characterized by comprising:
    a receiving module, configured to receive a picture and a question related to the picture;
    a first extraction module, configured to extract an image feature vector of the picture through a convolutional neural network CNN;
    a second extraction module, configured to extract an encoding feature vector of the question through a first recurrent neural network RNN;
    a concatenation module, configured to concatenate the image feature vector and the encoding feature vector to obtain a joint feature vector;
    a generation module, configured to decode the joint feature vector through a decoding neural network to generate an answer to the question;
    wherein the CNN, the first RNN and the decoding neural network are neural networks trained in an end-to-end manner with the three as a single overall network.
  7. The picture question answering device according to claim 6, characterized in that the first extraction module comprises:
    a scaling submodule, configured to scale the received picture to a fixed size to obtain a standard picture;
    an image input submodule, configured to input the standard picture into the CNN;
    a first determination submodule, configured to determine the output result of the CNN as the image feature vector of the picture.
  8. The picture question answering device according to claim 6, characterized in that the second extraction module comprises:
    a word segmentation submodule, configured to segment the question into words to obtain the word embedding sequence corresponding to the question;
    a sequence input submodule, configured to input the word embedding sequence into the first RNN;
    a second determination submodule, configured to determine the output result of the first RNN as the encoding feature vector of the question.
  9. The picture question answering device according to claim 6, characterized in that the decoding neural network is a multi-layer perceptron MLP, and the generation module comprises:
    a first vector input submodule, configured to input the joint feature vector into the MLP network to obtain a feature set;
    a classification submodule, configured to classify the feature set using a classifier to obtain a class of the feature set;
    a third determination submodule, configured to determine the class of the feature set as the answer to the question.
  10. The picture question answering device according to claim 6, characterized in that the decoding neural network is an RNN network, and the generation module comprises:
    a second vector input submodule, configured to input the joint feature vector into a second RNN;
    a fourth determination submodule, configured to determine the output result of the second RNN as the answer to the question.
  11. A picture question answering device, characterized by comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to perform the method according to any one of claims 1 to 5.
  12. A non-transitory computer-readable storage medium, wherein when the instructions in the storage medium are executed by a processor, the method according to any one of claims 1 to 5 is implemented.
CN201711308733.1A 2017-12-11 2017-12-11 Picture question and answer method and device Active CN108021897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711308733.1A CN108021897B (en) 2017-12-11 2017-12-11 Picture question and answer method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711308733.1A CN108021897B (en) 2017-12-11 2017-12-11 Picture question and answer method and device

Publications (2)

Publication Number Publication Date
CN108021897A true CN108021897A (en) 2018-05-11
CN108021897B CN108021897B (en) 2021-06-15

Family

ID=62072794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711308733.1A Active CN108021897B (en) 2017-12-11 2017-12-11 Picture question and answer method and device

Country Status (1)

Country Link
CN (1) CN108021897B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165285A (en) * 2018-08-24 2019-01-08 北京小米智能科技有限公司 Handle the method, apparatus and storage medium of multi-medium data
CN109376963A (en) * 2018-12-10 2019-02-22 杭州世平信息科技有限公司 A kind of criminal case charge law article unified prediction neural network based
CN109886444A (en) * 2018-12-03 2019-06-14 深圳市北斗智能科技有限公司 A kind of traffic passenger flow forecasting, device, equipment and storage medium in short-term
CN110009427A (en) * 2019-04-10 2019-07-12 国网浙江省电力有限公司 A kind of electric power consumption sum intelligent Forecasting based on deep-cycle neural network
CN110287295A (en) * 2019-05-14 2019-09-27 闽江学院 Question and answer robot construction method and system based on small routine
CN111309875A (en) * 2018-12-10 2020-06-19 百度在线网络技术(北京)有限公司 Question answering method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574133A (en) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 Multi-mode intelligent question answering system and method
US9424494B1 (en) * 2016-01-28 2016-08-23 International Business Machines Corporation Pure convolutional neural network localization
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN106951473A (en) * 2017-03-06 2017-07-14 浙江大学 Towards the construction method of the deep vision question answering system of dysopia personage

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN105574133A (en) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 Multi-mode intelligent question answering system and method
US9424494B1 (en) * 2016-01-28 2016-08-23 International Business Machines Corporation Pure convolutional neural network localization
CN106951473A (en) * 2017-03-06 2017-07-14 浙江大学 Towards the construction method of the deep vision question answering system of dysopia personage

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165285A (en) * 2018-08-24 2019-01-08 北京小米智能科技有限公司 Handle the method, apparatus and storage medium of multi-medium data
CN109886444A (en) * 2018-12-03 2019-06-14 深圳市北斗智能科技有限公司 A kind of traffic passenger flow forecasting, device, equipment and storage medium in short-term
CN109886444B (en) * 2018-12-03 2023-07-11 深圳市北斗智能科技有限公司 Short-time traffic passenger flow prediction method, device, equipment and storage medium
CN109376963A (en) * 2018-12-10 2019-02-22 杭州世平信息科技有限公司 A kind of criminal case charge law article unified prediction neural network based
CN111309875A (en) * 2018-12-10 2020-06-19 百度在线网络技术(北京)有限公司 Question answering method, device, equipment and storage medium
CN109376963B (en) * 2018-12-10 2022-04-08 杭州世平信息科技有限公司 Criminal case and criminal name and criminal law joint prediction method based on neural network
CN111309875B (en) * 2018-12-10 2023-08-04 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for answering questions
CN110009427A (en) * 2019-04-10 2019-07-12 国网浙江省电力有限公司 A kind of electric power consumption sum intelligent Forecasting based on deep-cycle neural network
CN110287295A (en) * 2019-05-14 2019-09-27 闽江学院 Question and answer robot construction method and system based on small routine

Also Published As

Publication number Publication date
CN108021897B (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN108021897A (en) Picture answering method and device
JP7096444B2 (en) Image area positioning method, model training method and related equipment
CN105426857B (en) Human face recognition model training method and device
CN107527059A (en) Character recognition method, device and terminal
CN110209784B (en) Message interaction method, computer device and storage medium
CN108399409A (en) Image classification method, device and terminal
CN107221330A (en) Punctuate adding method and device, the device added for punctuate
CN107527053A (en) Object detection method and device
WO2024045444A1 (en) Processing method and apparatus for visual question answering task, and device and non-volatile readable storage medium
CN111491187B (en) Video recommendation method, device, equipment and storage medium
CN109871843A (en) Character identifying method and device, the device for character recognition
CN107845062A (en) image generating method and device
TWI735112B (en) Method, apparatus and electronic device for image generating and storage medium thereof
CN104077597B (en) Image classification method and device
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
CN110298212A (en) Model training method, Emotion identification method, expression display methods and relevant device
CN109360197A (en) Processing method, device, electronic equipment and the storage medium of image
CN112906484B (en) Video frame processing method and device, electronic equipment and storage medium
CN111814538B (en) Method and device for identifying category of target object, electronic equipment and storage medium
CN114880514B (en) Image retrieval method, image retrieval device and storage medium
CN115131604A (en) Multi-label image classification method and device, electronic equipment and storage medium
CN110121719A (en) Device, method and computer program product for deep learning
CN116310983A (en) Multi-mode emotion recognition method and device
CN114693905A (en) Text recognition model construction method, text recognition method and device
CN112269881A (en) Multi-label text classification method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant