WO2019047971A1 - Image recognition method, terminal, and storage medium


Info

Publication number
WO2019047971A1
Authority
WO
WIPO (PCT)
Prior art keywords
input data
network model
model
annotation
timing
Application number
PCT/CN2018/105009
Other languages
English (en)
French (fr)
Inventor
姜文浩
马林
刘威
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2020514506A (JP6972319B2)
Priority to KR1020197036824A (KR102270394B1)
Priority to EP18853742.7A (EP3611663A4)
Publication of WO2019047971A1
Priority to US16/552,738 (US10956771B2)

Classifications

    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06F 16/583: Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/5866: Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Definitions

  • the embodiments of the present application relate to the field of machine learning, and in particular, to an image recognition method, a terminal, and a storage medium.
  • the system framework of image recognition generally includes an encoder (Encoder) and a decoder (Decoder).
  • In the related art, an image recognition method is proposed that includes: first, performing feature extraction on an image by an encoder to obtain a feature vector and a set of annotation vectors (Annotation Vectors), where the feature vector is obtained by global feature extraction of the image and the annotation vector set is obtained by local feature extraction of the image; then, initializing the feature vector to obtain initial input data, which is used to indicate the initial state of the decoder and generally includes initial hidden state information and initial memory cell state information.
  • Finally, manually designed specific information is extracted from the image as guiding information, and based on the guiding information, the annotation vector set and the initial input data are decoded by the decoder to obtain a description sentence of the image.
  • The guiding information is used to guide the decoding process of the decoder so as to improve the quality of the generated description sentence, so that the generated description sentence describes the image more accurately and conforms to its semantics.
  • The embodiments of the present application provide an image recognition method, a terminal, and a storage medium, which can solve the problem in the related art that manually designed guidance information cannot accurately guide the generation of the description sentence, resulting in low quality of the generated description sentence.
  • the technical solution is as follows:
  • In a first aspect, an image recognition method is provided, the method being performed by a terminal.
  • In another aspect, an image recognition apparatus is provided, the apparatus comprising:
  • an extraction module configured to perform feature extraction on a target image to be identified by an encoder, to obtain a feature vector and a first annotation vector set;
  • a processing module configured to perform initialization processing on the feature vector to obtain first initial input data;
  • a generating module configured to generate first guiding information by using a first guiding network model based on the first annotation vector set, where the first guiding network model is configured to generate guiding information according to the annotation vector set of any image;
  • a determining module configured to determine, by a decoder, a description sentence of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
  • In another aspect, a terminal is provided, comprising a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the image recognition method described in the first aspect.
  • In another aspect, a computer-readable storage medium is provided, which stores at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set is loaded and executed by a processor to implement the image recognition method described in the first aspect.
  • a guiding network model is added between the encoder and the decoder.
  • In this way, the guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model can generate guiding information according to the annotation vector set of any image, the guiding information generated by the guiding network model, compared with artificially designed guiding information, is better suited to the generation process of the description sentence of the target image and has high accuracy, so that the decoding process for the target image can be accurately guided, thereby improving the quality of the generated description sentence.
  • FIG. 1 is a schematic diagram of a logical structure of an RNN model provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a logical structure of an LSTM model provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an image recognition system according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another image recognition system according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
  • FIG. 7 is a flowchart of an image recognition method according to an embodiment of the present application.
  • FIG. 8 is a flowchart of another image recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a generating module 303 according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another generation module 303 according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic structural diagram of a determining module 304 according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of another image recognition apparatus according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic structural diagram of another determining module 304 according to an embodiment of the present disclosure.
  • FIG. 15 is a schematic structural diagram of still another image recognition apparatus according to an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application.
  • the encoder is used to encode the image to generate a vector, and the encoder usually adopts a CNN (Convolutional Neural Networks) model.
  • The decoder is used to decode the vector generated by the encoder, that is, to translate it into a description sentence of the image; the decoder usually adopts an RNN (Recurrent Neural Network) model.
  • The guiding information is information obtained by processing the image, usually expressed as a vector, and can be used as part of the decoder input to guide the decoding process. Introducing the guiding information into the decoder can improve the performance of the decoder, ensuring that the decoder generates better description sentences and improving the quality of the generated description sentences.
  • the CNN model refers to a neural network model developed for image classification and recognition based on the traditional multi-layer neural network.
  • The CNN model usually includes multiple convolutional layers and at least one fully connected layer, and can extract features that characterize the image.
  • A traditional neural network has no memory function; that is, its inputs are independent of one another and carry no contextual relationship. In practical applications, however, the input is usually serialized with obvious contextual features, for example when the next word in a description sentence needs to be predicted, in which case the output of the neural network must depend on the previous input. In other words, the neural network is required to have a memory function. The RNN model is a neural network whose nodes are connected in a loop; it has a memory function, and this internal memory can be used to cyclically process the input data.
  • The RNN model includes a three-layer structure of an input layer, a hidden layer, and an output layer, and the hidden layer has a ring (recurrent) structure.
  • the input layer is connected to the hidden layer, and the hidden layer is connected to the output layer.
  • the structure of the RNN model shown on the left side of FIG. 1 is expanded in time series, and the structure shown in the right side of FIG. 1 can be obtained.
  • The input data received by the input layer of the RNN model is data ordered according to a certain time series, that is, sequence data. For convenience of description, the sequence data is denoted x1, x2, ..., xi, ..., xn, the times corresponding to the data in the sequence are denoted t1, t2, ..., ti, ..., tn, and the output data obtained by processing x1, x2, ..., xi, ..., xn respectively is denoted f1, f2, ..., fi, ..., fn.
  • A timing step refers to a step in which the RNN model sequentially processes one piece of input data according to timing.
  • For example, the input data received by the input layer at time t1 is x1; x1 is transmitted to the hidden layer, the hidden layer processes x1, and the processed data is transmitted to the output layer to obtain the output data f1 at time t1.
  • At time t2, the input data received by the input layer is x2, and x2 is transmitted to the hidden layer; the hidden layer processes x2 according to the output data f1 of time t1, and the processed data is transmitted to the output layer to obtain the output data f2 at time t2.
  • In general, at any time ti, the hidden layer not only receives the data xi transmitted by the input layer at time ti but also receives the output data fi-1 of time ti-1, and processes xi according to fi-1 to obtain the output data fi at time ti.
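  • As a concrete illustration of this recurrence, the following is a minimal NumPy sketch in which the hidden layer at each time ti combines the current input xi with the fed-back result of the previous step; the weight matrices and the tanh nonlinearity are illustrative assumptions rather than details taken from this text.

```python
import numpy as np

def simple_rnn(xs, W_in, W_rec, W_out):
    """Process sequence data x1..xn in time order, producing f1..fn.

    W_in, W_rec, W_out are illustrative weight matrices for the input layer,
    the recurrent hidden layer, and the output layer (all assumed).
    """
    f = np.zeros(W_rec.shape[0])      # hidden-layer state before time t1
    outputs = []
    for x in xs:                      # one timing step per input datum
        # the hidden layer combines the current input xi with the fed-back
        # result of the previous step (the role played by fi-1 above)
        f = np.tanh(W_in @ x + W_rec @ f)
        outputs.append(W_out @ f)     # output data fi at time ti
    return outputs
```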
  • the LSTM network model is a special RNN model that can process and predict important events with relatively long intervals and delays in time series.
  • The LSTM network model includes an LSTM unit provided with an input gate, a forget gate, and an output gate, and the input data can be processed at each timing step based on these gates.
  • The LSTM network model includes an LSTM unit, and the LSTM unit has a ring structure. For any timing step t performed by the LSTM unit, the LSTM unit processes the input data xt of timing step t and the output data ft-1 of the previous timing step t-1 to obtain the output data ft of timing step t.
  • For example, after the LSTM unit receives the input data x1 of timing step t1, it processes x1 to obtain the output data f1 of timing step t1; f1 is then fed back into the LSTM unit, so that f1 and x2 can be processed to obtain the output data f2 of timing step t2. This continues until, at timing step tn, the input data xn and the output data fn-1 of timing step tn-1 are processed to obtain the output data fn, where n is the number of times the LSTM network model cyclically processes the input data.
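  • The cyclic processing performed by a single LSTM unit can be sketched as follows; torch.nn.LSTMCell is used here only as a generic stand-in for the LSTM unit described above, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

input_size, hidden_size, n = 16, 32, 5        # assumed dimensions and sequence length
lstm_unit = nn.LSTMCell(input_size, hidden_size)

xs = [torch.randn(1, input_size) for _ in range(n)]   # input data x1 .. xn
h = torch.zeros(1, hidden_size)                        # implicit (hidden) state
c = torch.zeros(1, hidden_size)                        # memory cell state

outputs = []
for x_t in xs:                      # timing steps t1 .. tn
    h, c = lstm_unit(x_t, (h, c))   # output of step t-1 is fed back together with x_t
    outputs.append(h)               # output data f_t of timing step t
```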
  • The review network is an image recognition network based on the encoder-decoder framework, which includes a reviewer and a decoder; both the reviewer and the decoder typically use an RNN model. The reviewer can further mine the interaction between the global features and the local features extracted from the image by the encoder, and generate initial input data for the decoder based on this interaction to improve the performance of the decoder.
  • Embodiments of the present application can be applied to early childhood education, image retrieval, blind reading or chat systems, where images are often automatically translated into natural language.
  • In the early childhood education scenario, the image recognition method provided by the embodiments of the present application can be used to translate the image seen by a young child into a corresponding description sentence, and the description sentence is then converted into voice for playback, so that young children can learn image content by combining the image and the voice.
  • In the image retrieval scenario, the image recognition method provided by the embodiments of the present application may be used to translate an image into a corresponding description sentence, so that images can be accurately classified or retrieved according to their description sentences.
  • For blind reading, the image may first be translated into a corresponding description sentence, and the description sentence is then converted into voice so that a blind person can recognize the image through the voice, or converted into Braille so that the blind person can recognize the image by reading Braille.
  • the image in the chat window can be translated into a corresponding description sentence, and the description sentence is displayed.
  • FIG. 3 is a schematic structural diagram of an image recognition system according to an embodiment of the present application. As shown in FIG. 3, the image recognition system includes an encoder 10, a first boot network model 20, and a decoder 30.
  • the encoder 10 is used for encoding the target image to be identified, that is, performing feature extraction on the target image to obtain a feature vector and a first annotation vector set.
  • the feature vector is used to indicate a global feature of the target image
  • the first set of annotation vectors is used to indicate local features of the target image.
  • After obtaining the first annotation vector set, the encoder 10 can output it to the decoder 30 and the first guiding network model 20, respectively.
  • For the feature vector, the encoder 10 may perform initialization processing on it to obtain the first initial input data and then output the first initial input data to the decoder 30; alternatively, the encoder 10 may output the feature vector to another model, which initializes the feature vector output by the encoder 10 to obtain the first initial input data and outputs the first initial input data to the decoder 30.
  • The first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10 and then output the first guiding information to the decoder 30; the first guiding network model is obtained by training on annotation vector sets of sample images.
  • The decoder 30 is configured to determine a description sentence of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
  • Compared with the related art, the image recognition system shown in FIG. 3 adds a guiding network model between the encoder and the decoder. Since the guiding network model can generate guiding information according to the annotation vector set of any image, the guiding information it generates, compared with artificially designed guiding information, is better suited to the generation process of the description sentence of the target image and has higher accuracy, so that the decoding process for the image can be accurately guided, which improves the quality of the generated description sentence.
  • FIG. 4 is a schematic structural diagram of another image recognition system according to an embodiment of the present application.
  • the image recognition system includes an encoder 10, a first guidance network model 20, a decoder 30, and a multi-example model 40.
  • The multi-instance model 40 is configured to process the target image to be identified to obtain attribute information of the target image, where the attribute information is used to indicate the predicted probability of words appearing in the description sentence of the target image, and to output the attribute information of the target image to the first guiding network model 20.
  • The first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10 and the attribute information of the target image output by the multi-instance model 40.
  • In the image recognition system shown in FIG. 4, the first guiding network model 20 can determine the first guiding information comprehensively according to both the first annotation vector set of the target image and the attribute information, thereby further improving the accuracy of the generated first guiding information.
  • FIG. 5 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
  • the image recognition system includes an encoder 10, a first guiding network model 20, a reviewer 50, a second guiding network model 60, and a decoder 30.
  • the function of the encoder 10 in FIG. 5 is the same as that of the encoder 10 in FIG. 3 .
  • The first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10, and to output the first guiding information to the reviewer 50.
  • The reviewer 50 is configured to determine a second annotation vector set and second initial input data based on the first initial input data, the first annotation vector set, and the first guiding information, to output the second annotation vector set and the second initial input data to the decoder 30, and to output the second annotation vector set to the second guiding network model 60.
  • the second initial input data is the initial input data of the decoder 30 for indicating the initial state of the decoder 30, and may specifically include initial implicit state information and initial memory cell state information.
  • The second guiding network model 60 is configured to generate second guiding information based on the second annotation vector set and to output the second guiding information to the decoder 30; the second guiding network model is also obtained by training on sample images.
  • the decoder 30 is configured to decode the second annotation vector set and the second initial input data based on the second guiding information to obtain a description statement of the target image.
  • In the image recognition system shown in FIG. 5, the interaction between the local features and the global features of the target image can be further mined by the reviewer, so that the generated second annotation vector set and second initial input data can indicate the characteristics of the target image more accurately, which further improves the performance of the image recognition system and thereby the quality of the generated description sentence.
  • FIG. 6 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
  • the image recognition system includes an encoder 10, a first guiding network model 20, a reviewer 50, a second guiding network model 60, a decoder 30, and a multi-instance model 40.
  • The functions of the encoder 10, the reviewer 50, and the decoder 30 in FIG. 6 are the same as those of the corresponding components in FIG. 5; for details, refer to the description of FIG. 5, which is not repeated here.
  • The multi-instance model 40 is used to process the target image to be identified, obtain attribute information of the target image, and output the attribute information of the target image to the first guiding network model 20 and the second guiding network model 60, respectively.
  • The first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10 and the attribute information of the target image output by the multi-instance model 40, and to output the first guiding information to the reviewer 50.
  • The second guiding network model 60 is configured to generate second guiding information based on the second annotation vector set output by the reviewer 50 and the attribute information of the target image output by the multi-instance model 40, and to output the second guiding information to the decoder 30, so that the decoder 30 decodes the second annotation vector set and the second initial input data based on the second guiding information to obtain a description sentence of the target image.
  • In the image recognition system shown in FIG. 6, both the first guiding network model 20 and the second guiding network model 60 can determine the guiding information comprehensively according to the attribute information and the annotation vector set of the target image, further improving the accuracy of the generated guiding information.
  • The image recognition systems shown in FIG. 3 to FIG. 6 can be trained based on a plurality of sample images and the description sentences of the plurality of sample images; that is, the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder can all be obtained through training. This enables the first guiding network model and the second guiding network model to adaptively learn how to generate accurate guiding information during the training process, thereby improving the accuracy of the generated guiding information.
  • FIG. 7 is a flowchart of an image recognition method according to an embodiment of the present disclosure.
  • the method may be performed by a terminal, and the terminal may be a mobile phone, a tablet computer, or a computer.
  • The terminal may include the image recognition system described above, for example, by installing software that carries the image recognition system. Referring to FIG. 7, the method includes:
  • Step 101 Perform feature extraction on the target image to be identified by the encoder, to obtain a feature vector and a first annotation vector set.
  • the target image may be first input into an encoder, and the target image is subjected to feature extraction by an encoder to obtain a feature vector of the target image and a first annotation vector set respectively.
  • Specifically, global feature extraction may be performed on the target image by the encoder to obtain the feature vector, and local feature extraction may be performed on the target image by the encoder to obtain the first annotation vector set.
  • the feature vector is used to indicate a global feature of the target image
  • each annotation vector in the first annotation vector set is used to indicate a local feature of the target image.
  • the encoder may adopt a CNN model.
  • The feature vector may be extracted through the last fully connected layer of the CNN model, and the first annotation vector set may be extracted through the last convolutional layer of the CNN model.
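  • As an illustration of this split, the sketch below uses a torchvision ResNet purely as an example CNN (this text only says that a CNN model may be adopted): the output of the last convolutional stage is flattened into a set of annotation vectors describing image regions, and the globally pooled activation feeding the final fully connected layer is taken as the feature vector. The layer names and sizes are assumptions tied to ResNet, not to this text.

```python
import torch
import torchvision.models as models

cnn = models.resnet101(weights=None)    # example CNN; pretrained weights could be loaded instead
cnn.eval()

image = torch.randn(1, 3, 224, 224)     # one preprocessed target image (dummy values)

with torch.no_grad():
    x = cnn.conv1(image)
    x = cnn.maxpool(cnn.relu(cnn.bn1(x)))
    x = cnn.layer4(cnn.layer3(cnn.layer2(cnn.layer1(x))))    # last convolutional output, (1, 2048, 7, 7)

    # first annotation vector set: one 2048-d vector per spatial location (7x7 = 49 regions)
    annotation_vectors = x.flatten(2).transpose(1, 2)        # (1, 49, 2048)

    # feature vector: globally pooled activation feeding the last fully connected layer
    feature_vector = cnn.avgpool(x).flatten(1)               # (1, 2048)
```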
  • Step 102 Initialize the feature vector to obtain first initial input data.
  • The first initial input data refers to the initial input data to be input into the processing model that follows the encoder and is used to indicate the initial state of that model, which may be a decoder or a reviewer.
  • The first initial input data may include first initial implicit state information and first initial memory cell state information; the first initial implicit state information indicates the initial state of the hidden layer of the next processing model, and the first initial memory cell state information indicates the initial state of the memory unit of the next processing model.
  • the feature vector may be subjected to initialization processing such as linear transformation to obtain first initial input data.
  • The feature vector may be initialized by the encoder to obtain the first initial input data, or the feature vector output by the encoder may be initialized by another model to obtain the first initial input data; this is not limited in the embodiments of the present application.
  • For example, the encoder may include a CNN model for performing feature extraction on the target image and an initialization model for initializing the feature vector; after the encoder extracts features through the CNN model to obtain the feature vector, the feature vector can be initialized by the initialization model to obtain the first initial input data.
  • Alternatively, the encoder may be used only for feature extraction on the target image, and an initialization model is added after the encoder to initialize the feature vector; after the encoder extracts features from the target image to obtain the feature vector, the feature vector may be output to the initialization model, which then initializes it to obtain the first initial input data.
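  • A minimal sketch of the initialization step, assuming the linear transformation mentioned above is a learned affine map from the feature vector to the initial implicit state h0 and the initial memory cell state c0; the parameter names and the tanh squashing are illustrative assumptions.

```python
import numpy as np

def initialize_input_data(feature_vector, W_h, b_h, W_c, b_c):
    """Map the global feature vector to the first initial input data.

    W_h, b_h, W_c, b_c are learned parameters of the initialization model
    (assumed); the tanh squashing is also an illustrative choice.
    """
    h0 = np.tanh(W_h @ feature_vector + b_h)   # first initial implicit state information
    c0 = np.tanh(W_c @ feature_vector + b_c)   # first initial memory cell state information
    return h0, c0
```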
  • Step 103 Generate first guiding information by using a first guiding network model based on the first set of annotation vectors, the first guiding network model for generating guiding information according to the annotation vector set of any image.
  • Based on the first annotation vector set, generating the first guiding information by using the first guiding network model may be implemented in the following two manners.
  • The first implementation manner is: performing linear transformation on the first annotation vector set based on a first matrix formed by the model parameters in the first guiding network model to obtain a second matrix, and determining the first guiding information based on the maximum value of each row in the second matrix.
  • In this implementation, the first guiding network model can be obtained by training on the annotation vector sets of sample images.
  • Specifically, each model in FIG. 3 may be replaced by a corresponding model to be trained, and the resulting image recognition system is trained based on a plurality of sample images and their description sentences. During training, the encoder to be trained extracts annotation vectors from the sample images and outputs them to the guiding network model to be trained, so that after training of the entire image recognition system is completed, the trained guiding network model can be used as the first guiding network model. The encoder to be trained may be an untrained encoder or a pre-trained encoder, which is not limited in the embodiments of the present application. Using a pre-trained encoder to train the guiding network model to be trained can improve the training efficiency of the entire image recognition system and thus of the guiding network model to be trained.
  • the first set of annotation vectors is also in the form of a matrix, and the first matrix is a matrix composed of model parameters of the first guidance network model and used for linear transformation of the first annotation vector set. Specifically, the first set of annotation vectors may be multiplied by the first matrix to linearly transform the first set of annotation vectors to obtain a second matrix.
  • Determining the first guiding information based on the maximum value of each row in the second matrix includes: selecting the maximum value of each row in the second matrix, forming the selected maximum values into a single-column matrix while keeping the number of rows unchanged, and determining the composed matrix as the first guiding information.
  • the first guiding information can be determined by the following formula (1):
  • where the max function takes the maximum value of each row of the matrix to be processed and returns a matrix with the same number of rows and a single column.
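  • A NumPy sketch of this first implementation: the annotation vectors are stacked as a matrix A, linearly transformed by the parameter matrix P (the first matrix), and the row-wise maximum of the result is taken as the first guiding information v. The shapes are illustrative.

```python
import numpy as np

def first_guiding_information(A, P):
    """First implementation of the guiding network model.

    A: first annotation vector set arranged as a matrix (shape assumed, e.g. d x k)
    P: first matrix formed by the model parameters (shape assumed, e.g. k x k')
    """
    second = A @ P                            # linear transformation -> second matrix
    v = second.max(axis=1, keepdims=True)     # max of each row; column count becomes 1
    return v                                  # first guiding information
```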
  • In the second implementation manner, when the first guiding network model is used to generate guiding information according to the annotation vector set and the attribute information of any image, the target image may be input into a multi-instance model, and the multi-instance model processes the target image to obtain attribute information of the target image; the first annotation vector set is linearly transformed based on a third matrix formed by the model parameters in the first guiding network model to obtain a fourth matrix; a fifth matrix is generated based on the fourth matrix and the attribute information of the target image; and the first guiding information is determined based on the maximum value of each row in the fifth matrix.
  • The attribute information of an image is used to indicate the predicted probability of words appearing in the description sentence of that image.
  • The multi-instance model is obtained by training on a plurality of sample images and their description sentences, and is a model capable of outputting the attribute information of an image; that is, the multi-instance model can predict the probability of words that may appear in the description sentence of the image.
  • the attribute information may be MIL (Multi-instance learning) information or the like.
  • In this implementation, the first guiding network model can be obtained by training on the annotation vector sets and attribute information of sample images.
  • Specifically, each model in FIG. 4 may be replaced by a corresponding model to be trained, and the resulting image recognition system is trained based on a plurality of sample images and their description sentences. During training, the encoder to be trained extracts annotation vectors from the sample images and outputs them to the guiding network model to be trained, and the multi-instance model to be trained processes the sample images to obtain attribute information and outputs the attribute information to the guiding network model to be trained; the guiding network model to be trained can thus be trained based on the annotation vectors and the attribute information of the sample images, so that after training of the entire image recognition system is completed, it can be used as the first guiding network model.
  • The encoder to be trained may be an untrained encoder or a pre-trained encoder, and the multi-instance model to be trained may be an untrained multi-instance model or a pre-trained multi-instance model; this is not limited in the embodiments of the present application.
  • The first annotation vector set is also in the form of a matrix, and the third matrix is a matrix composed of model parameters of the first guiding network model and used for linear transformation of the first annotation vector set.
  • the first annotation vector set may be multiplied by the third matrix to linearly transform the first annotation vector set to obtain a fourth matrix, and then generate a fifth matrix based on the fourth matrix and the attribute information of the target image. .
  • Determining the first guiding information based on the maximum value of each row in the fifth matrix includes: selecting the maximum value of each row in the fifth matrix, forming the selected maximum values into a single-column matrix while keeping the number of rows unchanged, and determining the composed matrix as the first guiding information.
  • Specifically, the first guiding information v can be determined by the following formula (2):
  • where the max function takes the maximum value of each row of the matrix to be processed and returns a matrix with the same number of rows and a single column.
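  • A sketch of the second implementation. This text does not spell out how the fourth matrix and the attribute information are combined into the fifth matrix, so the combination below (scaling each column by the corresponding word probability) is purely an illustrative assumption; only the linear transformation and the row-wise maximum follow the description above.

```python
import numpy as np

def guiding_information_with_attributes(A, P3, attr):
    """Second implementation of the guiding network model.

    A:    first annotation vector set as a matrix, shape (d, k)
    P3:   third matrix formed by the model parameters, shape (k, m)
    attr: attribute information from the multi-instance model, shape (m,),
          i.e. predicted probabilities of words appearing in the description sentence
    """
    fourth = A @ P3                        # linear transformation -> fourth matrix
    fifth = fourth * attr                  # ASSUMED way of combining with the attribute information
    v = fifth.max(axis=1, keepdims=True)   # max of each row -> single-column matrix
    return v                               # first guiding information
```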
  • In the embodiments of the present application, the first guiding network model can be obtained through learning; that is, it can be trained on multiple sample images and their description sentences, and the guiding information can be learned automatically during training. Therefore, the first guiding information generated by the first guiding network model has high accuracy, and the generated first guiding information can accurately guide the decoding process of the decoder, thereby improving the quality of the generated description sentence of the target image.
  • Step 104 Determine, by the decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
  • determining, by the decoder, the description statement of the target image may include the following two implementation manners:
  • the first implementation manner is: decoding, according to the first guiding information, the first annotation vector set and the first initial input data by the decoder to obtain a description statement of the target image.
  • the decoder typically employs an RNN model, such as an LSTM network model.
  • the first annotation vector set and the first initial input data are decoded by the decoder, and the description statement of the target image may be obtained by the following steps 1)-3):
  • 1) For each first timing step of M first timing steps performed by the first RNN model, determine the input data of the first timing step based on the first guiding information.
  • Here, M refers to the number of times the first RNN model cyclically processes the input data, M is a positive integer, and each first timing step is one processing step of the first RNN model on the input data.
  • determining the input data of the first timing step based on the first guiding information may include determining, according to the first guiding information, input data of the first timing step by using the following formula (3):
  • where t is the first timing step, xt is the input data of the first timing step, E is a word embedding matrix and a model parameter of the first RNN model, yt is the word corresponding to the first timing step and is determined based on the output data of the previous first timing step, Q is the sixth matrix and a model parameter of the first RNN model, and v is the first guiding information.
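  • Although the body of formula (3) is not reproduced in this text, the symbols listed above suggest input data built from the embedded word and the guiding information. The sketch below assumes an additive combination, which should be read as an illustration of how the listed quantities could be combined rather than as the exact formula.

```python
import numpy as np

def first_timing_step_input(E, y_t_onehot, Q, v):
    """Build the input data x_t of a first timing step.

    E: word embedding matrix, Q: sixth matrix (both model parameters of the
    first RNN model), y_t_onehot: one-hot vector of the word corresponding to
    the step, v: first guiding information. The additive combination is an
    assumption; formula (3) itself is not reproduced in this text.
    """
    return E @ y_t_onehot + Q @ v.ravel()
```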
  • 2) Process the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step by the first RNN model to obtain the output data of the first timing step.
  • The output data of the first timing step may include implicit state information and memory unit state information.
  • If the first timing step is the first of the M first timing steps, the output data of its previous first timing step is determined based on the first initial input data; for example, when the first initial input data includes the first initial implicit state information h0 and the first initial memory cell state information c0, the output data of the previous first timing step of the first first timing step is h0 and c0.
  • the first RNN model used may be an LSTM network model.
  • When the first RNN model is an LSTM network model, determining the output data of the first timing step based on the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step may be abstractly represented by the following formula (4):
  • where t is the first timing step, xt is the input data of the first timing step, ht-1 is the implicit state information of the previous first timing step, and LSTM(·) represents the processing of the LSTM network model.
  • the processing of the LSTM network model can be expressed by the following formula:
  • where it, ft, ct, and ot are respectively the values of the input gate, the forget gate, the memory cell, and the output gate at the first timing step, σ(·) is an activation function of the LSTM network model, such as a sigmoid function, tanh(·) is the hyperbolic tangent function, T is a matrix used for linear transformation, xt is the input data of the first timing step, ht-1 is the implicit state information of the previous first timing step, dt is the target data determined based on the first annotation vector set, ct is the memory cell state information of the first timing step, ct-1 is the memory cell state information of the previous first timing step, and ht is the implicit state information of the first timing step.
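  • The standard LSTM gate equations matching the symbols listed above are sketched below in NumPy. Since the formula itself is not reproduced here, the exact form, in particular how xt, ht-1, and the target data dt enter the linear transformation T, is a conventional choice rather than a verbatim copy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_timing_step(x_t, h_prev, c_prev, d_t, T, b):
    """One first timing step of an LSTM that also consumes the target data d_t.

    T is a linear-transformation matrix producing the four gate pre-activations
    from the concatenated inputs; b is the matching bias (shapes assumed).
    """
    z = T @ np.concatenate([x_t, h_prev, d_t]) + b
    i_t, f_t, o_t, g_t = np.split(z, 4)          # input, forget, output gates and candidate
    i_t, f_t, o_t = sigmoid(i_t), sigmoid(f_t), sigmoid(o_t)
    c_t = f_t * c_prev + i_t * np.tanh(g_t)      # memory cell state information of the step
    h_t = o_t * np.tanh(c_t)                     # implicit state information of the step
    return h_t, c_t
```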
  • The target data dt may be the first annotation vector set, or may be a context vector (Context Vector) determined by an attention model based on the first annotation vector set and the implicit state information of the previous first timing step.
  • The attention model can be used to determine which region of the target image is attended to at the first timing step; that is, it calculates a weight for each annotation vector, and a higher weight indicates that the corresponding annotation vector receives more attention.
  • Specifically, the LSTM network model may be an LSTM network model provided with an attention model; after the first annotation vector set and the implicit state information of the previous first timing step are obtained, a context vector may be determined by the attention model based on the first annotation vector set and the implicit state information of the previous first timing step, and the context vector is used as the target data.
  • Specifically, the attention model can calculate the similarity ei between any annotation vector ai and ht-1, and then calculate the attention weight of ai based on ei.
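  • A sketch of the attention computation described above: each annotation vector ai is scored against ht-1, the scores are normalized into weights, and the weighted sum of the annotation vectors gives the context vector used as the target data dt. The bilinear scoring function is an assumption; the text above only states that a similarity ei and a weight are computed.

```python
import numpy as np

def attention_context(A, h_prev, W_att):
    """Compute attention weights and the context vector used as target data d_t.

    A:      annotation vectors stacked as rows, shape (k, d)
    h_prev: implicit state information of the previous first timing step, shape (d,)
    W_att:  scoring matrix of the attention model (assumed), shape (d, d)
    """
    e = A @ (W_att @ h_prev)        # similarity e_i between each a_i and h_{t-1}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()            # attention weights; a larger weight means a_i is attended to more
    context = alpha @ A             # context vector: weighted sum of the annotation vectors
    return context, alpha
```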
  • 3) Combine the output data of all the M first timing steps to obtain the description sentence of the target image.
  • the output data of each first timing step is usually a word, and then the M words output by the M first timing steps are combined to obtain a description sentence of the target image.
  • For example, the output data of the M first timing steps may be the words "boy", "give", "girl", "send", and "flower", respectively, and the description sentence of the target image is then "the boy sends the girl a flower".
  • To obtain a first guiding network model capable of accurately generating guiding information based on the annotation vector set of the target image, before feature extraction is performed on the target image by the encoder to obtain the feature vector and the first annotation vector set, the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder may be combined to obtain a first cascaded network model, and the first cascaded network model is then trained by a gradient descent method based on a plurality of sample images and their description sentences to obtain the encoder, the first guiding network model, and the decoder.
  • Specifically, the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder may first be connected in the manner of FIG. 3 or FIG. 4 to construct an image recognition system capable of processing an image to obtain its description sentence, and the image recognition system is then trained based on the plurality of sample images and their description sentences. In the process of training the image recognition system, the first to-be-trained guiding network model is trained, so that it can adaptively learn the guiding information during training, ensuring that the generated guiding information becomes more and more accurate.
  • During training, a multi-label margin loss may be used as the loss function of the first to-be-trained guiding network model, and a stochastic gradient descent method based on this loss function is used to adjust the model parameters of the first to-be-trained guiding network model to obtain the first guiding network model.
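  • A hedged PyTorch sketch of this training step, using torch.nn.MultiLabelMarginLoss and stochastic gradient descent. The guiding-network output is scored against target word indices here only to show how such a loss could be wired up; the actual loss targets and the joint training of the full cascaded model are not specified in this excerpt.

```python
import torch
import torch.nn as nn
import torch.optim as optim

vocab_size, feat_dim = 1000, 512                     # assumed sizes
guide_net = nn.Linear(feat_dim, vocab_size)          # stand-in for the to-be-trained guiding network model

criterion = nn.MultiLabelMarginLoss()                # multi-label margin loss
optimizer = optim.SGD(guide_net.parameters(), lr=0.01)   # stochastic gradient descent

# one illustrative batch: pooled annotation features and, per sample image, the
# indices of words appearing in its description sentence (padded with -1)
features = torch.randn(4, feat_dim)
targets = torch.full((4, vocab_size), -1, dtype=torch.long)
targets[:, :3] = torch.tensor([[5, 17, 42], [7, 8, 9], [1, 2, 3], [10, 11, 12]])

loss = criterion(guide_net(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```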
  • training can be performed using an annotated training set, which is a collection of ⁇ sample images, description statements> pairs, such as the MSCOCO data set (a common data set).
  • the first to be trained encoder may be an untrained encoder or a pre-trained encoder, which is not limited in this embodiment of the present application.
  • For example, the first to-be-trained encoder can adopt a CNN model pre-trained on ImageNet (a computer vision recognition project that is currently the world's largest image recognition database), and the CNN model may be an Inception V3 model, a ResNet model, or a VGG model (all of which are CNN models).
  • By using a pre-trained encoder as the first to-be-trained encoder, the training efficiency of the entire first cascaded network model can be improved, and thus the training efficiency of the first guiding network model can be improved.
  • the process of identifying the target image, obtaining the description sentence of the target image, and the process of training the guiding network model may be performed on the same terminal, or may be performed on different terminals. This embodiment of the present application does not limit this.
  • In the second implementation manner, a second annotation vector set and second initial input data are determined by the reviewer based on the first guiding information, the first annotation vector set, and the first initial input data; second guiding information is generated by the second guiding network model based on the second annotation vector set; and, based on the second guiding information, the second annotation vector set and the second initial input data are decoded by the decoder to obtain the description sentence of the target image.
  • In the embodiments of the present application, a guiding network model is added between the encoder and the decoder, so the guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model is trained on annotation vector sets of sample images and can adaptively learn, during training, how to accurately generate guiding information according to the annotation vector set of an image, the guiding information generated by the guiding network model has high accuracy and can accurately guide the decoding process for the image, which improves the quality of the generated description sentence.
  • FIG. 8 is a flowchart of another image recognition method according to an embodiment of the present application, which is applied to a terminal. Referring to Figure 8, the method includes:
  • Step 201 Perform feature extraction on the target image to be identified by the encoder to obtain a feature vector and a first annotation vector set.
  • Step 202 Perform initialization processing on the feature vector to obtain first initial input data.
  • Step 203 Generate first guiding information by using the first guiding network model based on the first annotation vector set.
  • Step 204 Determine, according to the first guiding information, the first annotation vector set and the first initial input data, the second annotation vector set and the second initial input data by the reviewer.
  • The decoder and the reviewer generally adopt an RNN model; of course, other models may also be used.
  • The reviewer is used to further mine the interaction between the global features and the local features extracted from the image by the encoder, and to generate initial input data for the decoder, that is, the second initial input data, based on this interaction, so as to improve the performance of the decoder and thereby the quality of the generated description sentence.
  • In this implementation, the first initial input data refers to the input data to be input into the reviewer and is used to indicate the initial state of the reviewer; it specifically includes first initial implicit state information, which indicates the initial state of the hidden layer of the reviewer, and first initial memory unit state information, which indicates the initial state of the memory unit of the reviewer.
  • The second initial input data refers to the input data to be input into the decoder and is used to indicate the initial state of the decoder; it specifically includes second initial implicit state information, which indicates the initial state of the hidden layer of the decoder, and second initial memory cell state information, which indicates the initial state of the memory unit of the decoder.
  • determining, by the reviewer, the second annotation vector set and the second initial input data may include the following steps 1)-3):
  • 1) For each second timing step of N second timing steps performed by the second RNN model, determine the input data of the second timing step based on the first guiding information.
  • Here, N refers to the number of times the second RNN model cyclically processes the input data, N is a positive integer, and each second timing step is one processing step of the second RNN model on the input data.
  • the input data of the second timing step may be determined by the following formula (6):
  • where t is the second timing step, x't is the input data of the second timing step, E' is a word embedding matrix and a model parameter of the second RNN model, Q' is the seventh matrix and a model parameter of the second RNN model, and v' is the first guiding information received by the reviewer.
  • the output data of the second timing step may include implicit state information and memory cell state information.
  • If the second timing step is the first of the N second timing steps, the output data of its previous second timing step is determined based on the first initial input data.
  • 2) Process the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step by the second RNN model to obtain the output data of the second timing step.
  • The output data of the second timing step may be determined based on the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step by referring to the method described above for determining the output data of the first timing step.
  • 3) The output data of the last second timing step may be determined as the second initial input data; for example, the implicit state information and the memory unit state information of the last second timing step may be determined as the initial implicit state information and the initial memory cell state information of the decoder.
  • the set of implicit state information of all the timing steps in the N second timing steps may be determined as the second set of annotation vectors.
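  • Putting steps 1)-3) together, the following sketch runs a reviewer implemented as an LSTM cell for N second timing steps: each step consumes input data derived from the guiding information, the implicit states of all steps form the second annotation vector set, and the final state pair becomes the second initial input data for the decoder. How the first annotation vector set enters each step (for example via attention, as in the decoder) is omitted and would be an additional assumption.

```python
import torch
import torch.nn as nn

d_in, d_hid, N = 64, 128, 8                         # assumed dimensions and number of second timing steps
reviewer = nn.LSTMCell(d_in, d_hid)                 # stand-in for the reviewer's recurrent unit
Q_prime = torch.randn(d_in, d_hid)                  # "seventh matrix" (assumed shape)
v_prime = torch.randn(d_hid)                        # guiding information fed to the reviewer

h = torch.zeros(1, d_hid)                           # first initial implicit state information
c = torch.zeros(1, d_hid)                           # first initial memory cell state information

hidden_states = []
for _ in range(N):                                  # N second timing steps
    x_t = (Q_prime @ v_prime).unsqueeze(0)          # step input from the guiding information
                                                    # (annotation-set / attention terms omitted)
    h, c = reviewer(x_t, (h, c))
    hidden_states.append(h)

second_annotation_vectors = torch.cat(hidden_states, dim=0)   # second annotation vector set (N, d_hid)
second_initial_input_data = (h, c)                            # second initial input data for the decoder
```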
  • Step 205: Generate second guiding information by using the second guiding network model based on the second annotation vector set, where the second guiding network model is configured to generate guiding information according to an annotation vector set.
  • The second guiding information may be generated by the second guiding network model based on the second annotation vector set by referring to the method of generating the first guiding information by the first guiding network model based on the first annotation vector set in step 103 of the embodiment of FIG. 7; for the specific implementation, refer to the description of step 103, which is not repeated here.
  • The second guiding network model may be obtained by training on sample images together with the first guiding network model, and the guiding information can be learned automatically during training; therefore, the guiding information generated by the first guiding network model and the second guiding network model has high accuracy and can accurately guide the decoding process of the decoder, thereby improving the quality of the generated description sentence of the target image.
  • Step 206: Decode the second annotation vector set and the second initial input data by the decoder based on the second guiding information to obtain a description sentence of the target image.
  • Specifically, the second annotation vector set and the second initial input data may be decoded by the decoder based on the second guiding information to obtain the description sentence of the target image, by referring to the method in step 104 of the embodiment of FIG. 7 in which the first annotation vector set and the first initial input data are decoded by the decoder based on the first guiding information.
  • To obtain the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder before feature extraction is performed on the target image by the encoder to obtain the feature vector and the first annotation vector set, a second to-be-trained encoder, a second to-be-trained guiding network model, a to-be-trained reviewer, a third to-be-trained guiding network model, and a second to-be-trained decoder may be trained.
  • Specifically, the second to-be-trained encoder, the second to-be-trained guiding network model, the to-be-trained reviewer, the third to-be-trained guiding network model, and the second to-be-trained decoder may first be connected in the manner of FIG. 5 or FIG. 6 to construct an image recognition system capable of processing an image to obtain its description sentence, and the image recognition system is then trained based on the plurality of sample images and their description sentences. In the process of training the image recognition system, the second to-be-trained guiding network model and the third to-be-trained guiding network model are trained, so that they can adaptively learn the guiding information during training, ensuring that the generated guiding information becomes more and more accurate.
  • the second to-be-trained encoder may be an untrained encoder or a pre-trained encoder, and the to-be-trained reviewer may be an untrained reviewer or a pre-trained reviewer;
  • this is not limited in the embodiments of this application.
  • using a pre-trained encoder as the second to-be-trained encoder, or a pre-trained reviewer as the to-be-trained reviewer, to train the first guiding network model and the second guiding network model can improve
  • the training efficiency of the entire second cascade network model, thereby improving the training efficiency of the first guiding network model and the second guiding network model.
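  • A minimal training sketch follows, assuming the five components have already been combined into a single model `cascade` that maps an image and a partial caption to per-word logits; the data loader, padding index, and the use of teacher forcing with a cross-entropy loss are assumptions made for illustration, not requirements of this application.

```python
import torch
import torch.nn as nn

def train_cascade(cascade, dataloader, epochs=10, lr=1e-3, pad_idx=0):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    optimizer = torch.optim.SGD(cascade.parameters(), lr=lr)   # (stochastic) gradient descent
    for _ in range(epochs):
        for images, captions in dataloader:             # captions: (batch, T) word ids
            logits = cascade(images, captions[:, :-1])   # predict each next word
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```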
  • the process of recognizing the target image to obtain the description statement of the target image and the process of training the guiding network models may be performed on the same terminal or on different terminals;
  • this is not limited in the embodiments of this application.
  • in the embodiments of this application, a guiding network model is added between the encoder and the decoder.
  • after the annotation vector set is extracted from the image, the guiding information may be generated by the guiding network model based on the annotation vector set; because the guiding network model is trained on sample images,
  • the guiding information can be adaptively learned during the training process, so the guiding information generated by the guiding network model has high accuracy and can accurately guide the generation of the description statement of the image, thereby improving the quality of the generated description statement.
  • further, by adding a reviewer between the encoder and the decoder, the interaction between the local features and the global features of the target image can be further mined by the reviewer, so that the generated second annotation vector set and second initial input data can more accurately indicate the characteristics of the target image; this further improves the system performance of the image recognition system, thereby improving the quality of the generated description statement.
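  • The overall data flow of this embodiment can be summarized by the following wiring sketch; the component interfaces shown here (what the encoder returns, how the initial states are passed between components) are assumptions made only to keep the example self-contained.

```python
def describe_image(image, encoder, guide_net_1, reviewer, guide_net_2, decoder):
    # Assumed interface: the encoder returns the feature vector, the first
    # annotation vector set, and the first initial input data (h0, c0)
    # obtained by initializing the feature vector.
    feature_vec, annotations, (h0, c0) = encoder(image)
    v1 = guide_net_1(annotations)                       # first guiding information
    second_annotations, (h1, c1) = reviewer(annotations, h0, c0, v1)
    v2 = guide_net_2(second_annotations)                # second guiding information
    words = decoder(second_annotations, h1, c1, v2)     # M first timing steps
    return " ".join(words)                              # description statement
```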
  • FIG. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application, and the apparatus may be a terminal.
  • the device includes:
  • the extraction module 301 is configured to perform, by the encoder, feature extraction on the to-be-recognized target image to obtain a feature vector and a first annotation vector set;
  • the processing module 302 is configured to perform initialization processing on the feature vector to obtain first initial input data.
  • a generating module 303 configured to generate, according to the first set of annotation vectors, first guiding information by using a first guiding network model, where the first guiding network model is configured to generate guiding information according to the annotation vector set of any image;
  • the determining module 304 is configured to determine, by the decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
  • the generating module 303 includes:
  • the first linear transformation unit 3031 is configured to perform linear transformation on the first annotation vector set based on the first matrix formed by the model parameters in the first guidance network model to obtain a second matrix;
  • the first determining unit 3032 is configured to determine the first guiding information based on a maximum value of each row in the second matrix.
  • the first guiding network model is configured to generate guiding information according to an annotation vector set and attribute information of any image, the attribute information being used to indicate a probability of predicting a word appearing in a description sentence of the image;
  • the generating module 303 includes:
  • a processing unit 3033 configured to use the target image as an input of a multi-instance model, and process the target image by using the multi-instance model to obtain attribute information of the target image;
  • a second linear transformation unit 3034 configured to perform linear transformation on the first annotation vector set based on a third matrix formed by the model parameters in the first guiding network model, to obtain a fourth matrix;
  • a first generating unit 3035 configured to generate a fifth matrix based on the fourth matrix and attribute information of the target image
  • the second determining unit 3036 is configured to determine the first guiding information based on a maximum value of each row in the fifth matrix.
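  • The computation described by units 3033-3036 above can be sketched as follows: the attribute vector produced by the multi-instance model is stacked together with the linearly transformed annotation vectors before the per-dimension maximum is taken. The function name, tensor shapes, and orientation of the matrices are illustrative assumptions.

```python
import torch

def guidance_with_attributes(annotations, P, attributes):
    # annotations: (k, d) first annotation vector set, P: (m, d) third matrix,
    # attributes:  (m,) attribute information e from the multi-instance model
    transformed = annotations @ P.t()                                # fourth matrix, (k, m)
    stacked = torch.cat([attributes.unsqueeze(0), transformed], 0)   # fifth matrix, (k+1, m)
    v, _ = stacked.max(dim=0)                                        # per-dimension maximum
    return v                                                         # first guiding information
```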
  • the determining module 304 is configured to:
  • the first annotation vector set and the first initial input data are decoded by the decoder to obtain a description statement of the target image.
  • the determining module 304 includes:
  • a third determining unit 3041 configured to: when the decoder adopts a first recurrent neural network (RNN) model and the first RNN model is used to perform M first timing steps, determine, for each first timing step performed by the first RNN model, input data of the first timing step based on the first guiding information;
  • where M is the number of times the first RNN model cyclically processes input data, M is a positive integer, and each first timing step is a processing step of the first RNN model on input data;
  • a fourth determining unit 3042 configured to determine output data of the first timing step based on the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step;
  • where, when the first timing step is the first of the M first timing steps, the output data of its previous first timing step is determined based on the first initial input data;
  • the fifth determining unit 3043 is configured to determine a description statement of the target image based on all output data of the M first timing steps.
  • the third determining unit 3041 is configured to:
  • the input data of the first timing step is determined by the following formula:
  • x_t = E·y_t + Q·v
  • where t is the first timing step, x_t is the input data of the first timing step, E is the word embedding matrix and is a model parameter of the first RNN model, y_t is the one-hot vector of the word corresponding to the first timing step, the word corresponding to the first timing step being determined based on the output data of the previous first timing step, Q is the sixth matrix and is a model parameter of the first RNN model, and v is the first guiding information.
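  • A sketch of one decoding step using this formula is given below; an embedding lookup is used in place of the explicit product of E with the one-hot vector y_t, and the vocabulary size, dimensions, and greedy word selection are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GuidedDecoderStep(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512, guide_dim=512):
        super().__init__()
        self.E = nn.Embedding(vocab_size, embed_dim)          # word embedding matrix E
        self.Q = nn.Linear(guide_dim, embed_dim, bias=False)  # sixth matrix Q
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, h, c, v):
        # prev_word: (batch,) id of the word from the previous first timing step
        x_t = self.E(prev_word) + self.Q(v)      # x_t = E*y_t + Q*v
        h, c = self.cell(x_t, (h, c))            # output data of the first timing step
        next_word = self.out(h).argmax(dim=-1)   # greedy choice of the next word
        return next_word, h, c
```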
  • the apparatus further includes:
  • a first combination module 305 configured to combine the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder to obtain a first cascade network model;
  • a first training module 306 configured to train the first cascade network model by using a gradient descent method based on a plurality of sample images and the description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, and the decoder.
  • the determining module 304 includes:
  • the sixth determining unit 3044 is configured to determine, according to the first guiding information, the first set of annotation vectors, and the first initial input data, the second annotation vector set and the second initial input data by the reviewer;
  • a second generating unit 3045 configured to generate, according to the second set of annotation vectors, second guiding information by using a second guiding network model, where the second guiding network model is obtained by training sample images;
  • a decoding unit 3046 configured to decode the second annotation vector set and the second initial input data by the decoder based on the second guiding information to obtain a description statement of the target image.
  • the sixth determining unit 3044 is configured to:
  • when the reviewer adopts a second RNN model and the second RNN model is used to perform N second timing steps, determine, for each second timing step performed by the second RNN model, input data of the second timing step based on the first guiding information;
  • where N is the number of times the second RNN model cyclically processes input data, N is a positive integer, and each second timing step is a processing step of the second RNN model on input data;
  • determine output data of the second timing step based on the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step; where, when the second timing step is the first of the N second timing steps, the output data of its previous second timing step is determined based on the first initial input data;
  • determine the second initial input data based on the output data of the last of the N second timing steps; and
  • determine the second annotation vector set based on all of the output data of the N second timing steps.
  • the apparatus further includes:
  • a second combination module 307 configured to combine the second to-be-trained encoder, the second to-be-trained guiding network model, the to-be-trained reviewer, the third to-be-trained guiding network model, and the second to-be-trained decoder to obtain a second cascade network model;
  • a second training module 308 configured to train the second cascade network model by using a gradient descent method based on a plurality of sample images and the description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder.
  • in the embodiments of this application, a guiding network model is added between the encoder and the decoder.
  • after the annotation vector set is extracted from the image, the guiding information may be generated by the guiding network model based on the annotation vector set; because the guiding network model is trained on the annotation vector sets of sample images, it can adaptively learn, during the training process, how to accurately generate guiding information according to the annotation vector set of an image, so the guiding information generated by the guiding network model has high accuracy and can accurately guide the generation of the description statement of the image, which improves the quality of the generated description statement.
  • FIG. 16 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application.
  • the terminal 400 may include a communication unit 410, a memory 420 including one or more computer-readable storage media, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a WiFi (Wireless Fidelity) module 470, a processor 480 including one or more processing cores, a power supply 490, and other components.
  • it will be understood by those skilled in the art that the terminal structure shown in FIG. 16 does not constitute a limitation on the terminal, and the terminal may include more or fewer components than those illustrated, a combination of certain components, or a different arrangement of components.
  • the communication unit 410 can be used for transmitting and receiving information and receiving and transmitting signals during a call.
  • the communication unit 410 can be an RF (Radio Frequency) circuit, a router, a modem, or the like.
  • an RF circuit used as the communication unit includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like.
  • communication unit 410 can also communicate with the network and other devices via wireless communication.
  • the wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System of Mobile communication), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access). , Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.
  • the memory 420 can be used to store software programs and modules, and the processor 480 executes various functional applications and data processing by running software programs and modules stored in the memory 420.
  • the memory 420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the terminal 400 (such as audio data and an address book) and the like.
  • memory 420 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 420 may also include a memory controller to provide access to memory 420 by processor 480 and input unit 430.
  • the input unit 430 can be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
  • input unit 430 can include touch-sensitive surface 431 as well as other input devices 432.
  • the touch-sensitive surface 431, also referred to as a touch display screen or a touch pad, can collect touch operations of the user on or near it (for example, operations performed by the user on or near the touch-sensitive surface 431 by using a finger, a stylus, or any other suitable object or accessory), and drive a corresponding connection apparatus according to a preset program.
  • the touch-sensitive surface 431 can include two portions of a touch detection device and a touch controller.
  • the touch detection device detects the touch position of the user, detects a signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends the coordinates
  • to the processor 480, and can also receive and execute commands sent by the processor 480.
  • the touch sensitive surface 431 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 430 can also include other input devices 432.
  • other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • Display unit 440 can be used to display information entered by the user or information provided to the user and various graphical user interfaces of terminal 400, which can be constructed from graphics, text, icons, video, and any combination thereof.
  • the display unit 440 may include a display panel 441.
  • the display panel 441 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
  • the touch-sensitive surface 431 can cover the display panel 441; after the touch-sensitive surface 431 detects a touch operation on or near it, the operation is transmitted to the processor 480 to determine the type of the touch event, and the processor 480 then provides a corresponding visual output on the display panel 441 according to the type of the touch event.
  • although the touch-sensitive surface 431 and the display panel 441 are implemented as two separate components to implement the input and output functions, in some embodiments the touch-sensitive surface 431 can be integrated with the display panel 441 to implement the input and output functions.
  • Terminal 400 may also include at least one type of sensor 450, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 441 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 441 and/or the backlight when the terminal 400 moves to the ear.
  • as one type of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally on three axes), and can detect the magnitude and direction of gravity when stationary; it can be used in applications for recognizing the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and in vibration-recognition related functions (such as a pedometer and tapping).
  • as for other sensors that may also be configured on the terminal 400, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, details are not described herein again.
  • the audio circuit 460, the speaker 461, and the microphone 462 can provide an audio interface between the user and the terminal 400.
  • the audio circuit 460 can convert received audio data into an electrical signal and transmit it to the speaker 461, and the speaker 461 converts the electrical signal into a sound signal for output; on the other hand, the microphone 462 converts a collected sound signal into an electrical signal, which the audio circuit 460 receives and converts into audio data; after being processed by the processor 480, the audio data is sent, for example, to another terminal via the communication unit 410, or output to the memory 420 for further processing.
  • the audio circuit 460 may also include an earbud jack to provide communication of the peripheral earphones with the terminal 400.
  • the terminal may be configured with a wireless communication unit 470, which may be a WIFI module.
  • WIFI belongs to the short-range wireless transmission technology, and the terminal 400 can help the user to send and receive emails, browse webpages, and access streaming media through the wireless communication unit 470, which provides wireless broadband Internet access for users.
  • although the wireless communication unit 470 is shown in the figure, it can be understood that it is not an essential part of the terminal 400 and may be omitted as needed without changing the essence of the invention.
  • the processor 480 is the control center of the terminal 400, and connects various parts of the entire mobile phone by using various interfaces and lines; by running or executing the software programs and/or modules stored in the memory 420 and invoking the data stored in the memory 420, the processor 480 performs the various functions of the terminal 400 and processes data, thereby performing overall monitoring of the mobile phone.
  • the processor 480 may include one or more processing cores; preferably, the processor 480 may integrate an application processor and a modem processor, where the application processor mainly processes an operating system, a user interface, an application, and the like.
  • the modem processor primarily handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 480.
  • the terminal 400 also includes a power source 490 (such as a battery) that supplies power to the various components.
  • the power source can be logically connected to the processor 480 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
  • the power source 490 may also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the terminal 400 may further include a camera, a Bluetooth module, and the like, and details are not described herein.
  • in this embodiment, the terminal includes a processor and a memory, and the memory further stores at least one instruction, at least one program, a code set, or an instruction set, where the instruction, the program, the code set, or the instruction set is
  • loaded and executed by the processor to implement the image recognition method described in the embodiment of FIG. 7 or FIG. 8.
  • in another embodiment, a computer-readable storage medium is further provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, where the instruction, the program, the code set,
  • or the instruction set is loaded and executed by a processor to implement the image recognition method described in the embodiment of FIG. 7 or FIG. 8.
  • a person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

一种图像识别方法、终端及存储介质,属于机器学习领域。所述方法包括:通过编码器对待识别的图像进行特征提取,得到特征向量和第一标注向量集合(101);对该特征向量进行初始化处理,得到第一初始输入数据(102);基于该第一标注向量集合,通过第一引导网络模型生成第一引导信息,该第一引导网络模型用于根据任一图像的标注向量集合生成引导信息(103);基于该第一引导信息、该第一标注向量集合和该第一初始输入数据,通过解码器确定该图像的描述语句(104)。本方法在编码器和解码器之间增加了能够根据任一图像的标注向量集合生成引导信息的引导网络模型,因此通过该引导网络模型生成的引导信息较为准确,能够对编码过程进行准确引导,提高了生成描述语句的质量。

Description

图像识别方法、终端及存储介质
本申请要求于2017年09月11日提交的申请号为201710814187.2、发明名称为“图像识别方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及机器学习领域,特别涉及一种图像识别方法、终端及存储介质。
背景技术
随着科技的发展,以及人们对便捷的人机交互方式的需求,机器学习在图像识别领域得到了广泛应用。例如,在早期的儿童教育、图像检索和盲人导航等场景中,人们通常希望机器能够自动对图像进行识别,得到能够准确描述图像内容的描述语句,即将图像翻译成自然语言,以便通过自然语言快速理解图像或者对图像进行分类。
目前,图像识别的系统框架通常包括编码器(Encoder)和解码器(Decoder),基于该系统框架,相关技术中提出了一种图像识别方法,包括:首先,通过编码器对图像进行特征提取,得到特征向量和标注向量(Annotation Vectors)集合,其中,特征向量是对图像进行全局特征提取得到,标注向量集合是对图像进行局部特征提取得到。然后,对特征向量进行初始化处理,得到初始输入数据,该初始输入数据用于指示解码器的初始状态,通常包括初始的隐含状态(Hidden State)信息和初始的记忆单元(Memory Cell)状态信息。之后,从图像中提取人为设计的特定信息作为引导信息,并基于该引导信息,通过解码器对该标注向量集合和初始输入数据进行解码,得到图像的描述语句。其中,该引导信息用于对编码器的编码过程进行引导,以提高生成描述语句的质量,使得所生成的描述语句能够较为准确地描述图像且符合语义。
发明内容
本申请实施例提供了一种图像识别方法、终端及存储介质,能够解决相关技术中存在的通过人为设计的特定引导信息不能准确生成图像的描述语句,导致生成的描述语句的质量较低的问题。所述技术方案如下:
第一方面,提供了一种图像识别方法,所述方法由终端执行,所述方法包括:
通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
对所述特征向量进行初始化处理,得到第一初始输入数据;
基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,所述第一引导网络模型用于根据任一图像的标识向量集合生成引导信息;
基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图像的描述语句。
第二方面,提供了一种图像识别装置,所述装置包括:
提取模块,用于通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
处理模块,用于对所述特征向量进行初始化处理,得到第一初始输入数据;
生成模块,用于基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,所述第一引导网络模型用于根据任一图像的标识向量集合生成引导信息;
确定模块,用于基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图像的描述语句。
第三方面,提供了一种终端,所述终端包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
对所述特征向量进行初始化处理,得到第一初始输入数据;
基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,所述第一引导网络模型用于根据任一图像的标注向量集合生成引导信息;
基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图像的描述语句。
第四方面,提供了一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由处理器加载并执行以实现如第一方面所述的图像识别方法。
本申请实施例提供的技术方案带来的有益效果是:
本申请实施例中,在编码器和解码器之间增加了引导网络模型,从目标图像中提取标注向量集合之后,可以基于该标注向量集合通过该引导网络模型生成引导信息,由于该引导网络模型能够根据任一图像的标注向量集合生成该图像的引导信息,因此,通过该引导网络模型所生成的引导信息能够更适用目标图像的描述语句的生成过程,准确度较高,从而能够对目标图像的编码过程进行准确引导,提高了生成描述语句的质量。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种RNN模型的逻辑结构示意图;
图2是本申请实施例提供的一种LSTM模型的逻辑结构示意图;
图3是本申请实施例提供的一种图像识别系统的结构示意图;
图4是本申请实施例提供的另一种图像识别系统的结构示意图;
图5是本申请实施例提供的又一种图像识别系统的结构示意图;
图6是本申请实施例提供的又一种图像识别系统的结构示意图;
图7是本申请实施例提供的一种图像识别方法流程图;
图8是本申请实施例提供的另一种图像识别方法流程图;
图9是本申请实施例提供的一种图像识别装置的结构示意图;
图10是本申请实施例提供的一种生成模块303的结构示意图;
图11是本申请实施例提供的另一种生成模块303的结构示意图;
图12是本申请实施例提供的一种确定模块304的结构示意图;
图13是本申请实施例提供的另一种图像识别装置的结构示意图;
图14是本申请实施例提供的另一种确定模块304的结构示意图;
图15是本申请实施例提供的又一种图像识别装置的结构示意图;
图16是本申请实施例提供的一种终端400的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
在对本申请实施例进行详细地解释说明之前,先对本申请实施例涉及的名词进行解释说明。
编码器
编码器用于对图像进行编码生成向量,编码器通常采用CNN(Convolutional Neural Networks,卷积神经网络)模型。
解码器
解码器用于对编码器生成的向量进行解码,以将编码器生成的向量翻译成图像的描述语句,解码器通常采用RNN(Recurrent Neural Network,循环神经网络)模型。
引导信息
引导信息是对图像进行处理得到的信息,通常表示为向量,能够作为解码器输入的一部分来对解码过程进行引导。在解码器中引入引导信息可以提高解码器的性能,保证解码器能够生成更好的描述语句,提高生成描述语句的质量。
CNN模型
CNN模型是指在传统的多层神经网络的基础上发展起来的一种针对图像分类和识别的神经网络模型,CNN模型通常包括多个卷积层和至少一个全连接层,能够对图像进行特征提取。
RNN模型
由于传统的神经网络没有记忆功能,也即,对于传统的神经网络而言,其输入为独立的没有上下文关联的数据。但是实际应用中,输入通常为一些有明显上下文特征的序列化输入,比如需要预测描述语句中的下一个词语,此时神经网络的输出必须依赖上一次的输入。也即,要求神经网络应具有记忆功能,而RNN模型即为一种节点定向连接成环且具有记忆功能的神经网络,可以利用内部的记忆功能循环处理输入数据。
图1是本申请实施例提供的一种RNN模型的逻辑结构示意图,如图1左侧所示,该RNN模型包括输入层、隐含层和输出层三层结构,且隐含层为环形结构。其中,输入层和隐含层相连,隐含层和输出层相连。
为了便于说明该RNN模型的功能,将图1左侧所示的RNN模型的结构按照时序进行展开,可以得到如图1右侧所示的结构。由于RNN模型的输入层接收到的输入数据为按照一定时间序列排序的数据,也即输入层接收到的输入数据为序列数据,为了便于说明,将该序列数据标记为x 1、x 2、…、x i、…、x n,将该序列数据中的各个数据分别对应的时刻标记为t 1、t 2、…、t i、…、t n,将对x 1、x 2、…、、x i、…、x n分别进行处理得到的输出数据标记为f 1、f 2、…、f i、…、f n,而RNN模型按照时序对各个输入数据依次进行处理的步骤可以称为时序步骤。其中,n为RNN模型循环处理输入数据的次数。
如图1右侧所示,在展开之后的RNN模型中,t 1时刻输入层接收到的输入数据为x 1,并将x 1传输至隐含层,隐含层对x 1进行处理,并将处理后的数据传输至输出层,得到t 1时刻的输出数据f 1。t 2时刻输入层接收到的输入数据为x 2,并将x 2传输至隐含层,此时隐含层根据t 1时刻的输出数据f 1对x 2进行处理,并将处理后的数据传输至输出层,得到t 2时刻的输出数据f 2。也即,在任意时刻t i,隐含层除了接收到t i时刻输入层传输的输入数据x i,还接收到t i-1时刻的输出数据f i-1,并根据f i-1对x i进行处理,得到t i时刻的输出数据f i
LSTM(Long Short-Term Memory,长短期记忆)网络模型
LSTM网络模型是一种特殊的RNN模型,能够处理和预测时间序列中间隔和延迟相对较长的重要事件。LSTM网络模型包括LSTM单元,LSTM单元设置有输入门、遗忘门和输出门,在每个时序步骤可以基于设置的输入门、遗忘门和输出门对输入数据进行处理。
图2是本申请实施例提供的一种LSTM网络模型的逻辑结构示意图,如图2左侧所示,该LSTM网络模型包括LSTM单元,且LSTM单元为环形结构,对于LSTM单元执行的任一时序步骤t来说,该LSTM单元可以对时序步骤t的输入数据x t和上一个时序步骤t-1的输出数据f t-1进行处理,得到时序步骤t的输出数据f t
如图2右侧所示,在按照时序展开之后的LSTM网络模型中,LSTM单元接收到时序步骤t 1的输入数据x 1之后,可以对x 1进行处理得到时序步骤t 1的输出数据f 1,然后将f 1再输入LSTM单元,LSTM单元接收到时序步骤t 2的输入数据x 2之后,可以对f 1和x 2进行处理,得到时序步骤t 2的输出数据f 2,直至基于时序步骤t n的输入数据x n和时序步骤t n-1的输出数据f n-1得到时序步骤t n的输出数据f n为止。其中,n为LSTM网络模型循环处理输入数据的次数。
审阅网络(Review-net)
审阅网络是一种基于编码器-解码器框架的图像识别网络,包括审阅器(reviewer)和解码器。审阅器和解码器通常都采用CNN模型。审阅器可以进一步挖掘编码器从图像中提取的全局特征和局部特征之间的交互关系,并基于全局特征和局部特征之间的交互关系为解码器生成初始输入数据,以提高解码器的性能。
接下来对本申请实施例的应用场景予以说明。
本申请实施例可以应用于早期的儿童教育、图像检索、盲人阅读或聊天系统等场景中,在这些场景中通常需要将图像自动翻译成自然语言。
例如,为了提高幼龄儿童的看图识物能力,可以利用本申请实施例提供的图像识别方法,将幼龄儿童看到的图像翻译成对应的描述语句,然后将描述语句转换成语音播放出来,以便幼龄儿童能够结合图像和语音学习图像内容。
再例如,对于数据库中存储的大量图像,可以利用本申请实施例提供的图像识别方法,将图像翻译成对应的描述语句,以便根据图像的描述语句对图像进行准确分类,或者根据图像的描述语句对图像进行准确检索。
再例如,对于盲人待识别的一张图像来说,可以先将这张图像翻译成对应的描述语句,然后将描述语句转换成语音播放出来,以便盲人通过听到的语音识别图像,或者,将描述语句转换成盲文,以便盲人通过阅读盲文识别图像等。
再例如,在聊天系统中,可以将聊天窗口中的图像翻译成对应的描述语句,并对描述语句进行显示。
需要说明的是,本申请实施例仅是以上述几种应用场景为例进行说明,而实际应用中,本申请实施例提供的图像识别方法还可以应用于其他场景中,本申请实施例在此不做一一列举。
接下来,对本申请实施例涉及的系统架构进行介绍。
图3是本申请实施例提供的一种图像识别系统的结构示意图,如图3所示,该图像识别系统包括编码器10、第一引导网络模型20和解码器30。
其中,编码器10用于对待识别的目标图像进行编码,即对目标图像进行特征提取,得到特征向量和第一标注向量集合。特征向量用于指示目标图像的全局特征,第一标注向量集合用于指示目标图像的局部特征。
对于第一标注向量集合,编码器10可以将其分别输出给解码器30和第一引导网络模型20。对于特征向量,编码器10可以对其进行初始化处理,得到第一初始输入数据,然后将第一初始输入数据输出给解码器30;或者,编码器10也可以将特征向量输出给其他模型,由其他模型对目标编码器10输出的特征向量进行初始化处理,得到第一初始输入数据,并将第一初始输入数据输出给解码器30。
其中,第一引导网络模型20用于基于编码器10输出的第一标注向量集合生成第一引导信息,然后将第一引导信息输出给解码器30,且该第一引导网络模型是通过样本图像的标注向量集合训练得到。
其中,解码器30用于基于第一引导信息、第一标注向量集合和第一初始输入数据确定该目标图像的描述语句。
由上可知,图3所示的图像识别系统与相关技术相比,在编码器和解码器之间增加了引导网络模型,由于该引导网络模型能够根据任一图像的标注向量集合生成该图像的描述语句,因此,与人为设计的引导信息相比,通过该引导网络模型所生成的引导信息能够更适用目标图像的描述语句的生成过程,准确度较高,从而能够对图像的编码过程进行准确引导,从而提高了生成描述语句的质量。
图4是本申请实施例提供的另一种图像识别系统的结构示意图,如图4所 示,该图像识别系统包括编码器10、第一引导网络模型20、解码器30和多示例模型40。
其中,图4与图3中的编码器10和解码器30的作用相同,具体描述可以参考图3,在此不再详细赘述。
其中,多示例模型40用于对待识别的目标图像进行处理,得到目标图像的属性信息,该属性信息用于指示该目标图像的描述语句中预测出现的词语的概率,并将目标图像的属性信息输出给第一引导网络模型20。
其中,第一引导网络模型20用于基于编码器10输出的第一标注向量集合和多示例模型40输出的目标图像的属性信息生成第一引导信息。
图4中,通过在第一引导网络模型20之前增加多示例模型40,使得第一引导网络模型20可以根据目标图像的第一标注向量集合和属性信息综合确定第一引导信息,进一步提高了所生成的第一引导信息的准确性。
图5是本申请实施例提供的又一种图像识别系统的结构示意图,如图5所示,该图像识别系统包括编码器10、第一引导网络模型20、审阅器50、第二引导网络模型60和解码器30。
其中,图5与图3中编码器10的作用相同,具体描述可以参考图3,在此不再详细赘述。
其中,第一引导网络模型20用于基于编码器10输入的第一标注向量集合生成第一引导信息,并将第一引导信息输出给审阅器50。
其中,审阅器50用于基于第一初始输入数据、第一标注向量集合和第一引导信息确定第二标注向量集合和第二初始输入数据,并将第二标注向量集合和第二初始输入数据输出给解码器30,以及将第二标注向量集合输出给第二引导网络模型60。第二初始输入数据为解码器30的初始输入数据,用于指示解码器30的初始状态,具体可以包括初始的隐含状态信息和初始的记忆单元状态信息。
其中,第二引导网络模型60用于基于第二标注向量集合生成第二引导信息,并将第二引导信息输出给解码器30,且该第二引导网络模型也是通过样本图像训练得到。
其中,解码器30用于基于第二引导信息,对第二标注向量集合和第二初始输入数据进行解码,得到该目标图像的描述语句。
图5中,通过在编码器和解码器之间增加审阅器,可以通过审阅器进一步 挖掘目标图像的局部特征和全局特征的交互关系,使得生成的第二标注向量集合和第二初始输入数据能够更准确地指示目标图像的特征,进一步提高了图像识别系统的系统性能,进而提高了生成描述语句的质量。
图6是本申请实施例提供的又一种图像识别系统的结构示意图,如图6所示,该图像识别系统包括编码器10、第一引导网络模型20、审阅器50、第二引导网络模型60、解码器30和多示例模型40。
其中,图6与图5中编码器10、审阅器50和解码器30的作用相同,具体描述可以参考图5,在此不再赘述。
其中,多示例模型40用于对待识别的目标图像进行处理,得到目标图像的属性信息,并将目标图像的属性信息分别输出给第一引导网络模型20和第二引导网络模型60。
其中,第一引导网络模型20用于基于编码器10输出的第一标注向量集合和多示例模型40输出的目标图像的属性信息生成第一引导信息,并将第一引导信息输出给审阅器50。
其中,第二引导网络模型60用于基于审阅器50输出的第二标注向量集合和多示例模型40输出的目标图像的属性信息生成第二引导信息,并将第二引导信息输出给解码器30,以便编码器30基于第二引导信息,对第二标注向量集合和第二初始输入数据进行编码,得到目标图像的描述语句。
图6中,通过在第一引导网络模型20和第二引导网络模型60之前增加多示例模型40,使得第一引导网络模型20和第二引导网络模型60均可以根据目标图像的属性信息和标注向量集合综合确定引导信息,进一步提高了所生成的引导信息的准确性。
需要说明的是,上述图3-图6所示的图像识别系统均可以基于多个样本图像和多个样本图像的描述语句训练得到,也即是,可以通过训练得到上述编码器、第一引导网络模型、审阅器、第二引导网络模型和解码器,使得第一引导网络模型和第二引导网络模型可以在训练的过程中自适应的学习如何生成准确的引导信息,从而提高生成引导信息的准确性。
接下来,将结合上述图3-图6所示图像识别系统的结构示意图,对本申请实施例提供的图像识别方法进行详细介绍。图7是本申请实施例提供的一种图 像识别方法流程图,该方法可以由终端执行,该终端可以为手机、平板电脑或计算机等,该终端可以包括上述图像识别系统,例如可以通过安装的软件承载上述图像识别系统。参见图7,该方法包括:
步骤101:通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合。
在对待识别的目标图像进行识别时,可以先将目标图像输入编码器,通过编码器对目标图像进行特征提取,分别得到目标图像的特征向量和第一标注向量集合。
具体地,可以通过编码器对目标图像进行全局特征提取,得到特征向量,通过编码器对目标图像进行局部特征提取,得到标注向量集合。其中,特征向量用于指示目标图像的全局特征,第二标识向量集合中的标注向量用于指示目标图像的局部特征。
可选地,编码器可以采用CNN模型,当编码器采用CNN模型对目标图像进行特征提取时,该特征向量可以通过CNN模型的最后一个全连接层提取得到,该第二标注向量集合可以通过CNN模型的最后一个卷积层提取得到。
步骤102:对特征向量进行初始化处理,得到第一初始输入数据。
其中,第一初始输入数据是指待输入给编码器的下一个处理模型的初始输入数据,用于指示下一个处理模型的初始状态,该下一个处理模型可以为解码器或者审阅器。第一初始输入数据可以包括第一初始隐含状态信息和第一初始记忆单元状态信息,第一初始隐含状态信息用于指示下一个处理模型的隐含层的初始状态,第一初始记忆单元状态信息用于指示下一个处理模型的记忆单元的初始状态。
具体地,可以对特征向量进行线性变换等初始化处理,得到第一初始输入数据。而且,可以通过编码器对该特征向量进行初始化处理,得到第一初始输入数据,也可以通过其他模型对编码器输出的特征向量进行初始化处理,得到第一初始输入数据,本申请实施例对此不做限定。
例如,该编码器可以包括RNN模型和初始化模型,RNN模型用于对目标图像进行特征提取,初始化模型用于对特征向量进行初始化处理,该编码器通过RNN模型对图像进行特征提取得到特征向量之后,可以再通过初始化模型对特征向量进行初始化处理,得到第一初始输入数据。
或者,编码器也可以仅用于对目标图像进行特征提取,并在编码器之后增加初始化模型,该初始化模型用于对特征向量进行初始化处理,通过编码器对目标图像进行特征提取得到特征向量之后,可以将特征向量输出给该初始化模型,然后通过该初始化模型对该特征向量进行初始化处理,得到第一初始输入数据。
步骤103:基于第一标注向量集合,通过第一引导网络模型生成第一引导信息,该第一引导网络模型用于根据任一图像的标注向量集合生成引导信息。
具体地,基于第一标注向量集合,通过第一引导网络模型生成第一引导信息可以包括以下两种方式实现:
第一种实现方式:基于第一引导网络模型中的模型参数构成的第一矩阵对第一标注向量集合进行线性变换,得到第二矩阵;基于第二矩阵中每一行的最大值确定该第一引导信息。
其中,第一引导网络模型可以根据样本图像的标注向量集合训练得到。在一个实施例中,可以将图3中的各个模型变换为待训练的模型,然后基于多个样本图像和多个样本图像的描述语句对变换后的图像识别系统进行训练,则在训练的过程中,待训练编码器即可分别从多个样本图像中提取标注向量,并输出给待训练引导网络模型进行训练,如此,对整个图像识别系统训练完成之后,即可将待训练引导网络模型训练为第一引导网络模型。
其中,待训练编码器可以为未训练过的编码器,也可以为预训练好的编码器,本申请实施例对此不做限定。通过使用预训练好的编码器对待训练引导网络模型进行训练,可以提高整个图像识别系统的训练效率,进而提高其中的待训练引导网络模型的训练效率。
其中,第一标注向量集合也是矩阵形式,第一矩阵为第一引导网络模型的模型参数构成的且用于对第一标注向量集合进行线性变换的矩阵。具体地,可以将第一标注向量集合与第一矩阵进行相乘,以对第一标注向量集合进行线性变换,得到第二矩阵。
具体地,基于第二矩阵中每一行的最大值确定该第一引导信息包括:选取第二矩阵中每一行的最大值,然后将选取的最大值按照行数不变的原则组成列数为1的矩阵,并将组成的矩阵确定为该第一引导信息。
例如,假设第一标注向量集合为
Figure PCTCN2018105009-appb-000001
a 1-a k为从目标图像 中提取的各个标注向量,第一矩阵为P 1,第一引导信息为v,则可以采用如下公式(1)确定第一引导信息:
v=max([P 1a 1,P 1a 2,…,P 1a k])                (1)
其中,max函数是指对待处理的矩阵的每一行取最大值,并组成行数不变且列数为1的矩阵。
第二种实现方式:当该第一引导网络模型用于根据任一图像的标注向量集合和属性信息生成引导信息时,可以将该目标图像作为多示例模型的输入,通过该多示例模型对该目标图像进行处理,得到该目标图像的属性信息;基于该第一引导网络模型中的模型参数构成的第三矩阵对该第一标注向量集合进行线性变换,得到第四矩阵;基于该第四矩阵和该目标图像的属性信息,生成第五矩阵;基于该第五矩阵中每一行的最大值确定该第一引导信息。其中,样本图像的属性信息用于指该样本图像的描述语句中预测出现的词语的概率。
其中,该多示例模型是通过多个样本图像和该多个样本图像的描述语句训练得到的,且能够输出样本图像的属性信息的模型,也即是,该多示例模型能够对图像的描述语句中可能出现的词语的概率进行预测。示例的,该属性信息可以为MIL(Multi-instance learning,多示例学习)信息等。
其中,该第一引导网络模型可以通过样本图像的标注向量集合和属性信息进行训练得到。例如,可以将图4的各个模型变换为待训练的模型,然后基于多个样本图像和多个样本图像的描述语句对变换后的图像识别系统进行训练,则在训练的过程中,待训练编码器可以从样本图像中提取标注向量并输出给待训练引导网络模型,且待训练多示例模型可以对图像进行处理得到属性信息,并将属性信息输出给待训练引导网络模型,待训练的引导网络模型即可基于样本图像的标注向量和属性信息进行训练,如此,对整个图像识别系统训练完成之后,即可将待训练引导网络模型训练为该第一引导网络模型。
其中,待训练编码器可以为未训练过的编码器,也可以为预训练好的编码器;待训练多示例模型可以为未训练过的多示例模型,也可以为预训练好的多示例模型,本申请实施例对此不做限定。通过使用预训练好的编码器和/或预训练好的多示例模型来对待训练引导网络模型进行训练,可以提高整个图像识别系统的训练效率,进而提高其中的待训练引导网络模型的训练效率。
其中,第一标注向量集合也是矩阵形式,第三矩阵为该第一引导网络模型 的模型参数构成的且用于对第一标注向量集合进行线性变换的矩阵。具体地,可以将第一标注向量集合与第三矩阵进行相乘,以对第一标注向量集合进行线性变换,得到第四矩阵,然后基于第四矩阵和目标图像的属性信息,生成第五矩阵。
其中,基于第五矩阵中每一行的最大值确定第一引导信息包括:选取第五矩阵中每一行的最大值,然后将选取的最大值按照行数不变的原则组成列数为1的矩阵,并将组成的矩阵确定为该第一引导信息。
具体地,假设第一标注向量集合为
Figure PCTCN2018105009-appb-000002
a 1-a k为从目标图像中提取的各个标注向量,第三矩阵为P 2,目标图像的属性信息为e,第一引导信息为v,则可以采用如下公式(2)确定第一引导信息v:
v=max([e,P 2a 1,P 2a 2,…,P 2a k])          (2)
其中,max函数是指对待处理的矩阵的每一行取最大值,并组成行数不变且列数为1的矩阵。
由上可知,第一引导网络模型可以通过学习得到,也即是,可以通过多个样本图像和多个样本图像的描述语句训练得到,且在训练的过程中可以自动学习引导信息,因此,通过该第一引导网络模型生成第一引导信息的准确度较高,所生成的第一引导信息能够对编码的编码过程进行准确引导,进而可以提高生成目标图像的描述语句的质量。
步骤104:基于第一引导信息、第一标注向量集合和第一初始输入数据,通过解码器确定该目标图像的描述语句。
本申请实施例中,基于第一引导信息、第一标注向量集合和第一初始输入数据,通过解码器确定该目标图像的描述语句可以包括以下两种实现方式:
第一种实现方式:基于第一引导信息,通过解码器对第一标注向量集合和第一初始输入数据进行解码,得到该目标图像的描述语句。
可选地,该解码器通常采用RNN模型,比如可以采用LSTM网络模型。
具体地,基于第一引导信息,通过解码器对第一标注向量集合和第一初始输入数据进行解码,得到该目标图像的描述语句可以包括以下步骤1)-3):
1)当该解码器采用第一RNN模型,且该第一RNN模型用于执行M个第一时序步骤时,对于该第一RNN模型执行的每个第一时序步骤,基于该第一目标引导信息确定该第一时序步骤的输入数据。
其中,所述M是指该第一RNN模型循环处理输入数据的次数,且该M为正整数,每个第一时序步骤为该第一RNN模型对输入数据的处理步骤。
其中,基于第一引导信息确定该第一时序步骤的输入数据可以包括基于该第一引导信息,通过以下公式(3)确定该第一时序步骤的输入数据:
x t=Ey t+Qv             (3)
其中,t为该第一时序步骤,x t为该第一时序步骤的输入数据,E为词语嵌入矩阵且为该第一RNN模型的模型参数,y t是该第一时序步骤对应的词语的独热one-hot向量,该第一时序步骤对应的词语是基于该第一时序步骤的上一个第一时序步骤的输出数据确定得到,Q为第六矩阵且为该第一RNN模型的模型参数,v为该第一引导信息。
2)基于该第一时序步骤的输入数据、该第一标注向量集合和该第一时序步骤的上一个第一时序步骤的输出数据,确定该第一时序步骤的输出数据。
本申请实施例中,通过该第一RNN模型,对该第一时序步骤的输入数据、该第一标注向量集合和该第一时序步骤的上一个第一时序步骤的输出数据进行处理,即可得到该第一时序步骤的输出数据。
其中,该第一时序步骤的输出数据可以包括隐含状态信息和记忆单元状态信息。而且,当该第一时序步骤为该M个第一时序步骤中的第一个第一时序步骤时,该第一时序步骤的上一个第一时序步骤的输出数据是基于该第一初始输入数据确定得到。例如,当该第一初始输入数据包括第一初始隐含状态信息h 0和第一初始记忆单元状态信息c 0,且该第一时序步骤为第一个第一时序步骤时,则该第一时序步骤的上一个第一时序步骤的输出数据即为h 0和c 0
本申请实施例中,为了提高所生成的描述语句的质量,所使用的第一RNN模型可以为LSTM网络模型。以LSTM网络模型为例,基于该第一时序步骤的输入数据、该第一标注向量集合和该第一时序步骤的上一个第一时序步骤的输出数据,确定该第一时序步骤的输出数据可以抽象表示为如下公式(4):
Figure PCTCN2018105009-appb-000003
其中,t为该第一时序步骤,x t为该第一时序步骤的输入数据,h t-1为该第一时序步骤的上一个时序步骤的隐含状态信息,
Figure PCTCN2018105009-appb-000004
为第一标注向量集合,h t为该第一时序步骤的隐含状态信息,LSTM表示LSTM网络模型的处理过程。
具体地,LSTM网络模型的处理过程可以采用如下公式表示:
Figure PCTCN2018105009-appb-000005
其中,i t、f t、c t和0 t分别为该第一时序步骤在输入门、遗忘门、记忆门和输出门的输出数据,σ是LSTM网络模型的激活函数,如sigmoid函数,tanh()是双曲正切函数,T是用于线性变换的矩阵,x t为该第一时序步骤的输入数据,h t-1为该第一时序步骤的上一个时序步骤的隐含状态信息,d t为基于第一标注向量集合确定得到的目标数据,c t为该第一时序步骤的记忆单元状态信息,c t-1为该第一时序步骤的上一个第一时序步骤的记忆单元状态信息,h t为该第一时序步骤的隐含状态信息。
其中,目标数据d t可以为第一标注向量集合,也可以为上下文向量(Context Vector),该上下文向量是基于第一标注向量集合和该第一时序步骤的上一个时序步骤的隐含状态信息,通过注意力模型确定得到的。
其中,注意力模型可以用来确定上一个第一时序步骤注意的是目标图像的哪个区域,也即是可以为
Figure PCTCN2018105009-appb-000006
中的每个标注向量计算一个权重值,标注向量的权重越高表示该标注向量越被注意。
在一种可能的实现方式中,该LSTM网络模型可以为设置有注意力模型的LSTM网络模型,在得到第一标注向量集合和该第一时序步骤的上一个时序步骤的隐含状态信息之后,可以基于该第一标注向量集合和该第一时序步骤的上一个时序步骤的隐含状态信息,通过注意力模型确定上下文向量,并将该上下文向量作为该目标数据。
具体地,该注意力模型可以计算
Figure PCTCN2018105009-appb-000007
中任一个标注向量a i和h t-1的相似度e i,然后计算a i的注意力的权重
Figure PCTCN2018105009-appb-000008
之后使用每个标注向量的权重即可生成上下文向量z t=∑w ia i
3)基于该M个第一时序步骤的所有输出数据,确定该目标图像的描述语句。
具体地,可以对该M个第一时序步骤中所有第一时序步骤的输出数据进行组合处理,得到该目标图像的描述语句。实际应用中,每个第一时序步骤的输出数据通常是一个词语,然后将该M个第一时序步骤输出的M个词语进行组合,即可得到该目标图像的描述语句。
以图3中所示的目标图像为例,该M个第一时序步骤的所有输出数据可能 分别为男孩、给、女孩、送、花,则该目标图像的描述语句即为“男孩给女孩送花”。
进一步地,为了得到上述能够基于目标图像的标注向量集合准确生成引导信息的第一引导网络模型,在通过编码器对目标图像进行特征提取,得到特征向量和第一标注向量集合之前,还可以将第一待训练编码器、第一待训练引导网络模型和第一待训练解码器进行组合,得到第一级联网络模型,然后基于多个样本图像和该多个样本图像的描述语句,采用梯度下降法对该第一级联网络模型进行训练,得到该编码器、该第一引导网络模型和该解码器。
也即是,可以先将第一待训练编码器、第一待训练引导网络模型和第一待训练解码器按照图3或图4的连接方式构建成能够对图像进行处理,得到图像的描述语句的图像识别系统,然后基于多个样本图像和该多个样本图像的描述语句对该图像识别系统进行训练,在对图像识别系统进行训练的过程中,即可对其中的第一待训练引导网络模型进行训练,使得第一待训练引导网络模型能够在训练的过程中自适应地学习引导信息,保证生成的引导信息能够越来越准确。
其中,在训练第一待训练引导网络模型的过程中,可以使用Multi-label margin loss(基于间隔的多标记损失函数)作为该第一待训练引导网络模型的损失函数,并基于该损失函数采用随机梯度下降法对该第一待训练引导网络模型的模型参数进行调整,以得到该第一引导网络模型。
实际训练中,可以使用已标注的训练集进行训练,该训练集是<样本图像,描述语句>对的集合,比如MSCOCO数据集(一种常用数据集)等。
其中,第一待训练编码器可以为未训练过的编码器,也可以为预训练好的编码器,本申请实施例对此不做限定。例如,该第一待训练编码器可以采用在ImageNet(一个计算机视觉系统识别项目名称,是目前世界上图像识别最大的数据库)上预训练好的CNN模型,该CNN模型可以为inception V3模型(一种CNN模型)、Resnet模型(一种CNN模型)或者VGG模型(一种CNN模型)等。
通过使用预训练好的编码器作为第一待训练编码器来训练第一引导网络模型,可以提高整个第一级联网络模型的训练效率,进而提高其中的第一引导网络模型的训练效率。
需要说明的是,本申请实施例中,对目标图像进行识别,得到目标图像的描述语句的过程和对引导网络模型进行训练的过程可以在相同的终端上执行,也可以在不同的终端上执行,本申请实施例对此不做限定。
第二种实现方式:基于第一引导信息、第一标注向量集合和第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据;基于该第二标注向量集合,通过第二引导网络模型生成第二引导信息;基于该第二引导信息,通过该编码器对该第二标注向量集合和该第二初始输入数据进行编码,得到该目标图像的描述语句。
需要说明的是,该第二种实现方式将在下述图8实施例中进行详细说明,本申请实施例在此不做详细赘述。
本申请实施例中,在编码器和解码器之间增加了引导网络模型,从图像中提取标注向量集合之后,可以基于该标注向量集合通过该引导网络模型生成引导信息,由于该引导网络模型是通过样本图像的标注向量集合训练得到,可以在训练过程中自适应地学习如何根据图像的标注向量集合准确地生成引导信息,因此通过该引导网络模型所生成的引导信息准确度较高,能够对图像的编码过程进行准确引导,从而提高了生成描述语句的质量。
接下来将结合上述图5和图6所示的图像识别系统的结构示意图,对本申请实施例提供的图像识别方法进行详细介绍。图8是本申请实施例提供的另一种图像识别方法流程图,该方法应用于终端中。参见图8,该方法包括:
步骤201:通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合。
步骤202:对特征向量进行初始化处理,得到第一初始输入数据。
步骤203:基于第一标注向量集合,通过第一引导网络模型生成第一引导信息。
其中,步骤201-步骤203的具体实现方式可以参考上述步骤101-步骤103的相关描述,本申请实施例在此不再赘述。
步骤204:基于第一引导信息、第一标注向量集合和第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据。
本申请实施例中,解码器和审阅器通常均采用RNN模型,当然也可以采用 其他模型,本申请实施例对此不做限定。
其中,审阅器用于进一步挖掘编码器从图像中提取的全局特征和局部特征之间的交互关系,并基于全局特征和局部特征之间的交互关系为解码器生成初始输入数据,即第二初始输入数据,以提高解码器的性能,进而提高生成描述语句的质量。
其中,第一初始输入数据是指待输入给审阅器的输入数据,用于指示该审阅器的初始状态,具体可以包括第一初始隐含状态信息和第一初始记忆单元状态信息,第一初始隐含状态信息用于指示审阅器的隐含层的初始状态,第一初始记忆单元状态信息用于指示审阅器的记忆单元的初始状态。
其中,第二初始输入数据是指待输入给解码器的输入数据,用于指示该解码器的初始状态,具体可以包括第二初始隐含状态信息和第二初始记忆单元状态信息,第二初始隐含状态信息用于指示解码器的隐含层的初始状态,第二初始记忆单元状态信息用于指示解码器的记忆单元的初始状态。
具体地,基于该第一引导信息、该第一标注向量集合和该第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据可以包括如下步骤1)-3):
1)当该第一审阅器采用第二RNN模型,且该第二RNN模型用于执行N个第二时序步骤时,对于该第二RNN模型执行的每个第二时序步骤,基于该第一目标引导信息确定该第二时序步骤的输入数据。
其中,该N是指该第二RNN模型循环处理输入数据的次数,且该N为正整数,每个第二时序步骤为该第二RNN模型对输入数据的处理步骤。
具体地,可以基于该第二引导信息,通过以下公式(6)确定该第二时序步骤的输入数据:
x' t=E'y' t+Q'v'           (6)
其中,t为该第二时序步骤,x' t为该第二时序步骤的输入数据,E'为词语嵌入矩阵且为该第二RNN模型的模型参数,Q'为第七矩阵且为该第二RNN模型的模型参数,v'为该第二引导信息。
2)基于该第二时序步骤的输入数据、该第一标注向量集合和该第二时序步骤的上一个第二时序步骤的输出数据,确定该第二时序步骤的输出数据。
其中,该第二时序步骤的输出数据可以包括隐含状态信息和记忆单元状态 信息,当该第二时序步骤为该N个第二时序步骤中的第一个第二时序步骤时,该第二时序步骤的上一个第二时序步骤的输出数据是基于该第一初始输入数据确定得到。
本申请实施例中,通过该第二RNN模型,对该第二时序步骤的输入数据、该第二标注向量集合和该第二时序步骤的上一个第二时序步骤的输出数据进行处理,即可得到该第二时序步骤的输出数据。
具体地,可以按照上述基于该第一时序步骤的输入数据、该第一标注向量集合和该第一时序步骤的上一个第一时序步骤的输出数据,确定该第一时序步骤的输出数据的方法,基于该第二时序步骤的输入数据、该第一标注向量集合和该第二时序步骤的上一个第二时序步骤的输出数据,确定该第二时序步骤的输出数据,具体实现方式可以参考上述相关描述,在此不再详细赘述。
3)基于该N个第二时序步骤中最后一个第二时序步骤的输出数据,确定该第二初始输入数据。
具体地,可以将最后一个第二时序步骤的输出数据确定为该第二初始输入数据,例如,可以将最后一个第二时序步骤的隐含状态信息和记忆单元状态信息确定为该第二初始输入数据,即确定为该目标编码器的初始隐含状态信息和初始记忆单元状态信息。
4)基于该N个第二时序步骤的所有输出数据,确定该第二标注向量集合。
具体地,可以将该N个第二时序步骤中所有时序步骤的隐含状态信息的集合确定为该第二标注向量集合。
步骤205:基于该第二标注向量集合,通过第二目标引导网络模型生成第二引导信息,该第二引导网络模型用于根据标注向量集合生成引导信息。
具体地,可以按照上述图7实施例中步骤103所述的基于第一标注向量集合,通过第一引导网络模型生成第一引导信息的方法,基于第二标注向量集合,通过第二引导网络模型生成第二引导信息。具体实现方式可以参数上述步骤103的相关描述,此处不再详细赘述。
其中,第二引导网络模型可以与第一引导网络模型一起通过样本图像进行训练得到,且在训练的过程中可以自动学习引导信息,因此,通过该第一引导网络模型和第二引导网络模型生成的引导信息的准确度都较高,所生成的引导信息能够对编码的编码过程进行准确引导,进而可以提高生成目标图像的描述 语句的质量。
步骤206:基于该第二引导信息,通过该编码器对该第二标注向量集合和该第二初始输入数据进行编码,得到该目标图像的描述语句。
具体地,可以按照上述图7实施例中步骤104所述的基于第一引导信息,通过解码器对第一标注向量集合和第一初始输入数据进行解码,得到该目标图像的描述语句的方法,基于该第二引导信息,通过该编码器对该第二标注向量集合和该第二初始输入数据进行编码,得到该目标图像的描述语句。具体实现方式可以参考上述步骤104中第一种实现方式的相关描述,此处不再详细赘述。
进一步地,为了得到上述能够基于目标图像的第一标注向量集合准确生成第一引导信息的第一引导网络模型,以及基于第二标注向量集合准确生成第二引导信息的第二引导网络模型,在通过编码器对目标图像进行特征提取,得到特征向量和第一标注向量集合之前还可以将第二待训练编码器、第二待训练引导网络模型、待训练审阅器、第三待训练引导网络模型和第二待训练解码器进行组合,得到第二级联网络模型,然后基于多个样本图像和该多个样本图像的描述语句,采用梯度下降法对该第二级联网络模型进行训练,得到该编码器、该第一引导网络模型、该审阅器、该第二引导网络模型和该解码器。
也即是,可以先将第二待训练编码器、第二待训练引导网络模型、待训练审阅器、第三待训练引导网络模型和第二待训练解码器按照图5的连接的方式构建成能够对图像进行处理,得到图像的描述语句的图像识别系统,然后基于多个样本图像和该多个样本图像的描述语句对该图像识别系统进行训练,在对图像识别系统进行训练的过程中,即可对其中的第二待训练引导网络模型和第三待训练引导网络模型进行训练,使得第二待训练引导网络模型和第三待训练引导网络模型能够在训练的过程中自适应地学习引导信息,保证生成的引导信息能够越来越准确。
其中,第二待训练编码器可以为未训练过的编码器,也可以为预训练好的编码器,训练审阅器可以为未训练过的审阅器,也可以为预训练好的审阅器,本申请实施例对此不做限定。
需要说明的是,通过使用预训练好的编码器作为第二待训练编码器,或者使用预训练好的审阅器最为待训练审阅器来训练第一引导网络模型和第二引导网络模型,可以提高整个第二级联网络模型的训练效率,进而提高其中的第一 引导网络模型和第二引导网络模型的训练效率。
还需要说明的是,本申请实施例中,对目标图像进行识别,得到目标图像的描述语句的过程和对引导网络模型进行训练的过程可以在相同的终端上执行,也可以在不同的终端上执行,本申请实施例对此不做限定。
本申请实施例中,在编码器和解码器之间增加了引导网络模型,从图像中提取标注向量集合之后,可以基于该标注向量集合通过该引导网络模型生成引导信息,由于该引导网络模型是通过样本图像训练得到,可以在训练过程中自适应地学习引导信息,因此通过该引导网络模型所生成的引导信息准确度较高,能够对图像的编码过程进行准确引导,从而提高了生成描述语句的质量。
进一步地,通过在编码器和解码器之间增加审阅器,可以通过审阅器进一步挖掘目标图像的局部特征和全局特征的交互关系,使得生成的第二标注向量集合和第二初始输入数据能够更准确地指示目标图像的特征,进一步提高了图像识别系统的系统性能,进而提高了生成描述语句的质量。
图9是本申请实施例提供的一种图像识别装置的结构示意图,该装置可以为终端。参见图9,该装置包括:
提取模块301,用于通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
处理模块302,用于对该特征向量进行初始化处理,得到第一初始输入数据;
生成模块303,用于基于该第一标注向量集合,通过第一引导网络模型生成第一引导信息,该第一引导网络模型用于根据任一图像的标注向量集合生成引导信息;
确定模块304,用于基于该第一引导信息、该第一标注向量集合和该第一初始输入数据,通过解码器确定该目标图像的描述语句。
可选地,参见图10,该生成模块303包括:
第一线性变换单元3031,用于基于该第一引导网络模型中的模型参数构成的第一矩阵对该第一标注向量集合进行线性变换,得到第二矩阵;
第一确定单元3032,用于基于该第二矩阵中每一行的最大值确定该第一引导信息。
可选地,参见图11,该第一引导网络模型用于根据任一图像的标注向量集 合和属性信息生成引导信息,该属性信息用于指示该图像的描述语句中预测出现的词语的概率;
该生成模块303包括:
处理单元3033,用于将该目标图像作为多示例模型的输入,通过该多示例模型对该目标图像进行处理,得到该目标图像的属性信息;
第二线性变换单元3034,用于基于该第二引导网络模型中的模型参数构成的第三矩阵对该第一标注向量集合进行线性变换,得到第四矩阵;
第一生成单元3035,用于基于该第四矩阵和该目标图像的属性信息,生成第五矩阵;
第二确定单元3036,用于基于该第五矩阵中每一行的最大值确定该第一引导信息。
可选地,该确定模型304用于:
基于该第一引导信息,通过该解码器对该第一标注向量集合和该第一初始输入数据进行解码,得到该目标图像的描述语句。
可选地,参见图12,该确定模型304包括:
第三确定单元3041,用于当该解码器采用第一循环神经网络RNN模型,且该第一RNN模型用于执行M个第一时序步骤时,对于该第一RNN模型执行的每个第一时序步骤,基于该第一引导信息确定该第一时序步骤的输入数据;
其中,该M是指该第一RNN模型循环处理输入数据的次数,且该M为正整数,每个第一时序步骤为该第一RNN模型对输入数据的处理步骤;
第四确定单元3042,用于基于该第一时序步骤的输入数据、该第一标注向量集合和该第一时序步骤的上一个第一时序步骤的输出数据,确定该第一时序步骤的输出数据;
其中,当该第一时序步骤为该M个第一时序步骤中的第一个第一时序步骤时,该第一时序步骤的上一个第一时序步骤的输出数据是基于该第一初始输入数据确定得到;
第五确定单元3043,用于基于该M个第一时序步骤的所有输出数据,确定该目标图像的描述语句。
可选地,该第三确定单元3041用于:
基于该第一引导信息,通过以下公式确定该第一时序步骤的输入数据:
x t=Ey t+Qv
其中,t为该第一时序步骤,x t为该第一时序步骤的输入数据,E为词语嵌入矩阵且为该第一RNN模型的模型参数,y t是该第一时序步骤对应的词语的独热one-hot向量,该第一时序步骤对应的词语是基于该第一时序步骤的上一个第一时序步骤的输出数据确定得到,Q为第六矩阵且为该第一RNN模型的模型参数,v为该第一引导信息。
可选地,参见图13,该装置还包括:
第一组合模块305,用于将第一待训练编码器、第一待训练引导网络模型和第一待训练解码器进行组合,得到第一级联网络模型;
第一训练模块306,基于多个样本图像和该多个样本图像的描述语句,采用梯度下降法对该第一级联网络模型进行训练,得到该编码器、该第一引导网络模型和该解码器。
可选地,参见图14,该确定模型304包括:
第六确定单元3044,用于基于该第一引导信息、该第一标注向量集合和该第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据;
第二生成单元3045,用于基于该第二标注向量集合,通过第二引导网络模型生成第二引导信息,该第二引导网络模型是通过样本图像训练得到;
编码单元3046,用于基于该第二引导信息,通过该编码器对该第二标注向量集合和该第二初始输入数据进行编码,得到该目标图像的描述语句。
可选地,该第六确定单元3044用于:
当该第一审阅器采用第二RNN模型,且该第二RNN模型用于执行N个第二时序步骤时,对于该第二RNN模型执行的每个第二时序步骤,基于该第一目标引导信息确定该第二时序步骤的输入数据;
其中,该N是指该第二RNN模型循环处理输入数据的次数,且该N为正整数,每个第二时序步骤为该第二RNN模型对输入数据的处理步骤;
基于该第二时序步骤的输入数据、该第一标注向量集合和该第二时序步骤的上一个第二时序步骤的输出数据,确定该第二时序步骤的输出数据;
其中,当该第二时序步骤为该N个第二时序步骤中的第一个第二时序步骤时,该第二时序步骤的上一个第二时序步骤的输出数据是基于该第一初始输入数据确定得到;
基于该N个第二时序步骤中最后一个第二时序步骤的输出数据,确定该第二初始输入数据;
基于该N个第二时序步骤的所有输出数据,确定该第二标注向量集合。
可选地,参见图15,该装置还包括:
第二组合模块307,用于将第二待训练编码器、第二待训练引导网络模型、待训练审阅器、第三待训练引导网络模型和第二待训练解码器进行组合,得到第二级联网络模型;
第二训练模块308,用于基于多个样本图像和该多个样本图像的描述语句,采用梯度下降法对该第二级联网络模型进行训练,得到该编码器、该第一引导网络模型、该审阅器、该第二引导网络模型和该解码器。
本申请实施例中,在编码器和解码器之间增加了引导网络模型,从图像中提取标注向量集合之后,可以基于该标注向量集合通过该引导网络模型生成引导信息,由于该引导网络模型是通过样本图像的标注向量集合训练得到,可以在训练过程中自适应地学习如何根据图像的标注向量集合准确地生成引导信息,因此通过该引导网络模型所生成的引导信息准确度较高,能够对图像的编码过程进行准确引导,从而提高了生成描述语句的质量。
需要说明的是:上述实施例提供的图像识别装置在进行图像识别时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的图像识别装置与图像识别方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图16是本申请实施例提供的一种终端400的结构示意图。参见图16,终端400可以包括通信单元410、包括有一个或一个以上计算机可读存储介质的存储器420、输入单元430、显示单元440、传感器450、音频电路460、WIFI(Wireless Fidelity,无线保真)模块470、包括有一个或者一个以上处理核心的处理器480、以及电源490等部件。本领域技术人员可以理解,图16中示出的终端结构并不构成对终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
通信单元410可用于收发信息或通话过程中,信号的接收和发送,该通信单元410可以为RF(Radio Frequency,射频)电路、路由器、调制解调器、等网络通信设备。特别地,当通信单元410为RF电路时,将基站的下行信息接收后,交由一个或者一个以上处理器480处理;另外,将涉及上行的数据发送给基站。通常,作为通信单元的RF电路包括但不限于天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM)卡、收发信机、耦合器、LNA(Low Noise Amplifier,低噪声放大器)、双工器等。此外,通信单元410还可以通过无线通信与网络和其他设备通信。所述无线通信可以使用任一通信标准或协议,包括但不限于GSM(Global System of Mobile communication,全球移动通讯系统)、GPRS(General Packet Radio Service,通用分组无线服务)、CDMA(Code Division Multiple Access,码分多址)、WCDMA(Wideband Code Division Multiple Access,宽带码分多址)、LTE(Long Term Evolution,长期演进)、电子邮件、SMS(Short Messaging Service,短消息服务)等。存储器420可用于存储软件程序以及模块,处理器480通过运行存储在存储器420的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器420可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据终端400的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器420可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器420还可以包括存储器控制器,以提供处理器480和输入单元430对存储器420的访问。
输入单元430可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。优选地,输入单元430可包括触敏表面431以及其他输入设备432。触敏表面431,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面431上或在触敏表面431附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触敏表面431可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控 制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器480,并能接收处理器480发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面431。除了触敏表面431,输入单元430还可以包括其他输入设备432。优选地,其他输入设备432可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元440可用于显示由用户输入的信息或提供给用户的信息以及终端400的各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示单元440可包括显示面板441,可选的,可以采用LCD(Liquid Crystal Display,液晶显示器)、OLED(Organic Light-Emitting Diode,有机发光二极管)等形式来配置显示面板441。进一步的,触敏表面431可覆盖显示面板441,当触敏表面431检测到在其上或附近的触摸操作后,传送给处理器480以确定触摸事件的类型,随后处理器480根据触摸事件的类型在显示面板441上提供相应的视觉输出。虽然在图16中,触敏表面431与显示面板441是作为两个独立的部件来实现输入和输入功能,但是在某些实施例中,可以将触敏表面431与显示面板441集成而实现输入和输出功能。
终端400还可包括至少一种传感器450,比如光传感器、运动传感器以及其他传感器。光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板441的亮度,接近传感器可在终端400移动到耳边时,关闭显示面板441和/或背光。作为运动传感器的一种,重力加速度传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于终端400还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路460、扬声器461,传声器462可提供用户与终端400之间的音频接口。音频电路460可将接收到的音频数据转换后的电信号,传输到扬声器461,由扬声器461转换为声音信号输出;另一方面,传声器462将收集的声音信号转换为电信号,由音频电路460接收后转换为音频数据,再将音频数据输出处理器480处理后,经通信单元410以发送给比如另一终端,或者将音频数据输 出至存储器420以便进一步处理。音频电路460还可能包括耳塞插孔,以提供外设耳机与终端400的通信。
为了实现无线通信,该终端上可以配置有无线通信单元470,该无线通信单元470可以为WIFI模块。WIFI属于短距离无线传输技术,终端400通过无线通信单元470可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图中示出了无线通信单元470,但是可以理解的是,其并不属于终端400的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器480是终端400的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器420内的软件程序和/或模块,以及调用存储在存储器420内的数据,执行终端400的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器480可包括一个或多个处理核心;优选的,处理器480可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器480中。
终端400还包括给各个部件供电的电源490(比如电池),优选的,电源可以通过电源管理系统与处理器480逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源460还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
尽管未示出,终端400还可以包括摄像头、蓝牙模块等,在此不再赘述。
在本实施例中,终端包括处理器和存储器,存储器中还存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现上述图7或图8实施例所述的图像识别方法。
在另一实施例中,还提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由处理器加载并执行以实现上述图7或图8实施例所述的图像识别方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请实施例的较佳实施例,并不用以限制本申请实施例,凡在本申请实施例的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请实施例的保护范围之内。

Claims (20)

  1. 一种图像识别方法,所述方法由终端执行,其特征在于,所述方法包括:
    通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
    对所述特征向量进行初始化处理,得到第一初始输入数据;
    基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,所述第一引导网络模型用于根据任一图像的标注向量集合生成引导信息;
    基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图像的描述语句。
  2. 如权利要求1所述的方法,其特征在于,所述基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,包括:
    基于所述第一引导网络模型中的模型参数构成的第一矩阵,对所述第一标注向量集合进行线性变换,得到第二矩阵;
    基于所述第二矩阵中每一行的最大值确定所述第一引导信息。
  3. 如权利要求1所述的方法,其特征在于,所述第一引导网络模型用于根据任一图像的标注向量集合和属性信息生成引导信息,所述属性信息用于指示所述图像的描述语句中预测出现的词语的概率;
    所述基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,包括:
    将所述目标图像作为多示例模型的输入,通过所述多示例模型对所述目标图像进行处理,得到所述目标图像的属性信息;
    基于所述第一引导网络模型中的模型参数构成的第三矩阵,对所述第一标注向量集合进行线性变换,得到第四矩阵;
    基于所述第四矩阵和所述目标图像的属性信息,生成第五矩阵;
    基于所述第五矩阵中每一行的最大值确定所述第一引导信息。
  4. 如权利要求1所述的方法,其特征在于,所述基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图 像的描述语句,包括:
    基于所述第一引导信息,通过所述解码器对所述第一标注向量集合和所述第一初始输入数据进行解码,得到所述目标图像的描述语句。
  5. 如权利要求4所述的方法,其特征在于,所述基于所述第一引导信息,通过所述解码器对所述第一标注向量集合和所述第一初始输入数据进行解码,得到所述目标图像的描述语句,包括:
    当所述解码器采用第一循环神经网络RNN模型,且所述第一RNN模型用于执行M个第一时序步骤时,对于所述第一RNN模型执行的每个第一时序步骤,基于所述第一引导信息确定所述第一时序步骤的输入数据;
    其中,所述M是指所述第一RNN模型循环处理输入数据的次数,且所述M为正整数,每个第一时序步骤为所述第一RNN模型对输入数据的处理步骤;
    基于所述第一时序步骤的输入数据、所述第一标注向量集合和所述第一时序步骤的上一个第一时序步骤的输出数据,确定所述第一时序步骤的输出数据;
    其中,当所述第一时序步骤为所述M个第一时序步骤中的第一个第一时序步骤时,所述第一时序步骤的上一个第一时序步骤的输出数据是基于所述第一初始输入数据确定得到;
    基于所述M个第一时序步骤的所有输出数据,确定所述目标图像的描述语句。
  6. 如权利要求5所述的方法,其特征在于,所述基于所述第一引导信息确定所述第一时序步骤的输入数据,包括:
    基于所述第一引导信息,通过以下公式确定所述第一时序步骤的输入数据:
    x t=Ey t+Qv
    其中,t为所述第一时序步骤,x t为所述第一时序步骤的输入数据,E为词语嵌入矩阵且为所述第一RNN模型的模型参数,y t是所述第一时序步骤对应的词语的独热one-hot向量,所述第一时序步骤对应的词语是基于所述第一时序步骤的上一个第一时序步骤的输出数据确定得到,Q为第六矩阵且为所述第一RNN模型的模型参数,v为所述第一引导信息。
  7. 如权利要求1-6任一所述的方法,其特征在于,所述通过编码器对目标 图像进行特征提取,得到特征向量和第一标注向量集合之前,还包括:
    将第一待训练编码器、第一待训练引导网络模型和第一待训练解码器进行组合,得到第一级联网络模型;
    基于多个样本图像和所述多个样本图像的描述语句,采用梯度下降法对所述第一级联网络模型进行训练,得到所述编码器、所述第一引导网络模型和所述解码器。
  8. 如权利要求1所述的方法,其特征在于,所述基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过所述解码器确定所述目标图像的描述语句,包括:
    基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据;
    基于所述第二标注向量集合,通过第二引导网络模型生成第二引导信息,所述第二引导网络模型用于根据标注向量集合生成引导信息;
    基于所述第二引导信息,通过所述编码器对所述第二标注向量集合和所述第二初始输入数据进行编码,得到所述目标图像的描述语句。
  9. 如权利要求8所述的方法,其特征在于,所述基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据,包括:
    当所述第一审阅器采用第二RNN模型,且所述第二RNN模型用于执行N个第二时序步骤时,对于所述第二RNN模型执行的每个第二时序步骤,基于所述第一引导信息确定所述第二时序步骤的输入数据;
    其中,所述N是指所述第二RNN模型循环处理输入数据的次数,且所述N为正整数,每个第二时序步骤为所述第二RNN模型对输入数据的处理步骤;
    基于所述第二时序步骤的输入数据、所述第一标注向量集合和所述第二时序步骤的上一个第二时序步骤的输出数据,确定所述第二时序步骤的输出数据;
    其中,当所述第二时序步骤为所述N个第二时序步骤中的第一个第二时序步骤时,所述第二时序步骤的上一个第二时序步骤的输出数据是基于所述第一初始输入数据确定得到;
    基于所述N个第二时序步骤中最后一个第二时序步骤的输出数据,确定所 述第二初始输入数据;
    基于所述N个第二时序步骤的所有输出数据,确定所述第二标注向量集合。
  10. 如权利要求8或9所述的方法,其特征在于,所述通过编码器对目标图像进行特征提取,得到特征向量和第一标注向量集合之前,还包括:
    将第二待训练编码器、第二待训练引导网络模型、待训练审阅器、第三待训练引导网络模型和第二待训练解码器进行组合,得到第二级联网络模型;
    基于多个样本图像和所述多个样本图像的描述语句,采用梯度下降法对所述第二级联网络模型进行训练,得到所述编码器、所述第一引导网络模型、所述审阅器、所述第二引导网络模型和所述解码器。
  11. 一种终端,其特征在于,所述终端包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    通过编码器对待识别的目标图像进行特征提取,得到特征向量和第一标注向量集合;
    对所述特征向量进行初始化处理,得到第一初始输入数据;
    基于所述第一标注向量集合,通过第一引导网络模型生成第一引导信息,所述第一引导网络模型用于根据任一图像的标注向量集合生成引导信息;
    基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过解码器确定所述目标图像的描述语句。
  12. 如权利要求11所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    基于所述第一引导网络模型中的模型参数构成的第一矩阵,对所述第一标注向量集合进行线性变换,得到第二矩阵;
    基于所述第二矩阵中每一行的最大值确定所述第一引导信息。
  13. 如权利要求11所述的终端,其特征在于,所述第一引导网络模型用于根据任一图像的标注向量集合和属性信息生成引导信息,所述属性信息用于指示所述图像的描述语句中预测出现的词语的概率;
    所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    将所述目标图像作为多示例模型的输入,通过所述多示例模型对所述目标图像进行处理,得到所述目标图像的属性信息;
    基于所述第一引导网络模型中的模型参数构成的第三矩阵,对所述第一标注向量集合进行线性变换,得到第四矩阵;
    基于所述第四矩阵和所述目标图像的属性信息,生成第五矩阵;
    基于所述第五矩阵中每一行的最大值确定所述第一引导信息。
  14. 如权利要求11所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    基于所述第一引导信息,通过所述解码器对所述第一标注向量集合和所述第一初始输入数据进行解码,得到所述目标图像的描述语句。
  15. 如权利要求14所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    当所述第一审阅器采用第二RNN模型,且所述第二RNN模型用于执行N个第二时序步骤时,对于所述第二RNN模型执行的每个第二时序步骤,基于所述第一引导信息确定所述第二时序步骤的输入数据;
    其中,所述N是指所述第二RNN模型循环处理输入数据的次数,且所述N为正整数,每个第二时序步骤为所述第二RNN模型对输入数据的处理步骤;
    基于所述第二时序步骤的输入数据、所述第一标注向量集合和所述第二时序步骤的上一个第二时序步骤的输出数据,确定所述第二时序步骤的输出数据;
    其中,当所述第二时序步骤为所述N个第二时序步骤中的第一个第二时序步骤时,所述第二时序步骤的上一个第二时序步骤的输出数据是基于所述第一初始输入数据确定得到;
    基于所述N个第二时序步骤中最后一个第二时序步骤的输出数据,确定所述第二初始输入数据;
    基于所述N个第二时序步骤的所有输出数据,确定所述第二标注向量集合。
  16. 如权利要求11-15任一所述的终端,其特征在于,所述指令、所述程序、 所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    将第一待训练编码器、第一待训练引导网络模型和第一待训练解码器进行组合,得到第一级联网络模型;
    基于多个样本图像和所述多个样本图像的描述语句,采用梯度下降法对所述第一级联网络模型进行训练,得到所述编码器、所述第一引导网络模型和所述解码器。
  17. 如权利要求11所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    基于所述第一引导信息、所述第一标注向量集合和所述第一初始输入数据,通过审阅器确定第二标注向量集合和第二初始输入数据;
    基于所述第二标注向量集合,通过第二引导网络模型生成第二引导信息,所述第二引导网络模型用于根据标注向量集合生成引导信息;
    基于所述第二引导信息,通过所述编码器对所述第二标注向量集合和所述第二初始输入数据进行编码,得到所述目标图像的描述语句。
  18. 如权利要求17所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    当所述第一审阅器采用第二RNN模型,且所述第二RNN模型用于执行N个第二时序步骤时,对于所述第二RNN模型执行的每个第二时序步骤,基于所述第一引导信息确定所述第二时序步骤的输入数据;
    其中,所述N是指所述第二RNN模型循环处理输入数据的次数,且所述N为正整数,每个第二时序步骤为所述第二RNN模型对输入数据的处理步骤;
    基于所述第二时序步骤的输入数据、所述第一标注向量集合和所述第二时序步骤的上一个第二时序步骤的输出数据,确定所述第二时序步骤的输出数据;
    其中,当所述第二时序步骤为所述N个第二时序步骤中的第一个第二时序步骤时,所述第二时序步骤的上一个第二时序步骤的输出数据是基于所述第一初始输入数据确定得到;
    基于所述N个第二时序步骤中最后一个第二时序步骤的输出数据,确定所述第二初始输入数据;
    基于所述N个第二时序步骤的所有输出数据,确定所述第二标注向量集合。
  19. 如权利要求17或18所述的终端,其特征在于,所述指令、所述程序、所述代码集或所述指令集由所述处理器加载并执行以实现如下操作:
    将第二待训练编码器、第二待训练引导网络模型、待训练审阅器、第三待训练引导网络模型和第二待训练解码器进行组合,得到第二级联网络模型;
    基于多个样本图像和所述多个样本图像的描述语句,采用梯度下降法对所述第二级联网络模型进行训练,得到所述编码器、所述第一引导网络模型、所述审阅器、所述第二引导网络模型和所述解码器。
  20. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述指令、所述程序、所述代码集或所述指令集由处理器加载并执行以实现如权利要求1-10任一项所述的图像识别方法。
PCT/CN2018/105009 2017-09-11 2018-09-11 图像识别方法、终端及存储介质 WO2019047971A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2020514506A JP6972319B2 (ja) 2017-09-11 2018-09-11 画像認識方法、端末及び記憶媒体
KR1020197036824A KR102270394B1 (ko) 2017-09-11 2018-09-11 이미지를 인식하기 위한 방법, 단말, 및 저장 매체
EP18853742.7A EP3611663A4 (en) 2017-09-11 2018-09-11 IMAGE RECOGNITION PROCESS, TERMINAL AND STORAGE MEDIA
US16/552,738 US10956771B2 (en) 2017-09-11 2019-08-27 Image recognition method, terminal, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710814187.2 2017-09-11
CN201710814187.2A CN108304846B (zh) 2017-09-11 2017-09-11 图像识别方法、装置及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/552,738 Continuation US10956771B2 (en) 2017-09-11 2019-08-27 Image recognition method, terminal, and storage medium

Publications (1)

Publication Number Publication Date
WO2019047971A1 true WO2019047971A1 (zh) 2019-03-14

Family

ID=62869573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/105009 WO2019047971A1 (zh) 2017-09-11 2018-09-11 图像识别方法、终端及存储介质

Country Status (6)

Country Link
US (1) US10956771B2 (zh)
EP (1) EP3611663A4 (zh)
JP (1) JP6972319B2 (zh)
KR (1) KR102270394B1 (zh)
CN (2) CN110490213B (zh)
WO (1) WO2019047971A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102134893B1 (ko) * 2019-11-07 2020-07-16 국방과학연구소 사전 압축된 텍스트 데이터의 압축 방식을 식별하는 시스템 및 방법
CN112785494A (zh) * 2021-01-26 2021-05-11 网易(杭州)网络有限公司 一种三维模型构建方法、装置、电子设备和存储介质

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490213B (zh) * 2017-09-11 2021-10-29 腾讯科技(深圳)有限公司 图像识别方法、装置及存储介质
CN109146156B (zh) * 2018-08-03 2021-12-03 大连理工大学 一种用于预测充电桩系统充电量的方法
JP7415922B2 (ja) * 2018-10-19 2024-01-17 ソニーグループ株式会社 情報処理方法、情報処理装置及び情報処理プログラム
CN109559576B (zh) * 2018-11-16 2020-07-28 中南大学 一种儿童伴学机器人及其早教系统自学习方法
CN109495214B (zh) * 2018-11-26 2020-03-24 电子科技大学 基于一维Inception结构的信道编码类型识别方法
CN109902852A (zh) * 2018-11-28 2019-06-18 北京三快在线科技有限公司 商品组合方法、装置、电子设备及可读存储介质
US10726062B2 (en) * 2018-11-30 2020-07-28 Sony Interactive Entertainment Inc. System and method for converting image data into a natural language description
CN109670548B (zh) * 2018-12-20 2023-01-06 电子科技大学 基于改进lstm-cnn的多尺寸输入har算法
CN109711546B (zh) * 2018-12-21 2021-04-06 深圳市商汤科技有限公司 神经网络训练方法及装置、电子设备和存储介质
CN111476838A (zh) * 2019-01-23 2020-07-31 华为技术有限公司 图像分析方法以及系统
CN110009018B (zh) * 2019-03-25 2023-04-18 腾讯科技(深圳)有限公司 一种图像生成方法、装置以及相关设备
CN110222840B (zh) * 2019-05-17 2023-05-05 中山大学 一种基于注意力机制的集群资源预测方法和装置
CN110427870B (zh) * 2019-06-10 2024-06-18 腾讯医疗健康(深圳)有限公司 眼部图片识别方法、目标识别模型训练方法及装置
CN110478204A (zh) * 2019-07-25 2019-11-22 李高轩 一种结合图像识别的导盲眼镜及其构成的导盲系统
CN110517759B (zh) * 2019-08-29 2022-03-25 腾讯医疗健康(深圳)有限公司 一种待标注图像确定的方法、模型训练的方法及装置
CN111275110B (zh) * 2020-01-20 2023-06-09 北京百度网讯科技有限公司 图像描述的方法、装置、电子设备及存储介质
CN111310647A (zh) * 2020-02-12 2020-06-19 北京云住养科技有限公司 自动识别跌倒模型的生成方法和装置
US11093794B1 (en) * 2020-02-13 2021-08-17 United States Of America As Represented By The Secretary Of The Navy Noise-driven coupled dynamic pattern recognition device for low power applications
CN111753825A (zh) 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 图像描述生成方法、装置、系统、介质及电子设备
EP3916633A1 (de) * 2020-05-25 2021-12-01 Sick Ag Kamera und verfahren zum verarbeiten von bilddaten
CN111723729B (zh) * 2020-06-18 2022-08-05 四川千图禾科技有限公司 基于知识图谱的监控视频犬类姿态和行为智能识别方法
US11455146B2 (en) * 2020-06-22 2022-09-27 Bank Of America Corporation Generating a pseudo-code from a text summarization based on a convolutional neural network
CN111767727B (zh) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 数据处理方法及装置
WO2022006621A1 (en) * 2020-07-06 2022-01-13 Harrison-Ai Pty Ltd Method and system for automated generation of text captions from medical images
CN112016400B (zh) * 2020-08-04 2021-06-29 香港理工大学深圳研究院 一种基于深度学习的单类目标检测方法、设备及存储介质
CN112614175B (zh) * 2020-12-21 2024-09-06 滕州市东大矿业有限责任公司 基于特征去相关的用于封孔剂注射器的注射参数确定方法
CN112800247B (zh) * 2021-04-09 2021-06-18 华中科技大学 基于知识图谱共享的语义编/解码方法、设备和通信系统
CN113205051B (zh) * 2021-05-10 2022-01-25 中国科学院空天信息创新研究院 基于高空间分辨率遥感影像的储油罐提取方法
CN113569868B (zh) * 2021-06-11 2023-09-19 北京旷视科技有限公司 一种目标检测方法、装置及电子设备
CN113486868B (zh) * 2021-09-07 2022-02-11 中南大学 一种电机故障诊断方法及系统
CN113743517A (zh) * 2021-09-08 2021-12-03 Oppo广东移动通信有限公司 模型训练方法、图像深度预测方法及装置、设备、介质
CN114821560B (zh) * 2022-04-11 2024-08-02 深圳市星桐科技有限公司 文本识别方法和装置
CN116167990B (zh) * 2023-01-28 2024-06-25 阿里巴巴(中国)有限公司 基于图像的目标识别、神经网络模型处理方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8165354B1 (en) * 2008-03-18 2012-04-24 Google Inc. Face recognition with discriminative face alignment
CN106446782A (zh) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 图像识别方法及装置
CN106845411A (zh) * 2017-01-19 2017-06-13 清华大学 一种基于深度学习和概率图模型的视频描述生成方法
CN107038221A (zh) * 2017-03-22 2017-08-11 杭州电子科技大学 一种基于语义信息引导的视频内容描述方法
CN108304846A (zh) * 2017-09-11 2018-07-20 腾讯科技(深圳)有限公司 图像识别方法、装置及存储介质

Family Cites Families (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9743078B2 (en) * 2004-07-30 2017-08-22 Euclid Discoveries, Llc Standards-compliant model-based video encoding and decoding
RU2461977C2 (ru) * 2006-12-18 2012-09-20 Конинклейке Филипс Электроникс Н.В. Сжатие и снятие сжатия изображения
US8254444B2 (en) * 2007-05-14 2012-08-28 Samsung Electronics Co., Ltd. System and method for phase adaptive occlusion detection based on motion vector field in digital video
JPWO2009110160A1 (ja) * 2008-03-07 2011-07-14 Kabushiki Kaisha Toshiba Moving picture encoding/decoding method and apparatus
CN102577393B (zh) * 2009-10-20 2015-03-25 Sharp Kabushiki Kaisha Moving picture encoding device, moving picture decoding device, moving picture encoding/decoding system, moving picture encoding method and moving picture decoding method
US9369718B2 (en) * 2009-10-30 2016-06-14 Sun Patent Trust Decoding method, decoding apparatus, coding method, and coding apparatus using a quantization matrix
US9582431B2 (en) * 2010-03-22 2017-02-28 Seagate Technology Llc Storage address space to NVM address, span, and length mapping/converting
KR101420957B1 (ko) * 2010-03-31 2014-07-30 Mitsubishi Denki Kabushiki Kaisha Image encoding device, image decoding device, image encoding method and image decoding method
JP2012253482A (ja) * 2011-06-01 2012-12-20 Sony Corp Image processing apparatus and method, recording medium, and program
US8918320B2 (en) * 2012-01-03 2014-12-23 Nokia Corporation Methods, apparatuses and computer program products for joint use of speech and text-based features for sentiment detection
EP2842106B1 (en) * 2012-04-23 2019-11-13 Telecom Italia S.p.A. Method and system for image analysis
US9183460B2 (en) * 2012-11-30 2015-11-10 Google Inc. Detecting modified images
CN102982799A (zh) * 2012-12-20 2013-03-20 Institute of Automation, Chinese Academy of Sciences Speech recognition optimized decoding method incorporating guidance probability
US9349072B2 (en) * 2013-03-11 2016-05-24 Microsoft Technology Licensing, Llc Local feature based image compression
CN104918046B (zh) * 2014-03-13 2019-11-05 ZTE Corporation Local descriptor compression method and apparatus
US10909329B2 (en) 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
CN105139385B (zh) * 2015-08-12 2018-04-17 Xidian University Image visual saliency region detection method based on deep autoencoder reconstruction
ITUB20153724A1 (it) * 2015-09-18 2017-03-18 Sisvel Tech S R L Methods and apparatuses for encoding and decoding digital images or video streams
US10423874B2 (en) * 2015-10-02 2019-09-24 Baidu Usa Llc Intelligent image captioning
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN106548145A (zh) * 2016-10-31 2017-03-29 Beijing Xiaomi Mobile Software Co., Ltd. Image recognition method and apparatus
IT201600122898A1 (it) * 2016-12-02 2018-06-02 Ecole Polytechnique Fed Lausanne Epfl Methods and apparatuses for encoding and decoding digital images or video streams
US10783393B2 (en) * 2017-06-20 2020-09-22 Nvidia Corporation Semi-supervised learning for landmark localization
US11966839B2 (en) * 2017-10-25 2024-04-23 Deepmind Technologies Limited Auto-regressive neural network systems with a soft attention mechanism using support data patches
KR102174777B1 (ko) * 2018-01-23 2020-11-06 Nalbi Company Method and apparatus for processing an image to improve image quality
CN110072142B (zh) * 2018-01-24 2020-06-02 Tencent Technology (Shenzhen) Co., Ltd. Video description generation method and apparatus, video playback method and apparatus, and storage medium
US10671855B2 (en) * 2018-04-10 2020-06-02 Adobe Inc. Video object segmentation by reference-guided mask propagation
US10824909B2 (en) * 2018-05-15 2020-11-03 Toyota Research Institute, Inc. Systems and methods for conditional image translation
CN110163048B (zh) * 2018-07-10 2023-06-02 Tencent Technology (Shenzhen) Co., Ltd. Hand keypoint recognition model training method, recognition method and device
US20200104940A1 (en) * 2018-10-01 2020-04-02 Ramanathan Krishnan Artificial intelligence enabled assessment of damage to automobiles

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102134893B1 (ko) * 2019-11-07 2020-07-16 Agency for Defense Development System and method for identifying the compression scheme of pre-compressed text data
CN112785494A (zh) * 2021-01-26 2021-05-11 NetEase (Hangzhou) Network Co., Ltd. Three-dimensional model construction method, apparatus, electronic device and storage medium
CN112785494B (zh) * 2021-01-26 2023-06-16 NetEase (Hangzhou) Network Co., Ltd. Three-dimensional model construction method, apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN110490213B (zh) 2021-10-29
CN108304846A (zh) 2018-07-20
JP2020533696A (ja) 2020-11-19
CN110490213A (zh) 2019-11-22
US20190385004A1 (en) 2019-12-19
KR102270394B1 (ko) 2021-06-30
US10956771B2 (en) 2021-03-23
EP3611663A1 (en) 2020-02-19
EP3611663A4 (en) 2020-12-23
KR20200007022A (ko) 2020-01-21
JP6972319B2 (ja) 2021-11-24
CN108304846B (zh) 2021-10-22

Similar Documents

Publication Publication Date Title
WO2019047971A1 (zh) Image recognition method, terminal and storage medium
CN110599557B (zh) Image description generation method, model training method, device and storage medium
KR102646667B1 (ko) Method for locating an image region, model training method and related apparatus
US11416681B2 (en) Method and apparatus for determining a reply statement to a statement based on a sum of a probability of the reply statement being output in response to the statement and a second probability in which the statement is output in response to the statement and further based on a terminator
CN110472251B (zh) Translation model training method, sentence translation method, device and storage medium
CN110334360B (zh) Machine translation method and apparatus, electronic device and storage medium
WO2020103721A1 (zh) Information processing method, apparatus and storage medium
KR20190130636A (ko) Machine translation method, apparatus, computer device and storage medium
CN110570840B (zh) Artificial-intelligence-based smart device wake-up method and apparatus
CN110890093A (zh) Artificial-intelligence-based smart device wake-up method and apparatus
WO2020147369A1 (zh) Natural language processing method, training method and data processing device
CN112820299B (zh) Voiceprint recognition model training method, apparatus and related device
CN111539212A (zh) Text information processing method, apparatus, storage medium and electronic device
CN113821589B (zh) Text label determination method and apparatus, computer device and storage medium
CN111597804B (zh) Entity recognition model training method and related apparatus
CN112214605A (zh) Text classification method and related apparatus
CN109543014B (zh) Human-machine dialogue method, apparatus, terminal and server
CN113761122A (zh) Event extraction method, related apparatus, device and storage medium
CN111723783B (zh) Content recognition method and related apparatus
CN113569043A (zh) Text category determination method and related apparatus
CN113505596A (zh) Topic switching labeling method, apparatus and computer device
CN113806532B (zh) Training method, apparatus, medium and device for a figurative sentence judgment model
CN117057345B (zh) Character relationship acquisition method and related product
CN116959407A (zh) Pronunciation prediction method, apparatus and related product
CN113590832A (zh) Text recognition method based on position information and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18853742; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2018853742; Country of ref document: EP; Effective date: 20191112)
ENP Entry into the national phase (Ref document number: 20197036824; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2020514506; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)