WO2019047971A1 - Image recognition method, terminal, and storage medium - Google Patents
Image recognition method, terminal, and storage medium
- Publication number
- WO2019047971A1 (PCT/CN2018/105009)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- input data
- network model
- model
- annotation
- timing
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- the embodiments of the present application relate to the field of machine learning, and in particular, to an image recognition method, a terminal, and a storage medium.
- the system framework of image recognition generally includes an encoder (Encoder) and a decoder (Decoder).
- in the related art, an image recognition method is proposed, including: first, performing feature extraction on an image by an encoder to obtain a feature vector and an annotation vector (Annotation Vectors) set, where the feature vector is obtained by global feature extraction of the image and the annotation vector set is obtained by local feature extraction of the image; then, initializing the feature vector to obtain initial input data, which is used to indicate the initial state of the decoder and generally includes initial hidden state (Hidden State) information and initial memory cell (Memory Cell) state information.
- next, manually designed specific information is extracted from the image as guiding information, and based on the guiding information, the annotation vector set and the initial input data are decoded by the decoder to obtain a description sentence of the image.
- the guiding information is used to guide the decoding process of the decoder to improve the quality of the generated description sentence, so that the generated description sentence can describe the image more accurately and conform to its semantics.
- the embodiments of the present application provide an image recognition method, a terminal, and a storage medium, which can solve the problem in the related art that manually designed specific guiding information cannot accurately guide the generation of the description sentence, resulting in low quality of the generated description sentence.
- the technical solution is as follows:
- an image recognition method is provided, the method being performed by a terminal, the method comprising:
- an image recognition apparatus comprising:
- An extraction module configured to perform feature extraction on the target image to be identified by the encoder, to obtain a feature vector and a first annotation vector set;
- a processing module configured to perform initialization processing on the feature vector to obtain first initial input data
- a generating module configured to generate, based on the first annotation vector set, first guiding information by using a first guiding network model, where the first guiding network model is configured to generate guiding information according to the annotation vector set of any image;
- a determining module configured to determine, by the decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- a terminal comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the instruction, the program, the code set or the instruction set is loaded and executed by the processor to implement the image recognition method as described in the first aspect.
- a computer readable storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the instruction, the program, the code set or the instruction set is loaded and executed by a processor to implement the image recognition method as described in the first aspect.
- in the technical solutions provided by the embodiments of the present application, a guiding network model is added between the encoder and the decoder, and the guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model can generate guiding information according to the annotation vector set of any image, the guiding information generated by the guiding network model is better adapted to the generation process of the description sentence of the target image and has high accuracy, so that the generation of the description sentence of the target image can be accurately guided, which improves the quality of the generated description sentence.
- FIG. 1 is a schematic diagram of a logical structure of an RNN model provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of a logical structure of an LSTM model provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of an image recognition system according to an embodiment of the present application.
- FIG. 4 is a schematic structural diagram of another image recognition system according to an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
- FIG. 7 is a flowchart of an image recognition method according to an embodiment of the present application.
- FIG. 8 is a flowchart of another image recognition method according to an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of a generating module 303 according to an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of another generation module 303 according to an embodiment of the present disclosure.
- FIG. 12 is a schematic structural diagram of a determining module 304 according to an embodiment of the present application.
- FIG. 13 is a schematic structural diagram of another image recognition apparatus according to an embodiment of the present disclosure.
- FIG. 14 is a schematic structural diagram of another determining module 304 according to an embodiment of the present disclosure.
- FIG. 15 is a schematic structural diagram of still another image recognition apparatus according to an embodiment of the present application.
- FIG. 16 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application.
- the encoder is used to encode the image to generate a vector, and the encoder usually adopts a CNN (Convolutional Neural Networks) model.
- the decoder is used to decode the vector generated by the encoder to translate the vector generated by the encoder into a description sentence of the image, and the decoder usually adopts a RNN (Recurrent Neural Network) model.
- the guiding information is information obtained by processing the image, usually expressed as a vector, and can be used as part of the decoder input to guide the decoding process. Introducing guiding information into the decoder can improve the performance of the decoder, ensure that the decoder can generate better description sentences, and improve the quality of the generated description sentences.
- the CNN model refers to a neural network model developed for image classification and recognition on the basis of the traditional multi-layer neural network.
- the CNN model usually includes multiple convolutional layers and at least one fully connected layer, and can perform feature extraction on an image.
- since the traditional neural network has no memory function, its inputs are independent of each other and carry no contextual relationship. However, in practical applications, the input is usually a serialized input with obvious context features, for example, when the next word in a description sentence needs to be predicted, the output of the neural network must depend on the previous input. That is, the neural network is required to have a memory function, and the RNN model is a neural network in which nodes are connected in a loop and which has a memory function; its internal memory can be used to cyclically process the input data.
- the RNN model includes a three-layer structure of an input layer, a hidden layer, and an output layer, and the hidden layer is a ring structure.
- the input layer is connected to the hidden layer, and the hidden layer is connected to the output layer.
- when the structure of the RNN model shown on the left side of FIG. 1 is expanded in time series, the structure shown on the right side of FIG. 1 is obtained.
- the input data received by the input layer of the RNN model is data sorted according to a certain time series, that is, the input data received by the input layer is sequence data. For convenience of description, the sequence data is denoted by x_1, x_2, ..., x_i, ..., x_n, the time corresponding to each piece of data in the sequence data is denoted by t_1, t_2, ..., t_i, ..., t_n, and the output data obtained by processing x_1, x_2, ..., x_i, ..., x_n respectively is denoted by f_1, f_2, ..., f_i, ..., f_n.
- the RNN model sequentially processes each piece of input data according to timing, and each such processing step is referred to as a timing step.
- the input data received by the input layer at time t_1 is x_1, and x_1 is transmitted to the hidden layer; the hidden layer processes x_1 and transmits the processed data to the output layer, to obtain the output data f_1 at time t_1.
- at time t_2, the input data received by the input layer is x_2, and x_2 is transmitted to the hidden layer; the hidden layer processes x_2 according to the output data f_1 at time t_1 and transmits the processed data to the output layer, to obtain the output data f_2 at time t_2.
- at any time t_i, in addition to receiving the data x_i transmitted by the input layer at time t_i, the hidden layer also receives the output data f_i-1 of time t_i-1, and processes x_i according to f_i-1 to obtain the output data f_i at time t_i.
- the LSTM network model is a special RNN model that can process and predict important events with relatively long intervals and delays in time series.
- the LSTM network model includes an LSTM unit provided with an input gate, a forget gate, and an output gate, and the input data can be processed at each timing step based on the set input gate, forget gate, and output gate.
- the LSTM unit is a ring structure, and at any timing step t performed by the LSTM unit, the LSTM unit can process the input data x_t of the timing step t and the output data f_t-1 of the previous timing step t-1 to obtain the output data f_t of the timing step t.
- for example, after the LSTM unit receives the input data x_1 of timing step t_1, it can process x_1 to obtain the output data f_1 of timing step t_1; then f_1 is fed back into the LSTM unit, and f_1 and x_2 can be processed to obtain the output data f_2 of timing step t_2, and so on, until at timing step t_n the input data x_n and the output data f_n-1 of timing step t_n-1 are processed to obtain the output data f_n, where n is the number of times the LSTM network model cyclically processes the input data.
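- as an illustration of this cyclic processing, the following is a minimal sketch using PyTorch's LSTMCell; the tensor sizes, the batch size, and the zero initial states are assumptions made only for the example.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, n_steps = 64, 128, 5  # hypothetical dimensions

lstm_cell = nn.LSTMCell(input_dim, hidden_dim)

# Sequence data x_1, x_2, ..., x_n (batch size 1 for simplicity).
xs = [torch.randn(1, input_dim) for _ in range(n_steps)]

# Initial implicit state and memory cell state (assumed to be zeros here).
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)

outputs = []
for x_t in xs:
    # Each iteration is one timing step: the unit processes the input data x_t
    # together with the output data (h, c) of the previous timing step.
    h, c = lstm_cell(x_t, (h, c))
    outputs.append(h)  # output data f_t of the current timing step
```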
- the review network is an image recognition network based on the encoder-decoder framework, and includes a reviewer and a decoder. Both the reviewer and the decoder typically use an RNN model. The reviewer can further mine the interaction between the global features and the local features extracted from the image by the encoder, and generate initial input data for the decoder based on this interaction, so as to improve the performance of the decoder.
- Embodiments of the present application can be applied to early childhood education, image retrieval, blind reading or chat systems, where images are often automatically translated into natural language.
- the image recognition method provided by the embodiment of the present application can be used to translate the image seen by a young child into a corresponding description sentence, and then convert the description sentence into speech for playback, so that young children can learn image content by combining the image and the speech.
- the image recognition method provided by the embodiment of the present application may be used to translate the image into a corresponding description statement, so as to accurately classify the image according to the description sentence of the image, or according to the description sentence of the image. Accurately retrieve images.
- the image may be first translated into a corresponding description sentence, and then the description sentence is converted into a voice, so that the blind person can recognize the image through the voice, or
- the description statement is converted into Braille so that the blind person can recognize the image by reading Braille.
- the image in the chat window can be translated into a corresponding description sentence, and the description sentence is displayed.
- FIG. 3 is a schematic structural diagram of an image recognition system according to an embodiment of the present application. As shown in FIG. 3, the image recognition system includes an encoder 10, a first boot network model 20, and a decoder 30.
- the encoder 10 is used for encoding the target image to be identified, that is, performing feature extraction on the target image to obtain a feature vector and a first annotation vector set.
- the feature vector is used to indicate a global feature of the target image
- the first set of annotation vectors is used to indicate local features of the target image.
- after obtaining the first annotation vector set, the encoder 10 can output it to the decoder 30 and the first guiding network model 20, respectively.
- for the feature vector, the encoder 10 may perform initialization processing on it to obtain first initial input data and then output the first initial input data to the decoder 30; alternatively, the encoder 10 may output the feature vector to another model, and the other model performs initialization processing on the feature vector output by the encoder 10 to obtain the first initial input data and outputs the first initial input data to the decoder 30.
- the first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10, and then output the first guiding information to the decoder 30; the first guiding network model is obtained by training with the annotation vector sets of sample images.
- the decoder 30 is configured to determine a description sentence of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- compared with the related art, the image recognition system shown in FIG. 3 adds a guiding network model between the encoder and the decoder. Since the guiding network model can generate guiding information according to the annotation vector set of any image, compared with manually designed guiding information, the guiding information generated by the guiding network model is better adapted to the generation process of the description sentence of the target image and has high accuracy, so that the generation of the description sentence can be accurately guided, which improves the quality of the generated description sentence.
- FIG. 4 is a schematic structural diagram of another image recognition system according to an embodiment of the present application.
- the image recognition system includes an encoder 10, a first guidance network model 20, a decoder 30, and a multi-example model 40.
- the multi-example model 40 is configured to process the target image to be identified to obtain attribute information of the target image, where the attribute information is used to indicate the predicted probability of words appearing in the description sentence of the target image, and to output the attribute information of the target image to the first guiding network model 20.
- the first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10 and the attribute information of the target image output by the multi-example model 40.
- the first guiding network model 20 can comprehensively determine the first guiding information according to the first annotation vector set and the attribute information of the target image, thereby further improving the accuracy of the generated first guiding information.
- FIG. 5 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
- the image recognition system includes an encoder 10, a first guiding network model 20, a reviewer 50, a second guiding network model 60, and a decoder 30.
- the function of the encoder 10 in FIG. 5 is the same as that of the encoder 10 in FIG. 3 .
- the first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10, and output the first guiding information to the reviewer 50.
- the reviewer 50 is configured to determine a second annotation vector set and second initial input data based on the first initial input data, the first annotation vector set, and the first guiding information, output the second annotation vector set and the second initial input data to the decoder 30, and output the second annotation vector set to the second guiding network model 60.
- the second initial input data is the initial input data of the decoder 30 for indicating the initial state of the decoder 30, and may specifically include initial implicit state information and initial memory cell state information.
- the second guiding network model 60 is configured to generate second guiding information based on the second annotation vector set, and output the second guiding information to the decoder 30; the second guiding network model is also obtained by training with sample images.
- the decoder 30 is configured to decode the second annotation vector set and the second initial input data based on the second guiding information to obtain a description statement of the target image.
- the interaction between the local features and the global features of the target image can be further mined by the reviewer, so that the generated second annotation vector set and second initial input data can more accurately indicate the characteristics of the target image, further improving the system performance of the image recognition system and thereby improving the quality of the generated description sentence.
- FIG. 6 is a schematic structural diagram of still another image recognition system according to an embodiment of the present application.
- the image recognition system includes an encoder 10, a first guiding network model 20, a reviewer 50, a second guiding network model 60, a decoder 30, and a multi-example model 40.
- the functions of the encoder 10, the reviewer 50, and the decoder 30 in FIG. 6 are the same as those in FIG. 5; for details, reference may be made to the description of FIG. 5, which is not repeated here.
- the multi-example model 40 is used to process the target image to be identified, obtain attribute information of the target image, and output the attribute information of the target image to the first guiding network model 20 and the second guiding network model 60, respectively.
- the first guiding network model 20 is configured to generate first guiding information based on the first annotation vector set output by the encoder 10 and the attribute information of the target image output by the multi-example model 40, and to output the first guiding information to the reviewer 50.
- the second guiding network model 60 is configured to generate second guiding information based on the second annotation vector set output by the reviewer 50 and the attribute information of the target image output by the multi-example model 40, and to output the second guiding information to the decoder 30, so that the decoder 30 decodes the second annotation vector set and the second initial input data based on the second guiding information to obtain a description sentence of the target image.
- both the first guiding network model 20 and the second guiding network model 60 can comprehensively determine the guiding information according to the attribute information and the annotation vector set of the target image, further improving the accuracy of the generated guiding information.
- the image recognition systems shown in FIG. 3 to FIG. 6 can be trained based on a plurality of sample images and the description sentences of the plurality of sample images, that is, the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder can be obtained through training, so that the first guiding network model and the second guiding network model can adaptively learn how to generate accurate guiding information during the training process, thereby improving the accuracy of the generated guiding information.
- FIG. 7 is a flowchart of an image recognition method according to an embodiment of the present disclosure.
- the method may be performed by a terminal, and the terminal may be a mobile phone, a tablet computer, or a computer.
- the terminal may include the above image recognition system; for example, software carrying the above image recognition system may be installed on the terminal. Referring to FIG. 7, the method includes:
- Step 101 Perform feature extraction on the target image to be identified by the encoder, to obtain a feature vector and a first annotation vector set.
- the target image may be first input into an encoder, and the target image is subjected to feature extraction by an encoder to obtain a feature vector of the target image and a first annotation vector set respectively.
- specifically, global feature extraction may be performed on the target image by the encoder to obtain the feature vector, and local feature extraction may be performed on the target image by the encoder to obtain the first annotation vector set.
- the feature vector is used to indicate a global feature of the target image
- the annotation vectors in the first annotation vector set are used to indicate local features of the target image.
- the encoder may adopt a CNN model.
- the feature vector may be extracted through the last fully connected layer of the CNN model, and the first annotation vector set may be extracted through the last convolutional layer of the CNN model.
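- as a concrete illustration, the following is a minimal sketch of extracting a global feature vector and a set of annotation vectors with a CNN; the choice of ResNet-50, the layer boundaries, and the tensor shapes are assumptions made for the example rather than requirements of this application.

```python
import torch
import torchvision.models as models

# In practice a pretrained CNN (e.g. ImageNet weights) would be loaded here.
cnn = models.resnet50()
cnn.eval()

image = torch.randn(1, 3, 224, 224)  # a single preprocessed target image

with torch.no_grad():
    # Local features: output of the last convolutional stage,
    # a 7x7 grid of 2048-d vectors -> 49 annotation vectors.
    conv_backbone = torch.nn.Sequential(*list(cnn.children())[:-2])
    feature_map = conv_backbone(image)                           # (1, 2048, 7, 7)
    annotation_vectors = feature_map.flatten(2).transpose(1, 2)  # (1, 49, 2048)

    # Global feature: the pooled vector that feeds the fully connected layer.
    feature_vector = feature_map.mean(dim=(2, 3))                # (1, 2048)
```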
- Step 102 Initialize the feature vector to obtain first initial input data.
- the first initial input data refers to the initial input data to be input into the next processing model after the encoder, and is used to indicate the initial state of that next processing model, which may be a decoder or a reviewer.
- the first initial input data may include first initial implicit state information and first initial memory cell state information; the first initial implicit state information is used to indicate the initial state of the hidden layer of the next processing model, and the first initial memory cell state information is used to indicate the initial state of the memory unit of the next processing model.
- the feature vector may be subjected to initialization processing such as linear transformation to obtain first initial input data.
- the feature vector may be initialized by the encoder to obtain the first initial input data, or the feature vector output by the encoder may be initialized by another model to obtain the first initial input data, which is not limited in this embodiment of the present application.
- for example, the encoder may include a CNN model for performing feature extraction on the target image and an initialization model for initializing the feature vector; after the encoder extracts the feature vector through the CNN model, the feature vector can be initialized by the initialization model to obtain the first initial input data.
- alternatively, the encoder may be used only for feature extraction on the target image, and an initialization model is added after the encoder, the initialization model being used to initialize the feature vector; after the encoder performs feature extraction on the target image to obtain the feature vector, the feature vector may be output to the initialization model, and the initialization model then performs initialization processing on the feature vector to obtain the first initial input data.
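- the following is a minimal sketch of such an initialization by linear transformation; the layer sizes, the tanh nonlinearity, and the use of two separate linear layers for the implicit state and the memory cell state are assumptions made for illustration.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 2048, 512  # hypothetical dimensions

init_h = nn.Linear(feature_dim, hidden_dim)  # produces the initial implicit state h_0
init_c = nn.Linear(feature_dim, hidden_dim)  # produces the initial memory cell state c_0

feature_vector = torch.randn(1, feature_dim)  # global feature of the target image

# First initial input data for the next processing model (decoder or reviewer).
h0 = torch.tanh(init_h(feature_vector))
c0 = torch.tanh(init_c(feature_vector))
```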
- Step 103 Generate first guiding information by using a first guiding network model based on the first set of annotation vectors, the first guiding network model for generating guiding information according to the annotation vector set of any image.
- based on the first annotation vector set, generating the first guiding information by using the first guiding network model may be implemented in either of the following two manners:
- the first implementation manner: performing linear transformation on the first annotation vector set based on a first matrix formed by the model parameters in the first guiding network model to obtain a second matrix, and determining the first guiding information based on the maximum value of each row in the second matrix.
- the first guiding network model can be obtained by training with the annotation vector sets of sample images.
- specifically, each model in FIG. 3 may be replaced with a corresponding model to be trained, and the resulting image recognition system is trained based on a plurality of sample images and the description sentences of the plurality of sample images; during the training process, the encoder to be trained can extract annotation vector sets from the plurality of sample images and output them to the guiding network model to be trained, so that after the training of the entire image recognition system is completed, the trained guiding network model can be used as the first guiding network model.
- the encoder to be trained may be an untrained encoder or a pre-trained encoder, which is not limited in this embodiment of the present application.
- by using a pre-trained encoder to train the guiding network model to be trained, the training efficiency of the entire image recognition system can be improved, and thus the training efficiency of the guiding network model to be trained can be improved.
- the first set of annotation vectors is also in the form of a matrix, and the first matrix is a matrix composed of model parameters of the first guidance network model and used for linear transformation of the first annotation vector set. Specifically, the first set of annotation vectors may be multiplied by the first matrix to linearly transform the first set of annotation vectors to obtain a second matrix.
- determining the first guiding information based on the maximum value of each row in the second matrix includes: selecting the maximum value of each row in the second matrix, forming the selected maximum values into a matrix whose number of rows is unchanged and whose number of columns is 1, and determining the resulting matrix as the first guiding information.
- the first guiding information can be determined by the following formula (1):
- in formula (1), the max function takes the maximum value of each row of the matrix to be processed and returns a matrix whose number of rows is unchanged and whose number of columns is 1.
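- formula (1) itself is not reproduced here, but the operation it describes (a linear transformation of the annotation vector set followed by a row-wise maximum) can be sketched as follows; the matrix orientation and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

num_vectors, annot_dim, guide_dim = 49, 2048, 512  # hypothetical sizes

# First matrix: model parameters of the first guiding network model.
W1 = nn.Parameter(torch.randn(annot_dim, guide_dim))

# First annotation vector set, one annotation vector per row.
A = torch.randn(num_vectors, annot_dim)

second_matrix = A @ W1                              # linear transformation
v = second_matrix.max(dim=1, keepdim=True).values   # row-wise maximum, shape (num_vectors, 1)
# v is taken as the first guiding information
```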
- the second implementation manner: when the first guiding network model is used to generate guiding information according to the annotation vector set and the attribute information of any image, the target image may be used as the input of a multi-example model, and the multi-example model is used to process the target image to obtain the attribute information of the target image; the first annotation vector set is linearly transformed based on a third matrix formed by the model parameters in the first guiding network model to obtain a fourth matrix; a fifth matrix is generated based on the fourth matrix and the attribute information of the target image; and the first guiding information is determined based on the maximum value of each row in the fifth matrix.
- the attribute information of the sample image is used to indicate the probability of words predicted to appear in the description sentence of the sample image.
- the multi-example model is a model obtained by training with a plurality of sample images and the description sentences of the plurality of sample images, and is capable of outputting the attribute information of an image, that is, the multi-example model can predict the probability of words that may appear in the description sentence of the image.
- the attribute information may be MIL (Multi-instance learning) information or the like.
- the first guiding network model can be obtained by training with the annotation vector sets and the attribute information of sample images.
- specifically, each model in FIG. 4 may be replaced with a corresponding model to be trained, and the resulting image recognition system is trained based on a plurality of sample images and the description sentences of the plurality of sample images; during the training process, the encoder to be trained can extract annotation vectors from the sample images and output them to the guiding network model to be trained, and the multi-example model to be trained can process the sample images to obtain attribute information and output the attribute information to the guiding network model to be trained, so that the guiding network model to be trained can be trained based on the annotation vectors and the attribute information of the sample images; after the training of the entire image recognition system is completed, the trained guiding network model can be used as the first guiding network model.
- the encoder to be trained may be an untrained encoder or a pre-trained encoder; the multi-example model to be trained may be an untrained multi-example model or a pre-trained multi-example model.
- This embodiment of the present application does not limit this.
- the first set of annotation vectors is also in the form of a matrix
- the third matrix is a matrix composed of model parameters of the first guidance network model and used for linear transformation of the first annotation vector set.
- the first annotation vector set may be multiplied by the third matrix to linearly transform the first annotation vector set to obtain a fourth matrix, and then generate a fifth matrix based on the fourth matrix and the attribute information of the target image. .
- determining the first guiding information based on the maximum value of each row in the fifth matrix includes: selecting the maximum value of each row in the fifth matrix, forming the selected maximum values into a matrix whose number of rows is unchanged and whose number of columns is 1, and determining the resulting matrix as the first guiding information.
- specifically, the first guiding information v can be determined by the following formula (2):
- in formula (2), the max function takes the maximum value of each row of the matrix to be processed and returns a matrix whose number of rows is unchanged and whose number of columns is 1.
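- how the fourth matrix and the attribute information are combined into the fifth matrix is not spelled out in the text above; the sketch below assumes, purely for illustration, that the attribute information is linearly projected and added to every row of the fourth matrix before the row-wise maximum is taken.

```python
import torch
import torch.nn as nn

num_vectors, annot_dim, guide_dim, vocab_size = 49, 2048, 512, 1000  # hypothetical sizes

W3 = nn.Parameter(torch.randn(annot_dim, guide_dim))  # third matrix (model parameters)
attr_proj = nn.Linear(vocab_size, guide_dim)           # assumed projection of the attribute information

A = torch.randn(num_vectors, annot_dim)                # first annotation vector set
attr = torch.rand(vocab_size)                          # attribute information: per-word probabilities

fourth_matrix = A @ W3                                 # linear transformation of the annotation vectors
fifth_matrix = fourth_matrix + attr_proj(attr)         # assumed combination with the attribute information
v = fifth_matrix.max(dim=1, keepdim=True).values       # row-wise maximum -> first guiding information
```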
- the first guiding network model can be obtained through learning, that is, it can be trained with a plurality of sample images and the description sentences of the plurality of sample images, and the guiding information can be learned automatically during the training process; therefore, the first guiding information generated by the first guiding network model has high accuracy, and the generated first guiding information can accurately guide the decoding process, thereby improving the quality of the generated description sentence of the target image.
- Step 104 Determine, by the decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- determining, by the decoder, the description statement of the target image may include the following two implementation manners:
- the first implementation manner is: decoding, according to the first guiding information, the first annotation vector set and the first initial input data by the decoder to obtain a description statement of the target image.
- the decoder typically employs an RNN model, such as an LSTM network model.
- the first annotation vector set and the first initial input data are decoded by the decoder, and the description statement of the target image may be obtained by the following steps 1)-3):
- for each first timing step performed by the first RNN model, the input data of the first timing step is determined based on the first guiding information.
- the M refers to the number of times the first RNN model cyclically processes the input data, and the M is a positive integer, and each first timing step is a processing step of the input data by the first RNN model.
- determining the input data of the first timing step based on the first guiding information may include determining, according to the first guiding information, input data of the first timing step by using the following formula (3):
- in formula (3): t is the first timing step; x_t is the input data of the first timing step; E is a word embedding matrix and is a model parameter of the first RNN model; y_t is the word corresponding to the first timing step, which is determined based on the output data of the previous first timing step; Q is a sixth matrix and is a model parameter of the first RNN model; and v is the first guiding information.
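- formula (3) is not reproduced above; the sketch below assumes that the input data is formed by adding the embedded word of the current step to a linear transform of the guiding information, which is an assumption made for illustration rather than a statement of the exact formula.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, guide_dim = 10000, 512, 512  # hypothetical sizes

E = nn.Embedding(vocab_size, embed_dim)           # word embedding matrix (model parameter)
Q = nn.Linear(guide_dim, embed_dim, bias=False)   # sixth matrix (model parameter)

y_t = torch.tensor([7])            # index of the word corresponding to this timing step
v = torch.randn(1, guide_dim)      # first guiding information

# Assumed form of formula (3): x_t combines E*y_t and Q*v by addition.
x_t = E(y_t) + Q(v)
```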
- the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step are processed by the first RNN model to obtain the output data of the first timing step.
- the output data of the first timing step may include implicit state information and memory unit state information.
- when the first timing step is the first of the M first timing steps, the output data of its previous first timing step is determined based on the first initial input data. For example, when the first initial input data includes the first initial implicit state information h_0 and the first initial memory cell state information c_0, and the first timing step is the first of the M first timing steps, the output data of its previous first timing step is h_0 and c_0.
- the first RNN model used may be an LSTM network model.
- based on the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step, the LSTM network model determines the output data of the first timing step, which may be abstractly represented by the following formula (4):
- in formula (4), t is the first timing step, x_t is the input data of the first timing step, h_t-1 is the implicit state information of the previous first timing step, and LSTM represents the processing of the LSTM network model.
- the processing of the LSTM network model can be expressed by the following formula:
- in the formula, i_t, f_t, c_t and o_t are the output data of the first timing step at the input gate, the forget gate, the memory gate and the output gate, respectively; σ is an activation function of the LSTM network model, such as a sigmoid function; tanh() is the hyperbolic tangent function; T is a matrix used for linear transformation; x_t is the input data of the first timing step; h_t-1 is the implicit state information of the previous first timing step; d_t is the target data determined based on the first annotation vector set; c_t is the memory cell state information of the first timing step; c_t-1 is the memory cell state information of the previous first timing step; and h_t is the implicit state information of the first timing step.
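- the formula referred to above is not reproduced in this text; for reference, a standard LSTM formulation consistent with the symbols defined here is given below as a plausible reconstruction only, not as the exact formula of this application (T_i, T_f, T_o, T_g denote hypothetical sub-blocks of the linear transformation matrix T, [·; ·; ·] denotes concatenation, and ⊙ denotes element-wise multiplication).

```latex
\begin{aligned}
i_t &= \sigma\left(T_i\,[x_t;\, h_{t-1};\, d_t]\right) \\
f_t &= \sigma\left(T_f\,[x_t;\, h_{t-1};\, d_t]\right) \\
o_t &= \sigma\left(T_o\,[x_t;\, h_{t-1};\, d_t]\right) \\
g_t &= \tanh\left(T_g\,[x_t;\, h_{t-1};\, d_t]\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```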
- the target data d_t may be the first annotation vector set, or may be a context vector (Context Vector), which is determined by an attention model based on the first annotation vector set and the implicit state information of the previous first timing step.
- the attention model can be used to determine which region of the target image is attended to at the previous first timing step, that is, it can compute a weight for each annotation vector, and a higher weight indicates that the corresponding annotation vector receives more attention.
- specifically, the LSTM network model may be an LSTM network model provided with an attention model; after the first annotation vector set and the implicit state information of the previous first timing step are obtained, a context vector may be determined by the attention model based on the first annotation vector set and the implicit state information of the previous first timing step, and the context vector is used as the target data.
- specifically, the attention model can calculate the similarity e_i between any annotation vector a_i and h_t-1, and then calculate the attention weight of a_i based on e_i.
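- the weight computation is not spelled out above; the sketch below uses a common additive-attention form (a learned scoring function followed by a softmax and a weighted sum) as an assumed illustration of how a context vector could be obtained from the annotation vectors and h_t-1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_vectors, annot_dim, hidden_dim, att_dim = 49, 2048, 512, 256  # hypothetical sizes

W_a = nn.Linear(annot_dim, att_dim)
W_h = nn.Linear(hidden_dim, att_dim)
w = nn.Linear(att_dim, 1)

A = torch.randn(num_vectors, annot_dim)  # first annotation vector set (one a_i per row)
h_prev = torch.randn(1, hidden_dim)      # implicit state information of the previous timing step

# Similarity e_i between each annotation vector a_i and h_{t-1} (assumed additive form).
e = w(torch.tanh(W_a(A) + W_h(h_prev))).squeeze(-1)    # shape (num_vectors,)
alpha = F.softmax(e, dim=0)                             # attention weights
context_vector = (alpha.unsqueeze(-1) * A).sum(dim=0)   # weighted sum -> context vector d_t
```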
- the output data of all the first timing steps in the M first timing steps may be combined to obtain a description statement of the target image.
- the output data of each first timing step is usually a word, and then the M words output by the M first timing steps are combined to obtain a description sentence of the target image.
- for example, the output data of the M first timing steps may be the words boy, give, girl, send, and flower, respectively, and the description sentence of the target image obtained by combining them is "the boy gives a flower to the girl".
- in order to obtain a first guiding network model capable of accurately generating guiding information based on the annotation vector set of the target image, before feature extraction is performed on the target image by the encoder to obtain the feature vector and the first annotation vector set, the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder may be combined to obtain a first cascaded network model, and the first cascaded network model is then trained by a gradient descent method based on a plurality of sample images and the description sentences of the plurality of sample images to obtain the encoder, the first guiding network model, and the decoder.
- specifically, the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder may first be constructed, according to the connection manner of FIG. 3 or FIG. 4, into an image recognition system capable of processing an image to obtain a description sentence of the image, and the image recognition system is then trained based on the plurality of sample images and the description sentences of the plurality of sample images; in the process of training the image recognition system, the first to-be-trained guiding network model can be trained, so that it can adaptively learn the guiding information during the training process, ensuring that the generated guiding information becomes more and more accurate.
- in practical applications, a multi-label margin loss may be used as the loss function of the first to-be-trained guiding network model, and based on this loss function, a stochastic gradient descent method is used to adjust the model parameters of the first to-be-trained guiding network model to obtain the first guiding network model.
- training can be performed using an annotated training set, which is a collection of ⁇ sample images, description statements> pairs, such as the MSCOCO data set (a common data set).
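- as a hedged sketch of this training step (the network used as a stand-in for the guiding network model, the tensor shapes, the way scores are matched against target words, and the hyperparameters are all assumptions made for illustration):

```python
import torch
import torch.nn as nn

vocab_size = 1000  # hypothetical size of the word list being scored

guiding_net = nn.Linear(2048, vocab_size)   # stand-in for the to-be-trained guiding network model
criterion = nn.MultiLabelMarginLoss()       # multi-label margin loss
optimizer = torch.optim.SGD(guiding_net.parameters(), lr=0.01)  # stochastic gradient descent

scores = guiding_net(torch.randn(4, 2048))  # scores for a batch of 4 sample images

# Target: indices of words appearing in each sample's description sentence, padded with -1.
target = torch.full((4, vocab_size), -1, dtype=torch.long)
target[:, :3] = torch.tensor([[5, 17, 42], [3, 8, 99], [1, 2, 3], [10, 20, 30]])

loss = criterion(scores, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```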
- the first to be trained encoder may be an untrained encoder or a pre-trained encoder, which is not limited in this embodiment of the present application.
- the first to-be-trained encoder can adopt a CNN model pre-trained on ImageNet (a computer vision system identification project, currently the world's largest image recognition database), and the CNN model can be an Inception V3 model, a ResNet model, or a VGG model, all of which are kinds of CNN models.
- in this way, the training efficiency of the entire first cascaded network model can be improved, and thus the training efficiency of the first guiding network model can be improved.
- the process of identifying the target image, obtaining the description sentence of the target image, and the process of training the guiding network model may be performed on the same terminal, or may be performed on different terminals. This embodiment of the present application does not limit this.
- the second implementation manner: determining, by the reviewer, a second annotation vector set and second initial input data based on the first guiding information, the first annotation vector set, and the first initial input data; generating second guiding information by the second guiding network model based on the second annotation vector set; and decoding, by the decoder, the second annotation vector set and the second initial input data based on the second guiding information to obtain a description sentence of the target image.
- to sum up, in the technical solution provided by the embodiment of the present application, a guiding network model is added between the encoder and the decoder, and the guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model is obtained by training with the annotation vector sets of sample images, it can adaptively learn, during the training process, how to accurately generate guiding information according to the annotation vector set of an image; therefore, the guiding information generated by the guiding network model has high accuracy and can accurately guide the generation of the description sentence of the image, which improves the quality of the generated description sentence.
- FIG. 8 is a flowchart of another image recognition method according to an embodiment of the present application, which is applied to a terminal. Referring to Figure 8, the method includes:
- Step 201 Perform feature extraction on the target image to be identified by the encoder to obtain a feature vector and a first annotation vector set.
- Step 202 Perform initialization processing on the feature vector to obtain first initial input data.
- Step 203 Generate first guiding information by using the first guiding network model based on the first annotation vector set.
- Step 204 Determine, according to the first guiding information, the first annotation vector set and the first initial input data, the second annotation vector set and the second initial input data by the reviewer.
- the decoder and the reviewer generally adopt an RNN model; of course, other models may also be used.
- the reviewer is used to further mine the interaction relationship between the global feature and the local feature extracted by the encoder from the image, and generate initial input data, ie, the second initial input, for the decoder based on the interaction relationship between the global feature and the local feature. Data to improve the performance of the decoder, thereby improving the quality of the generated description statement.
- the first initial input data refers to the input data to be input to the reviewer, and is used to indicate the initial state of the reviewer; it specifically includes first initial implicit state information and first initial memory unit state information, where the first initial implicit state information is used to indicate the initial state of the hidden layer of the reviewer, and the first initial memory unit state information is used to indicate the initial state of the memory unit of the reviewer.
- the second initial input data refers to the input data to be input to the decoder, and is used to indicate the initial state of the decoder; it specifically includes second initial implicit state information and second initial memory unit state information, where the second initial implicit state information is used to indicate the initial state of the hidden layer of the decoder, and the second initial memory cell state information is used to indicate the initial state of the memory unit of the decoder.
- determining, by the reviewer, the second annotation vector set and the second initial input data may include the following steps 1)-3):
- for each second timing step performed by the second RNN model, the input data of the second timing step is determined based on the first guiding information.
- the N is the number of times the second RNN model cyclically processes the input data, and the N is a positive integer, and each second timing step is a processing step of the second RNN model on the input data.
- the input data of the second timing step may be determined by the following formula (6):
- in formula (6): t is the second timing step; x'_t is the input data of the second timing step; E' is a word embedding matrix and is a model parameter of the second RNN model; Q' is a seventh matrix and is a model parameter of the second RNN model; and v' is the second guiding information.
- the output data of the second timing step may include implicit state information and memory cell state information.
- when the second timing step is the first of the N second timing steps, the output data of its previous second timing step is determined based on the first initial input data.
- the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step are processed by the second RNN model to obtain the output data of the second timing step.
- the output data of the second timing step may be determined based on the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step, with reference to the method for determining the output data of the first timing step based on the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step.
- the output data of the last second timing step may be determined as the second initial input data; for example, the implicit state information and the memory unit state information of the last second timing step may be determined as the initial implicit state information and the initial memory unit state information of the decoder.
- the set of implicit state information of all the timing steps in the N second timing steps may be determined as the second set of annotation vectors.
- Step 205 Generate second guiding information by using the second guiding network model based on the second annotation vector set, the second guiding network model being configured to generate guiding information according to the annotation vector set.
- the second guiding information may be generated by the second guiding network model based on the second annotation vector set with reference to the method, in step 103 of the foregoing embodiment of FIG. 7, of generating the first guiding information by the first guiding network model based on the first annotation vector set; for the specific implementation, reference may be made to the description of step 103, and details are not repeated here.
- the second guiding network model may be obtained by training with sample images together with the first guiding network model, and the guiding information can be learned automatically during the training process; therefore, the guiding information generated by the first guiding network model and the second guiding network model has high accuracy, and the generated guiding information can accurately guide the decoding process, thereby improving the quality of the generated description sentence of the target image.
- Step 206 Decode the second annotation vector set and the second initial input data by the decoder based on the second guiding information to obtain a description sentence of the target image.
- with reference to the method in step 104 of the foregoing embodiment of FIG. 7, in which the first annotation vector set and the first initial input data are decoded by the decoder according to the first guiding information to obtain the description sentence of the target image, the second annotation vector set and the second initial input data may be decoded by the decoder based on the second guiding information to obtain the description sentence of the target image.
- before feature extraction is performed on the target image by the encoder to obtain the feature vector and the first annotation vector set, the second to-be-trained encoder, the second to-be-trained guiding network model, the to-be-trained reviewer, the third to-be-trained guiding network model, and the second to-be-trained decoder may be trained. Specifically, they may first be constructed, according to the connection manner of FIG. 5 or FIG. 6, into an image recognition system capable of processing an image to obtain a description sentence of the image, and the image recognition system is then trained based on a plurality of sample images and the description sentences of the plurality of sample images; in the process of training the image recognition system, the second to-be-trained guiding network model and the third to-be-trained guiding network model can be trained, so that they can adaptively learn the guiding information during the training process, ensuring that the generated guiding information becomes more and more accurate.
- the second to-be-trained encoder may be an untrained encoder or a pre-trained encoder, and the to-be-trained reviewer may be an untrained reviewer or a pre-trained reviewer; this is not limited in this embodiment of the present application.
- by using a pre-trained encoder as the second to-be-trained encoder, or by using a pre-trained reviewer as the to-be-trained reviewer, the training efficiency of the entire second cascaded network model can be improved, thereby improving the training efficiency of the first guiding network model and the second guiding network model.
- the process of identifying the target image to obtain its description sentence and the process of training the guiding network models may be performed on the same terminal or on different terminals, which is not limited in this embodiment of the present application.
- to sum up, in the technical solution provided by the embodiment of the present application, a guiding network model is added between the encoder and the decoder, and the guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model can adaptively learn the guiding information during the training process, the guiding information generated by the guiding network model has high accuracy and can accurately guide the generation of the description sentence of the image, thereby improving the quality of the generated description sentence.
- Moreover, the interaction between the local features and the global features of the target image can be further mined by the reviewer, so that the generated second annotation vector set and second initial input data indicate the characteristics of the target image more accurately, which further improves the performance of the image recognition system and thus the quality of the generated description statement.
- FIG. 9 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application, and the apparatus may be a terminal.
- the device includes:
- the extraction module 301 is configured to perform feature extraction on the target image to be identified by the encoder, to obtain a feature vector and a first annotation vector set;
- the processing module 302 is configured to perform initialization processing on the feature vector to obtain first initial input data.
- a generating module 303 configured to generate, according to the first set of annotation vectors, first guiding information by using a first guiding network model, where the first guiding network model is configured to generate guiding information according to the annotation vector set of any image;
- the determining module 304 is configured to determine, by the decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- the generating module 303 includes:
- the first linear transformation unit 3031 is configured to perform linear transformation on the first annotation vector set based on the first matrix formed by the model parameters in the first guidance network model to obtain a second matrix;
- the first determining unit 3032 is configured to determine the first guiding information based on a maximum value of each row in the second matrix.
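A minimal sketch of the operation of the first linear transformation unit 3031 and the first determining unit 3032 may look as follows; the array shapes, variable names, and the NumPy framing are assumptions made for illustration and are not part of the embodiment:

```python
import numpy as np

def first_guiding_information(annotation_vectors, W1):
    """One possible reading of the first guiding network model:
    a linear transformation followed by a row-wise maximum.

    annotation_vectors: (k, d) array, the first annotation vector set
                        (one row per image region).
    W1: (d, g) array, the "first matrix" formed by the model parameters
        of the first guiding network model.
    """
    second_matrix = annotation_vectors @ W1   # linear transformation -> "second matrix"
    return second_matrix.max(axis=1)          # maximum value of each row -> first guiding information
```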
- the first guiding network model is configured to generate guiding information according to an annotation vector set and attribute information of any image, the attribute information being used to indicate the probability that a predicted word appears in a description sentence of the image;
- the generating module 303 includes:
- a processing unit 3033 configured to use the target image as an input of a multi-example model, and process the target image by using the multi-example model to obtain attribute information of the target image;
- a second linear transformation unit 3034 configured to perform linear transformation on the first annotation vector set based on a third matrix formed by the model parameters in the second guidance network model to obtain a fourth matrix
- a first generating unit 3035 configured to generate a fifth matrix based on the fourth matrix and attribute information of the target image
- the second determining unit 3036 is configured to determine the first guiding information based on a maximum value of each row in the fifth matrix.
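A hedged sketch of the attribute-based variant (processing unit 3033 through second determining unit 3036) is given below; how the fourth matrix and the attribute information are combined into the fifth matrix is not specified above, so the element-wise scaling used here is purely an assumption, as are the shapes and names:

```python
import numpy as np

def attribute_guiding_information(annotation_vectors, attribute_probs, W3):
    """Sketch of guiding-information generation using attribute information.

    annotation_vectors: (k, d) first annotation vector set.
    attribute_probs: (g,) attribute information of the target image, e.g. the
                     probabilities of candidate words appearing in the
                     description sentence (output of the multi-example model).
    W3: (d, g) the "third matrix" formed by model parameters.
    """
    fourth_matrix = annotation_vectors @ W3           # linear transformation -> fourth matrix
    fifth_matrix = fourth_matrix * attribute_probs    # combine with attribute info (assumed element-wise)
    return fifth_matrix.max(axis=1)                   # maximum value of each row -> guiding information
```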
- The determining module 304 is configured to:
- decode, by the decoder based on the first guiding information, the first annotation vector set and the first initial input data to obtain a description statement of the target image.
- the determining module 304 includes:
- a third determining unit 3041 configured to: when the decoder adopts a first recurrent neural network (RNN) model and the first RNN model is used to perform M first timing steps, determine, for each first timing step performed by the first RNN model, the input data of the first timing step based on the first guiding information;
- where M is the number of times the first RNN model cyclically processes input data, M is a positive integer, and each first timing step is one processing step performed by the first RNN model on the input data;
- a fourth determining unit 3042 configured to determine the output data of the first timing step based on the input data of the first timing step, the first annotation vector set, and the output data of the previous first timing step;
- where, when the first timing step is the first one of the M first timing steps, the output data of the previous first timing step is determined based on the first initial input data;
- the fifth determining unit 3043 is configured to determine a description statement of the target image based on all output data of the M first timing steps.
- the third determining unit 3041 is configured to:
- determine the input data of the first timing step by using the following formula:
- x_t = E·y_t + Q·v
- where t denotes the first timing step, x_t is the input data of the first timing step, E is a word embedding matrix and is a model parameter of the first RNN model, y_t is the one-hot vector of the word corresponding to the first timing step, the word corresponding to the first timing step being determined based on the output data of the previous first timing step, Q is a sixth matrix and is a model parameter of the first RNN model, and v is the first guiding information.
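The formula above can be illustrated with a short sketch; the RNN step function, the shapes, and the greedy word selection are assumptions made for illustration and do not describe the embodiment's actual decoder:

```python
import numpy as np

def decoder_input(y_t, E, Q, v):
    """x_t = E*y_t + Q*v for one first timing step.

    y_t: (V,) one-hot vector of the word corresponding to this timing step.
    E:   (e, V) word embedding matrix (model parameter of the first RNN model).
    Q:   (e, g) the "sixth matrix" (model parameter of the first RNN model).
    v:   (g,)  first guiding information.
    """
    return E @ y_t + Q @ v

def decode(rnn_step, first_initial_input, annotation_vectors, E, Q, v, start_word_id, M):
    """Hypothetical loop over the M first timing steps.

    rnn_step(x_t, prev_output, annotation_vectors) -> (output_t, word_id)
    is an assumed callable standing in for one step of the first RNN model.
    """
    prev_output, word_id, words = first_initial_input, start_word_id, []
    for _ in range(M):
        y_t = np.eye(E.shape[1])[word_id]     # one-hot vector of the current word
        x_t = decoder_input(y_t, E, Q, v)     # input data of this timing step
        prev_output, word_id = rnn_step(x_t, prev_output, annotation_vectors)
        words.append(word_id)
    return words                              # all output words form the description statement
```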
- the apparatus further includes:
- a first combination module 305 configured to combine the first to-be-trained encoder, the first to-be-trained guiding network model, and the first to-be-trained decoder to obtain a first cascade network model;
- a first training module 306 configured to train the first cascade network model by using a gradient descent method based on the plurality of sample images and the description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, and the decoder.
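A hedged sketch of how such a first cascade network model might be trained end to end with gradient descent is shown below; the module interfaces, the loss convention, and the use of PyTorch are assumptions made for illustration only:

```python
import torch

def train_first_cascade(encoder, guide_net, decoder, loader, epochs=1, lr=1e-3):
    """encoder, guide_net and decoder are assumed torch.nn.Module instances
    standing in for the first to-be-trained encoder, first to-be-trained
    guiding network model and first to-be-trained decoder; loader yields
    (images, captions) pairs of sample images and description statements."""
    params = (list(encoder.parameters())
              + list(guide_net.parameters())
              + list(decoder.parameters()))
    optimizer = torch.optim.SGD(params, lr=lr)       # gradient descent method
    for _ in range(epochs):
        for images, captions in loader:
            feats, annotations = encoder(images)             # feature vector + annotation vector set
            v = guide_net(annotations)                       # guiding information
            loss = decoder(annotations, feats, v, captions)  # assumed to return a caption loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```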
- the determining module 304 includes:
- the sixth determining unit 3044 is configured to determine, according to the first guiding information, the first set of annotation vectors, and the first initial input data, the second annotation vector set and the second initial input data by the reviewer;
- a second generating unit 3045 configured to generate, according to the second set of annotation vectors, second guiding information by using a second guiding network model, where the second guiding network model is obtained by training sample images;
- the encoding unit 3046 is configured to encode the second annotation vector set and the second initial input data by the encoder based on the second guiding information to obtain a description statement of the target image.
- The sixth determining unit 3044 is configured to:
- when the reviewer adopts a second RNN model and the second RNN model is used to perform N second timing steps, determine, for each second timing step performed by the second RNN model, the input data of the second timing step based on the first guiding information;
- where N is the number of times the second RNN model cyclically processes input data, N is a positive integer, and each second timing step is one processing step performed by the second RNN model on the input data;
- determine the output data of the second timing step based on the input data of the second timing step, the first annotation vector set, and the output data of the previous second timing step; where, when the second timing step is the first one of the N second timing steps, the output data of the previous second timing step is determined based on the first initial input data;
- determine the second initial input data based on the output data of the last one of the N second timing steps; and
- determine the second annotation vector set based on all output data of the N second timing steps.
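A non-normative sketch of the reviewer loop described by the sixth determining unit 3044 follows; the step function and array shapes are assumptions for illustration:

```python
import numpy as np

def run_reviewer(rnn_step, first_guiding_info, first_annotations, first_initial_input, N):
    """rnn_step(x_t, prev_output, annotations) -> output_t is an assumed
    callable standing in for one second timing step of the second RNN model."""
    prev_output = first_initial_input        # plays the role of the "previous" output at the first step
    outputs = []
    for _ in range(N):
        x_t = first_guiding_info             # input data determined from the first guiding information
        prev_output = rnn_step(x_t, prev_output, first_annotations)
        outputs.append(prev_output)
    second_annotations = np.stack(outputs)   # second annotation vector set: all N outputs
    second_initial_input = outputs[-1]       # second initial input data: output of the last step
    return second_annotations, second_initial_input
```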
- the apparatus further includes:
- a second combination module 307 configured to combine the second to-be-trained encoder, the second to-be-trained guiding network model, the to-be-trained reviewer, the third to-be-trained guiding network model, and the second to-be-trained decoder to obtain a second cascade network model;
- a second training module 308 configured to train the second cascade network model by using a gradient descent method based on the plurality of sample images and the description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder.
- In the embodiments of the present application, a guiding network model is added between the encoder and the decoder, and guiding information may be generated by the guiding network model based on the annotation vector set. Because the guiding network model is obtained by training on the annotation vector sets of sample images, it can adaptively learn during training how to accurately generate guiding information from the annotation vector set of an image; the guiding information generated by the guiding network model therefore has high accuracy and can accurately guide the encoding process of the image, which improves the quality of the generated description statement.
- FIG. 16 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application.
- The terminal 400 may include a communication unit 410, a memory 420 including one or more computer-readable storage media, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a WiFi (Wireless Fidelity) module 470, a processor 480 having one or more processing cores, a power supply 490, and the like.
- It will be understood by those skilled in the art that the terminal structure shown in FIG. 16 does not constitute a limitation to the terminal, and the terminal may include more or fewer components than those illustrated, combine certain components, or use a different arrangement of components. Among them:
- the communication unit 410 can be used for transmitting and receiving information and receiving and transmitting signals during a call.
- the communication unit 410 can be an RF (Radio Frequency) circuit, a router, a modem, or the like.
- RF circuits serving as the communication unit include, but are not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), and the like.
- communication unit 410 can also communicate with the network and other devices via wireless communication.
- The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.
- the memory 420 can be used to store software programs and modules, and the processor 480 executes various functional applications and data processing by running software programs and modules stored in the memory 420.
- The memory 420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function (such as a sound playing function or an image playing function), and the like; and the data storage area may store data created according to the use of the terminal 400 (such as audio data and a phone book), and the like.
- The memory 420 may include a high-speed random access memory, and may also include a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 420 may further include a memory controller to provide the processor 480 and the input unit 430 with access to the memory 420.
- the input unit 430 can be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
- input unit 430 can include touch-sensitive surface 431 as well as other input devices 432.
- The touch-sensitive surface 431, also referred to as a touch screen or a touch pad, can collect a touch operation performed by a user on or near it (for example, an operation performed by the user on or near the touch-sensitive surface 431 with a finger, a stylus, or any other suitable object or accessory), and drive a corresponding connected device according to a preset program.
- the touch-sensitive surface 431 can include two portions of a touch detection device and a touch controller.
- The touch detection device detects the touch orientation of the user, detects a signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 480, and can receive and execute commands sent by the processor 480.
- the touch sensitive surface 431 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
- the input unit 430 can also include other input devices 432.
- other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
- Display unit 440 can be used to display information entered by the user or information provided to the user and various graphical user interfaces of terminal 400, which can be constructed from graphics, text, icons, video, and any combination thereof.
- the display unit 440 may include a display panel 441.
- the display panel 441 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
- Further, the touch-sensitive surface 431 may cover the display panel 441. When detecting a touch operation on or near it, the touch-sensitive surface 431 transmits the operation to the processor 480 to determine the type of the touch event, and the processor 480 then provides a corresponding visual output on the display panel 441 according to the type of the touch event.
- Although the touch-sensitive surface 431 and the display panel 441 are implemented as two separate components to implement the input and output functions, in some embodiments the touch-sensitive surface 431 may be integrated with the display panel 441 to implement the input and output functions.
- Terminal 400 may also include at least one type of sensor 450, such as a light sensor, motion sensor, and other sensors.
- the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 441 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 441 and/or the backlight when the terminal 400 moves to the ear.
- As one type of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (usually on three axes), and can detect the magnitude and direction of gravity when it is stationary.
- Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the terminal 400, and details are not described herein again.
- the audio circuit 460, the speaker 461, and the microphone 462 can provide an audio interface between the user and the terminal 400.
- On one hand, the audio circuit 460 can convert received audio data into an electrical signal and transmit it to the speaker 461, and the speaker 461 converts the electrical signal into a sound signal for output; on the other hand, the microphone 462 converts a collected sound signal into an electrical signal, which is received by the audio circuit 460 and converted into audio data; the audio data is then output to the processor 480 for processing and sent, for example via the communication unit 410, to another terminal, or the audio data is output to the memory 420 for further processing.
- the audio circuit 460 may also include an earbud jack to provide communication of the peripheral earphones with the terminal 400.
- the terminal may be configured with a wireless communication unit 470, which may be a WIFI module.
- WiFi is a short-range wireless transmission technology. Through the wireless communication unit 470, the terminal 400 can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access.
- Although the wireless communication unit 470 is shown in the drawings, it can be understood that it is not an essential component of the terminal 400 and may be omitted as needed without changing the essence of the invention.
- The processor 480 is the control center of the terminal 400, and connects all parts of the entire mobile phone by using various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 420 and invoking the data stored in the memory 420, the processor 480 performs various functions of the terminal 400 and processes data, thereby performing overall monitoring on the mobile phone.
- the processor 480 may include one or more processing cores; preferably, the processor 480 may integrate an application processor and a modem processor, where the application processor mainly processes an operating system, a user interface, an application, and the like.
- the modem processor primarily handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 480.
- The terminal 400 further includes a power supply 490 (such as a battery) that supplies power to the components.
- Preferably, the power supply may be logically connected to the processor 480 by using a power management system, so that functions such as charging, discharging, and power consumption management are implemented by using the power management system.
- The power supply 490 may further include any one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
- the terminal 400 may further include a camera, a Bluetooth module, and the like, and details are not described herein.
- The terminal includes a processor and a memory;
- the memory stores at least one instruction, at least one program, a code set, or an instruction set;
- and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the image recognition method described above with respect to the embodiment of FIG. 7 or FIG.
- A computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set is loaded and executed by a processor to implement the image recognition method described above with respect to the embodiment of FIG. 7 or FIG.
- A person skilled in the art may understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium.
- The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Abstract
Description
Claims (20)
- 1. An image recognition method, the method being performed by a terminal, the method comprising: performing, by an encoder, feature extraction on a target image to be identified to obtain a feature vector and a first annotation vector set; performing initialization processing on the feature vector to obtain first initial input data; generating, based on the first annotation vector set, first guiding information by using a first guiding network model, the first guiding network model being configured to generate guiding information according to an annotation vector set of any image; and determining, by a decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- 2. The method according to claim 1, wherein the generating, based on the first annotation vector set, first guiding information by using a first guiding network model comprises: performing linear transformation on the first annotation vector set based on a first matrix formed by model parameters in the first guiding network model to obtain a second matrix; and determining the first guiding information based on a maximum value of each row in the second matrix.
- 3. The method according to claim 1, wherein the first guiding network model is configured to generate guiding information according to an annotation vector set and attribute information of any image, the attribute information being used to indicate probabilities of words predicted to appear in a description sentence of the image; and the generating, based on the first annotation vector set, first guiding information by using a first guiding network model comprises: using the target image as an input of a multi-example model, and processing the target image by using the multi-example model to obtain attribute information of the target image; performing linear transformation on the first annotation vector set based on a third matrix formed by the model parameters in the first guiding network model to obtain a fourth matrix; generating a fifth matrix based on the fourth matrix and the attribute information of the target image; and determining the first guiding information based on a maximum value of each row in the fifth matrix.
- 4. The method according to claim 1, wherein the determining, by a decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data comprises: decoding, by the decoder based on the first guiding information, the first annotation vector set and the first initial input data to obtain the description statement of the target image.
- 5. The method according to claim 4, wherein the decoding, by the decoder based on the first guiding information, the first annotation vector set and the first initial input data to obtain the description statement of the target image comprises: when the decoder adopts a first recurrent neural network (RNN) model and the first RNN model is used to perform M first timing steps, determining, for each first timing step performed by the first RNN model, input data of the first timing step based on the first guiding information, where M is the number of times the first RNN model cyclically processes input data, M is a positive integer, and each first timing step is a processing step performed by the first RNN model on input data; determining output data of the first timing step based on the input data of the first timing step, the first annotation vector set, and output data of a previous first timing step of the first timing step, where, when the first timing step is the first one of the M first timing steps, the output data of the previous first timing step of the first timing step is determined based on the first initial input data; and determining the description statement of the target image based on all output data of the M first timing steps.
- 6. The method according to claim 5, wherein the determining input data of the first timing step based on the first guiding information comprises: determining the input data of the first timing step based on the first guiding information by using the following formula: x_t = E·y_t + Q·v, where t denotes the first timing step, x_t is the input data of the first timing step, E is a word embedding matrix and is a model parameter of the first RNN model, y_t is a one-hot vector of a word corresponding to the first timing step, the word corresponding to the first timing step being determined based on the output data of the previous first timing step of the first timing step, Q is a sixth matrix and is a model parameter of the first RNN model, and v is the first guiding information.
- 7. The method according to any one of claims 1 to 6, wherein before the performing, by the encoder, feature extraction on the target image to obtain the feature vector and the first annotation vector set, the method further comprises: combining a first to-be-trained encoder, a first to-be-trained guiding network model, and a first to-be-trained decoder to obtain a first cascade network model; and training the first cascade network model by using a gradient descent method based on a plurality of sample images and description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, and the decoder.
- 8. The method according to claim 1, wherein the determining, by the decoder, the description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data comprises: determining, by a reviewer, a second annotation vector set and second initial input data based on the first guiding information, the first annotation vector set, and the first initial input data; generating, based on the second annotation vector set, second guiding information by using a second guiding network model, the second guiding network model being configured to generate guiding information according to an annotation vector set; and encoding, by the encoder based on the second guiding information, the second annotation vector set and the second initial input data to obtain the description statement of the target image.
- 9. The method according to claim 8, wherein the determining, by a reviewer, a second annotation vector set and second initial input data based on the first guiding information, the first annotation vector set, and the first initial input data comprises: when the first reviewer adopts a second RNN model and the second RNN model is used to perform N second timing steps, determining, for each second timing step performed by the second RNN model, input data of the second timing step based on the first guiding information, where N is the number of times the second RNN model cyclically processes input data, N is a positive integer, and each second timing step is a processing step performed by the second RNN model on input data; determining output data of the second timing step based on the input data of the second timing step, the first annotation vector set, and output data of a previous second timing step of the second timing step, where, when the second timing step is the first one of the N second timing steps, the output data of the previous second timing step of the second timing step is determined based on the first initial input data; determining the second initial input data based on output data of the last one of the N second timing steps; and determining the second annotation vector set based on all output data of the N second timing steps.
- 10. The method according to claim 8 or 9, wherein before the performing, by the encoder, feature extraction on the target image to obtain the feature vector and the first annotation vector set, the method further comprises: combining a second to-be-trained encoder, a second to-be-trained guiding network model, a to-be-trained reviewer, a third to-be-trained guiding network model, and a second to-be-trained decoder to obtain a second cascade network model; and training the second cascade network model by using a gradient descent method based on a plurality of sample images and description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder.
- 11. A terminal, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the instruction, the program, the code set, or the instruction set being loaded and executed by the processor to implement the following operations: performing, by an encoder, feature extraction on a target image to be identified to obtain a feature vector and a first annotation vector set; performing initialization processing on the feature vector to obtain first initial input data; generating, based on the first annotation vector set, first guiding information by using a first guiding network model, the first guiding network model being configured to generate guiding information according to an annotation vector set of any image; and determining, by a decoder, a description statement of the target image based on the first guiding information, the first annotation vector set, and the first initial input data.
- 12. The terminal according to claim 11, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: performing linear transformation on the first annotation vector set based on a first matrix formed by model parameters in the first guiding network model to obtain a second matrix; and determining the first guiding information based on a maximum value of each row in the second matrix.
- 13. The terminal according to claim 11, wherein the first guiding network model is configured to generate guiding information according to an annotation vector set and attribute information of any image, the attribute information being used to indicate probabilities of words predicted to appear in a description sentence of the image; and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: using the target image as an input of a multi-example model, and processing the target image by using the multi-example model to obtain attribute information of the target image; performing linear transformation on the first annotation vector set based on a third matrix formed by the model parameters in the first guiding network model to obtain a fourth matrix; generating a fifth matrix based on the fourth matrix and the attribute information of the target image; and determining the first guiding information based on a maximum value of each row in the fifth matrix.
- 14. The terminal according to claim 11, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operation: decoding, by the decoder based on the first guiding information, the first annotation vector set and the first initial input data to obtain the description statement of the target image.
- 15. The terminal according to claim 14, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: when the first reviewer adopts a second RNN model and the second RNN model is used to perform N second timing steps, determining, for each second timing step performed by the second RNN model, input data of the second timing step based on the first guiding information, where N is the number of times the second RNN model cyclically processes input data, N is a positive integer, and each second timing step is a processing step performed by the second RNN model on input data; determining output data of the second timing step based on the input data of the second timing step, the first annotation vector set, and output data of a previous second timing step of the second timing step, where, when the second timing step is the first one of the N second timing steps, the output data of the previous second timing step of the second timing step is determined based on the first initial input data; determining the second initial input data based on output data of the last one of the N second timing steps; and determining the second annotation vector set based on all output data of the N second timing steps.
- 16. The terminal according to any one of claims 11 to 15, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: combining a first to-be-trained encoder, a first to-be-trained guiding network model, and a first to-be-trained decoder to obtain a first cascade network model; and training the first cascade network model by using a gradient descent method based on a plurality of sample images and description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, and the decoder.
- 17. The terminal according to claim 11, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: determining, by a reviewer, a second annotation vector set and second initial input data based on the first guiding information, the first annotation vector set, and the first initial input data; generating, based on the second annotation vector set, second guiding information by using a second guiding network model, the second guiding network model being configured to generate guiding information according to an annotation vector set; and encoding, by the encoder based on the second guiding information, the second annotation vector set and the second initial input data to obtain the description statement of the target image.
- 18. The terminal according to claim 17, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: when the first reviewer adopts a second RNN model and the second RNN model is used to perform N second timing steps, determining, for each second timing step performed by the second RNN model, input data of the second timing step based on the first guiding information, where N is the number of times the second RNN model cyclically processes input data, N is a positive integer, and each second timing step is a processing step performed by the second RNN model on input data; determining output data of the second timing step based on the input data of the second timing step, the first annotation vector set, and output data of a previous second timing step of the second timing step, where, when the second timing step is the first one of the N second timing steps, the output data of the previous second timing step of the second timing step is determined based on the first initial input data; determining the second initial input data based on output data of the last one of the N second timing steps; and determining the second annotation vector set based on all output data of the N second timing steps.
- 19. The terminal according to claim 17 or 18, wherein the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the following operations: combining a second to-be-trained encoder, a second to-be-trained guiding network model, a to-be-trained reviewer, a third to-be-trained guiding network model, and a second to-be-trained decoder to obtain a second cascade network model; and training the second cascade network model by using a gradient descent method based on a plurality of sample images and description statements of the plurality of sample images, to obtain the encoder, the first guiding network model, the reviewer, the second guiding network model, and the decoder.
- 20. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, the instruction, the program, the code set, or the instruction set being loaded and executed by a processor to implement the image recognition method according to any one of claims 1 to 10.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020514506A JP6972319B2 (ja) | 2017-09-11 | 2018-09-11 | 画像認識方法、端末及び記憶媒体 |
KR1020197036824A KR102270394B1 (ko) | 2017-09-11 | 2018-09-11 | 이미지를 인식하기 위한 방법, 단말, 및 저장 매체 |
EP18853742.7A EP3611663A4 (en) | 2017-09-11 | 2018-09-11 | IMAGE RECOGNITION PROCESS, TERMINAL AND STORAGE MEDIA |
US16/552,738 US10956771B2 (en) | 2017-09-11 | 2019-08-27 | Image recognition method, terminal, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710814187.2 | 2017-09-11 | ||
CN201710814187.2A CN108304846B (zh) | 2017-09-11 | 2017-09-11 | 图像识别方法、装置及存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/552,738 Continuation US10956771B2 (en) | 2017-09-11 | 2019-08-27 | Image recognition method, terminal, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019047971A1 true WO2019047971A1 (zh) | 2019-03-14 |
Family
ID=62869573
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/105009 WO2019047971A1 (zh) | 2017-09-11 | 2018-09-11 | 图像识别方法、终端及存储介质 |
Country Status (6)
Country | Link |
---|---|
US (1) | US10956771B2 (zh) |
EP (1) | EP3611663A4 (zh) |
JP (1) | JP6972319B2 (zh) |
KR (1) | KR102270394B1 (zh) |
CN (2) | CN110490213B (zh) |
WO (1) | WO2019047971A1 (zh) |
Families Citing this family (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490213B (zh) * | 2017-09-11 | 2021-10-29 | 腾讯科技(深圳)有限公司 | 图像识别方法、装置及存储介质 |
CN109146156B (zh) * | 2018-08-03 | 2021-12-03 | 大连理工大学 | 一种用于预测充电桩系统充电量的方法 |
JP7415922B2 (ja) * | 2018-10-19 | 2024-01-17 | ソニーグループ株式会社 | 情報処理方法、情報処理装置及び情報処理プログラム |
CN109559576B (zh) * | 2018-11-16 | 2020-07-28 | 中南大学 | 一种儿童伴学机器人及其早教系统自学习方法 |
CN109495214B (zh) * | 2018-11-26 | 2020-03-24 | 电子科技大学 | 基于一维Inception结构的信道编码类型识别方法 |
CN109902852A (zh) * | 2018-11-28 | 2019-06-18 | 北京三快在线科技有限公司 | 商品组合方法、装置、电子设备及可读存储介质 |
US10726062B2 (en) * | 2018-11-30 | 2020-07-28 | Sony Interactive Entertainment Inc. | System and method for converting image data into a natural language description |
CN109670548B (zh) * | 2018-12-20 | 2023-01-06 | 电子科技大学 | 基于改进lstm-cnn的多尺寸输入har算法 |
CN109711546B (zh) * | 2018-12-21 | 2021-04-06 | 深圳市商汤科技有限公司 | 神经网络训练方法及装置、电子设备和存储介质 |
CN111476838A (zh) * | 2019-01-23 | 2020-07-31 | 华为技术有限公司 | 图像分析方法以及系统 |
CN110009018B (zh) * | 2019-03-25 | 2023-04-18 | 腾讯科技(深圳)有限公司 | 一种图像生成方法、装置以及相关设备 |
CN110222840B (zh) * | 2019-05-17 | 2023-05-05 | 中山大学 | 一种基于注意力机制的集群资源预测方法和装置 |
CN110427870B (zh) * | 2019-06-10 | 2024-06-18 | 腾讯医疗健康(深圳)有限公司 | 眼部图片识别方法、目标识别模型训练方法及装置 |
CN110478204A (zh) * | 2019-07-25 | 2019-11-22 | 李高轩 | 一种结合图像识别的导盲眼镜及其构成的导盲系统 |
CN110517759B (zh) * | 2019-08-29 | 2022-03-25 | 腾讯医疗健康(深圳)有限公司 | 一种待标注图像确定的方法、模型训练的方法及装置 |
CN111275110B (zh) * | 2020-01-20 | 2023-06-09 | 北京百度网讯科技有限公司 | 图像描述的方法、装置、电子设备及存储介质 |
CN111310647A (zh) * | 2020-02-12 | 2020-06-19 | 北京云住养科技有限公司 | 自动识别跌倒模型的生成方法和装置 |
US11093794B1 (en) * | 2020-02-13 | 2021-08-17 | United States Of America As Represented By The Secretary Of The Navy | Noise-driven coupled dynamic pattern recognition device for low power applications |
CN111753825A (zh) | 2020-03-27 | 2020-10-09 | 北京京东尚科信息技术有限公司 | 图像描述生成方法、装置、系统、介质及电子设备 |
EP3916633A1 (de) * | 2020-05-25 | 2021-12-01 | Sick Ag | Kamera und verfahren zum verarbeiten von bilddaten |
CN111723729B (zh) * | 2020-06-18 | 2022-08-05 | 四川千图禾科技有限公司 | 基于知识图谱的监控视频犬类姿态和行为智能识别方法 |
US11455146B2 (en) * | 2020-06-22 | 2022-09-27 | Bank Of America Corporation | Generating a pseudo-code from a text summarization based on a convolutional neural network |
CN111767727B (zh) * | 2020-06-24 | 2024-02-06 | 北京奇艺世纪科技有限公司 | 数据处理方法及装置 |
WO2022006621A1 (en) * | 2020-07-06 | 2022-01-13 | Harrison-Ai Pty Ltd | Method and system for automated generation of text captions from medical images |
CN112016400B (zh) * | 2020-08-04 | 2021-06-29 | 香港理工大学深圳研究院 | 一种基于深度学习的单类目标检测方法、设备及存储介质 |
CN112614175B (zh) * | 2020-12-21 | 2024-09-06 | 滕州市东大矿业有限责任公司 | 基于特征去相关的用于封孔剂注射器的注射参数确定方法 |
CN112800247B (zh) * | 2021-04-09 | 2021-06-18 | 华中科技大学 | 基于知识图谱共享的语义编/解码方法、设备和通信系统 |
CN113205051B (zh) * | 2021-05-10 | 2022-01-25 | 中国科学院空天信息创新研究院 | 基于高空间分辨率遥感影像的储油罐提取方法 |
CN113569868B (zh) * | 2021-06-11 | 2023-09-19 | 北京旷视科技有限公司 | 一种目标检测方法、装置及电子设备 |
CN113486868B (zh) * | 2021-09-07 | 2022-02-11 | 中南大学 | 一种电机故障诊断方法及系统 |
CN113743517A (zh) * | 2021-09-08 | 2021-12-03 | Oppo广东移动通信有限公司 | 模型训练方法、图像深度预测方法及装置、设备、介质 |
CN114821560B (zh) * | 2022-04-11 | 2024-08-02 | 深圳市星桐科技有限公司 | 文本识别方法和装置 |
CN116167990B (zh) * | 2023-01-28 | 2024-06-25 | 阿里巴巴(中国)有限公司 | 基于图像的目标识别、神经网络模型处理方法 |
Family Cites Families (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9743078B2 (en) * | 2004-07-30 | 2017-08-22 | Euclid Discoveries, Llc | Standards-compliant model-based video encoding and decoding |
RU2461977C2 (ru) * | 2006-12-18 | 2012-09-20 | Конинклейке Филипс Электроникс Н.В. | Сжатие и снятие сжатия изображения |
US8254444B2 (en) * | 2007-05-14 | 2012-08-28 | Samsung Electronics Co., Ltd. | System and method for phase adaptive occlusion detection based on motion vector field in digital video |
JPWO2009110160A1 (ja) * | 2008-03-07 | 2011-07-14 | 株式会社東芝 | 動画像符号化/復号化方法及び装置 |
CN102577393B (zh) * | 2009-10-20 | 2015-03-25 | 夏普株式会社 | 运动图像编码装置、运动图像解码装置、运动图像编码/解码系统、运动图像编码方法及运动图像解码方法 |
US9369718B2 (en) * | 2009-10-30 | 2016-06-14 | Sun Patent Trust | Decoding method, decoding apparatus, coding method, and coding apparatus using a quantization matrix |
US9582431B2 (en) * | 2010-03-22 | 2017-02-28 | Seagate Technology Llc | Storage address space to NVM address, span, and length mapping/converting |
KR101420957B1 (ko) * | 2010-03-31 | 2014-07-30 | 미쓰비시덴키 가부시키가이샤 | 화상 부호화 장치, 화상 복호 장치, 화상 부호화 방법 및 화상 복호 방법 |
JP2012253482A (ja) * | 2011-06-01 | 2012-12-20 | Sony Corp | 画像処理装置および方法、記録媒体、並びにプログラム |
US8918320B2 (en) * | 2012-01-03 | 2014-12-23 | Nokia Corporation | Methods, apparatuses and computer program products for joint use of speech and text-based features for sentiment detection |
EP2842106B1 (en) * | 2012-04-23 | 2019-11-13 | Telecom Italia S.p.A. | Method and system for image analysis |
US9183460B2 (en) * | 2012-11-30 | 2015-11-10 | Google Inc. | Detecting modified images |
CN102982799A (zh) * | 2012-12-20 | 2013-03-20 | 中国科学院自动化研究所 | 一种融合引导概率的语音识别优化解码方法 |
US9349072B2 (en) * | 2013-03-11 | 2016-05-24 | Microsoft Technology Licensing, Llc | Local feature based image compression |
CN104918046B (zh) * | 2014-03-13 | 2019-11-05 | 中兴通讯股份有限公司 | 一种局部描述子压缩方法和装置 |
US10909329B2 (en) | 2015-05-21 | 2021-02-02 | Baidu Usa Llc | Multilingual image question answering |
CN105139385B (zh) * | 2015-08-12 | 2018-04-17 | 西安电子科技大学 | 基于深层自动编码器重构的图像视觉显著性区域检测方法 |
ITUB20153724A1 (it) * | 2015-09-18 | 2017-03-18 | Sisvel Tech S R L | Metodi e apparati per codificare e decodificare immagini o flussi video digitali |
US10423874B2 (en) * | 2015-10-02 | 2019-09-24 | Baidu Usa Llc | Intelligent image captioning |
US10402697B2 (en) * | 2016-08-01 | 2019-09-03 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
CN106548145A (zh) * | 2016-10-31 | 2017-03-29 | 北京小米移动软件有限公司 | 图像识别方法及装置 |
IT201600122898A1 (it) * | 2016-12-02 | 2018-06-02 | Ecole Polytechnique Fed Lausanne Epfl | Metodi e apparati per codificare e decodificare immagini o flussi video digitali |
US10783393B2 (en) * | 2017-06-20 | 2020-09-22 | Nvidia Corporation | Semi-supervised learning for landmark localization |
US11966839B2 (en) * | 2017-10-25 | 2024-04-23 | Deepmind Technologies Limited | Auto-regressive neural network systems with a soft attention mechanism using support data patches |
KR102174777B1 (ko) * | 2018-01-23 | 2020-11-06 | 주식회사 날비컴퍼니 | 이미지의 품질 향상을 위하여 이미지를 처리하는 방법 및 장치 |
CN110072142B (zh) * | 2018-01-24 | 2020-06-02 | 腾讯科技(深圳)有限公司 | 视频描述生成方法、装置、视频播放方法、装置和存储介质 |
US10671855B2 (en) * | 2018-04-10 | 2020-06-02 | Adobe Inc. | Video object segmentation by reference-guided mask propagation |
US10824909B2 (en) * | 2018-05-15 | 2020-11-03 | Toyota Research Institute, Inc. | Systems and methods for conditional image translation |
CN110163048B (zh) * | 2018-07-10 | 2023-06-02 | 腾讯科技(深圳)有限公司 | 手部关键点的识别模型训练方法、识别方法及设备 |
US20200104940A1 (en) * | 2018-10-01 | 2020-04-02 | Ramanathan Krishnan | Artificial intelligence enabled assessment of damage to automobiles |
- 2017
- 2017-09-11 CN CN201910848729.7A patent/CN110490213B/zh active Active
- 2017-09-11 CN CN201710814187.2A patent/CN108304846B/zh active Active
- 2018
- 2018-09-11 JP JP2020514506A patent/JP6972319B2/ja active Active
- 2018-09-11 WO PCT/CN2018/105009 patent/WO2019047971A1/zh unknown
- 2018-09-11 EP EP18853742.7A patent/EP3611663A4/en active Pending
- 2018-09-11 KR KR1020197036824A patent/KR102270394B1/ko active IP Right Grant
- 2019
- 2019-08-27 US US16/552,738 patent/US10956771B2/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8165354B1 (en) * | 2008-03-18 | 2012-04-24 | Google Inc. | Face recognition with discriminative face alignment |
CN106446782A (zh) * | 2016-08-29 | 2017-02-22 | 北京小米移动软件有限公司 | 图像识别方法及装置 |
CN106845411A (zh) * | 2017-01-19 | 2017-06-13 | 清华大学 | 一种基于深度学习和概率图模型的视频描述生成方法 |
CN107038221A (zh) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | 一种基于语义信息引导的视频内容描述方法 |
CN108304846A (zh) * | 2017-09-11 | 2018-07-20 | 腾讯科技(深圳)有限公司 | 图像识别方法、装置及存储介质 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102134893B1 (ko) * | 2019-11-07 | 2020-07-16 | 국방과학연구소 | 사전 압축된 텍스트 데이터의 압축 방식을 식별하는 시스템 및 방법 |
CN112785494A (zh) * | 2021-01-26 | 2021-05-11 | 网易(杭州)网络有限公司 | 一种三维模型构建方法、装置、电子设备和存储介质 |
CN112785494B (zh) * | 2021-01-26 | 2023-06-16 | 网易(杭州)网络有限公司 | 一种三维模型构建方法、装置、电子设备和存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN110490213B (zh) | 2021-10-29 |
CN108304846A (zh) | 2018-07-20 |
JP2020533696A (ja) | 2020-11-19 |
CN110490213A (zh) | 2019-11-22 |
US20190385004A1 (en) | 2019-12-19 |
KR102270394B1 (ko) | 2021-06-30 |
US10956771B2 (en) | 2021-03-23 |
EP3611663A1 (en) | 2020-02-19 |
EP3611663A4 (en) | 2020-12-23 |
KR20200007022A (ko) | 2020-01-21 |
JP6972319B2 (ja) | 2021-11-24 |
CN108304846B (zh) | 2021-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019047971A1 (zh) | 图像识别方法、终端及存储介质 | |
CN110599557B (zh) | 图像描述生成方法、模型训练方法、设备和存储介质 | |
KR102646667B1 (ko) | 이미지 영역을 찾기 위한 방법, 모델 훈련 방법 및 관련 장치 | |
US11416681B2 (en) | Method and apparatus for determining a reply statement to a statement based on a sum of a probability of the reply statement being output in response to the statement and a second probability in which the statement is output in response to the statement and further based on a terminator | |
CN110472251B (zh) | 翻译模型训练的方法、语句翻译的方法、设备及存储介质 | |
CN110334360B (zh) | 机器翻译方法及装置、电子设备及存储介质 | |
WO2020103721A1 (zh) | 信息处理的方法、装置及存储介质 | |
KR20190130636A (ko) | 기계번역 방법, 장치, 컴퓨터 기기 및 기억매체 | |
CN110570840B (zh) | 一种基于人工智能的智能设备唤醒方法和装置 | |
CN110890093A (zh) | 一种基于人工智能的智能设备唤醒方法和装置 | |
WO2020147369A1 (zh) | 自然语言处理方法、训练方法及数据处理设备 | |
CN112820299B (zh) | 声纹识别模型训练方法、装置及相关设备 | |
CN111539212A (zh) | 文本信息处理方法、装置、存储介质及电子设备 | |
CN113821589B (zh) | 一种文本标签的确定方法及装置、计算机设备和存储介质 | |
CN111597804B (zh) | 一种实体识别模型训练的方法以及相关装置 | |
CN112214605A (zh) | 一种文本分类方法和相关装置 | |
CN109543014B (zh) | 人机对话方法、装置、终端及服务器 | |
CN113761122A (zh) | 一种事件抽取方法、相关装置、设备及存储介质 | |
CN111723783B (zh) | 一种内容识别方法和相关装置 | |
CN113569043A (zh) | 一种文本类别确定方法和相关装置 | |
CN113505596A (zh) | 话题切换标记方法、装置及计算机设备 | |
CN113806532B (zh) | 比喻句式判断模型的训练方法、装置、介质及设备 | |
CN117057345B (zh) | 一种角色关系的获取方法及相关产品 | |
CN116959407A (zh) | 一种读音预测方法、装置及相关产品 | |
CN113590832A (zh) | 一种基于位置信息的文本识别方法以及相关装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18853742 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2018853742 Country of ref document: EP Effective date: 20191112 |
|
ENP | Entry into the national phase |
Ref document number: 20197036824 Country of ref document: KR Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2020514506 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |