CN111079849A - Method for constructing new target network model for voice-assisted audio-visual collaborative learning - Google Patents

Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Info

Publication number
CN111079849A
Authority
CN
China
Prior art keywords
image
feature
new
model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911334785.5A
Other languages
Chinese (zh)
Inventor
苟先太
康立烨
钱照国
张葛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN201911334785.5A priority Critical patent/CN111079849A/en
Publication of CN111079849A publication Critical patent/CN111079849A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets, comprising steps S1 to S11. The method builds on a conventional object recognition model and on image feature matching technology: known objects are recognized accurately by an initial object recognition model, and when a new object appears its features are memorized by an online learning model and the initial object recognition model is updated in real time, so that the model generalizes better and is more suitable for real-scene applications.

Description

Method for constructing new target network model for voice-assisted audio-visual collaborative learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets.
Background
With the rapid development of computer vision, object recognition technology has been applied in many fields and brings great economic benefit. In recent years numerous object recognition network models have appeared and their recognition accuracy has kept improving, but they share a common drawback: an image data set must be prepared in advance, the model must be trained on that existing data set, and an object detector must be generated from it. In practical applications there are many kinds of objects, and much of the image data is either not collected or difficult to obtain. In some scenarios it is not even known in advance which categories of image data should be prepared, which makes conventional network models hard to apply to real scenes. Image feature matching technology can match two images and has strong application value when training data is insufficient; although its generalization ability is weak, it can be applied well in some specific scenarios.
A good object recognition model is similar to a human: it has both autonomous learning and guided learning abilities, can accurately recognize objects it has already learned, can memorize and learn new objects under human guidance, and continuously updates its knowledge reserve so that it becomes more intelligent. Against this background, the invention provides a voice-assisted audio-visual collaborative learning network model for new targets, which can learn new targets online. It has important application value in specific scenarios (such as home robots and inspection robots) and can promote the development of this field.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets, which solves the problem that existing network models cannot learn a new target online.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows:
The voice-assisted method for constructing an audio-visual collaborative learning network model for new targets comprises the following steps:
s1: building an original object classifier M1 for original object recognition and an object feature extraction model M2 for extracting feature vectors of objects;
s2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
s5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
s6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
s9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if yes, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: otherwise, repeating steps S3-S9 until the data set amount of the new object reaches the data set amount N required by the training.
Further, the method for constructing the original object classifier M1 for original object recognition includes:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: creating a residual convolutional neural network ResNet to extract the image features features-maps of the images in the training image set images-input1, wherein the residual convolutional neural network ResNet consists of the convolutional layers conv1, the relu1 layers and the pooling layers pooling1;
a13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
Further, the method for building the object feature extraction model M2 for extracting the feature vector of the object includes:
b11: preparing image data Data1 containing several object categories as the training data set images-input2;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2;
b13: pre-training a feature extraction network model con-model and loading the training data set images-input2, wherein the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
B14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
Further, the feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
Further, the residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
Further, the feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication).
The function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
Further, the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication),
where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
The invention has the following beneficial effects: the method builds on a conventional object recognition model and on image feature matching technology. Known objects are recognized accurately by the initial object recognition model; when a new object appears, its features are memorized by an online learning model and the initial object recognition model is updated in real time, so the model generalizes better and is more suitable for real-scene applications.
In some scenes most objects are relatively fixed, so recognition can be achieved by feature memory alone, and the initial object recognition model can be updated through continuous memory learning so that more kinds of objects can be recognized. Applying this network model to scenes that require object recognition makes the model more intelligent. Compared with conventional object recognition models it has higher application value, and it can promote the development of the object recognition field, which gives it important research significance.
Drawings
FIG. 1 is a flow chart of a method for constructing a new target network model for voice-assisted audio-visual collaborative learning.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, as long as the changes fall within the spirit and scope of the invention as defined by the appended claims, all inventions and creations that make use of the inventive concept are protected.
As shown in fig. 1, the method for constructing a new target network model for audio-visual collaborative learning assisted by voice comprises the following steps:
s1: building an original object classifier M1 for identifying an original object and an object feature extraction model M2 for extracting feature vectors of the object;
the method for constructing the original object classifier M1 for original object recognition comprises the following steps:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: a residual convolutional neural network ResNet consisting of convolutional layer conv1, relu1 layers, and pooling layer pooling1 was created to extract the image feature features-maps of the images in the training image set images-input 1.
The residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
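For readers who want to experiment, the backbone described above (49 convolutional layers plus pooling, i.e. a ResNet-50-style network without its classification head) can be sketched as follows. This is a minimal illustration, not the patent's exact network; the use of torchvision's resnet50 and the input size are assumptions.

```python
# Minimal sketch of a ResNet-style backbone used as the feature extractor (assumption:
# torchvision's ResNet-50, which has 49 convolutional layers plus one FC layer).
import torch
import torch.nn as nn
from torchvision import models

class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)            # no pretrained weights downloaded
        # Keep conv1 ... layer4 and drop the average pooling and FC head.
        self.body = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        return self.body(x)                               # the feature maps "features-maps"

if __name__ == "__main__":
    backbone = ResNetBackbone()
    feats = backbone(torch.randn(1, 3, 224, 224))
    print(feats.shape)                                    # torch.Size([1, 2048, 7, 7])
```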
Step a12 includes:
a121: using image data Data1 containing several object categories as the training data set images-input2;
a122: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2.
A123: pre-training the feature extraction network model con-model and loading the training data set images-input2; the feature extraction network model con-model is composed of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
The feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
A124: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
A13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
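The pipeline of steps A12 to A15 (backbone, RPN, fixed-size feature regions, Softmax classification with a loss) corresponds to a Faster R-CNN-style detector. The sketch below uses torchvision's reference Faster R-CNN implementation as a stand-in for the classifier M1; the model choice, the class count and the input size are assumptions, not the patent's exact architecture.

```python
# Hedged sketch: a Faster R-CNN-style classifier M1 (backbone + RPN + RoI head with
# Softmax classification), using torchvision's reference implementation as an analogue.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_m1(num_classes: int):
    model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Swap in a box predictor sized for our own object categories (plus background).
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

if __name__ == "__main__":
    m1 = build_m1(num_classes=5)                          # e.g. 4 original classes + background
    m1.eval()
    detections = m1([torch.randn(3, 480, 640)])
    print(detections[0].keys())                           # boxes, labels, scores
```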
The method for building the object feature extraction model M2 for extracting the feature vector of the object comprises the following steps:
b11: generating the training data set images-input2 from image data Data1 containing several object categories;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2.
B13: pre-training the feature extraction network model con-model and loading the training data set images-input2; the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
The feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
B14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
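A minimal sketch of the con-model layer configuration described above (16 convolutional layers with 3x3 kernels, padding 1 and stride 1, 5 max-pooling layers of size 2x2 with stride 2, and three fully connected layers with dropout) is given below. The channel widths follow a VGG-style layout and, like the exact placement of the relu layers, are assumptions rather than values fixed by the patent.

```python
# Sketch of the con-model feature extraction network (assumed VGG-style channel widths).
import torch.nn as nn

def _block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # 2x2 max pooling, stride 2
    return layers

class ConModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 4), (256, 512, 4), (512, 512, 4)]
        self.features = nn.Sequential(*[l for c in cfg for l in _block(*c)])  # 16 conv layers
        self.classifier = nn.Sequential(                    # three FC layers, dropout on each
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes), nn.Dropout(0.5),
        )

    def forward(self, x):                                   # x: (N, 3, 224, 224)
        return self.classifier(self.features(x))
```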
S2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
The feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication).
The function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
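The two helper operations can be written down directly from the description above. How A, B and C are actually combined into S1 is given only as an image in the original publication, so the assembly shown below (zero-padding the narrower matrices and stacking them) is merely an illustrative assumption, as are the toy matrix sizes.

```python
# MatToVec and Pad as described in the text, plus a *hypothetical* assembly of S1.
import numpy as np

def mat_to_vec(t: np.ndarray) -> np.ndarray:
    """MatToVec(T): splice the rows of matrix T into one one-dimensional vector."""
    return t.reshape(-1)

def pad(n: int) -> np.ndarray:
    """Pad(n): zero-padding operation, a run of n zeros."""
    return np.zeros(n)

# Feature matrices taken after conv3-4, conv4-4 and conv5-4 (sizes n x m, i x j, p x q).
A = np.random.rand(8, 8)   # n = m = 8 (toy sizes)
B = np.random.rand(4, 4)   # i = n/2, j = m/2
C = np.random.rand(2, 2)   # p = i/2, q = j/2

# Hypothetical assembly of S1: pad the rows of B and C with zeros to the width of A
# and stack everything into one matrix (the patent's exact formula is not reproduced).
S1 = np.vstack([
    A,
    np.hstack([B, np.tile(pad(A.shape[1] - B.shape[1]), (B.shape[0], 1))]),
    np.hstack([C, np.tile(pad(A.shape[1] - C.shape[1]), (C.shape[0], 1))]),
])
R1 = mat_to_vec(S1)        # R1 = MatToVec(S1), one element of the feature vector set R
```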
S5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication),
where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
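Because the matching-degree formula and the expression for most-value appear only as images in the original publication, the following sketch is one plausible reading and should be treated entirely as an assumption: a weighted combination (with α + β + γ = 1) of similarities between three per-layer segments of two length-L feature vectors, with most-value taken as the maximum matching degree over all pairs drawn from R and Q.

```python
# Assumed reading of the matching degree and the highest confidence most-value.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def matching_degree(m, n, bounds, alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted similarity of two length-L vectors split at hypothetical layer boundaries."""
    a, b = bounds                       # end of the conv3-4 part, end of the conv4-4 part
    return (alpha * cosine(m[:a], n[:a])
            + beta * cosine(m[a:b], n[a:b])
            + gamma * cosine(m[b:], n[b:]))

def most_value(R, Q, bounds):
    """Highest confidence: the maximum matching degree over all pairs drawn from R and Q."""
    return max(matching_degree(m, n, bounds) for m in Q for n in R)
```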
S6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
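A small sketch of the augmentation step S7: it produces image-1 … image-n from the voice-tagged image and stores them in the new-object image library B2. The particular transforms (flip, crop, colour jitter), the count n and the directory layout are assumptions rather than choices fixed by the patent.

```python
# Generate n augmented copies of the voice-tagged image and save them into B2.
from pathlib import Path
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def augment_to_b2(image_path: str, label: str, b2_dir: str = "B2", n: int = 20):
    b2 = Path(b2_dir) / label
    b2.mkdir(parents=True, exist_ok=True)
    img = Image.open(image_path).convert("RGB")
    for k in range(1, n + 1):
        augment(img).save(b2 / f"image-{k}.jpg")          # augmented images image-1 .. image-n
```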
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
the feature vector feature is obtained without passing through the autonomous RPN network model RPN-model in the feature extraction model M2, because the new object is already tagged with features by speech assistance and no object region needs to be extracted again. And directly inputting the image with the characteristic label into the characteristic extraction network model con-model in the characteristic extraction model M2 to extract the characteristic vector feature.
S9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if so, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: if not, repeating the steps S3-S9 until the data set quantity of the new object reaches the data set quantity N required by the training.
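Putting steps S3 to S11 together, the online-learning loop can be summarized by the following pseudocode sketch. All helper names (recognize, extract_features, most_value, ask_voice_label, augment_to_b2, retrain) and the container interfaces for B1 and B2 are hypothetical placeholders standing in for the components M1, M2, B1 and B2 described above.

```python
# Hypothetical sketch of the overall online-learning loop (steps S3-S11).
def process_frame(image, M1, M2, B1, B2, base_value=0.8, N=200):
    unidentified = [obj for obj in M1.recognize(image) if obj.label is None]   # S3/S4
    if not unidentified:
        return M1
    for obj in unidentified:
        R = M2.extract_features(obj)                                           # S4
        if most_value(R, B1.vectors, bounds=(64, 80)) > base_value:            # S5
            continue                                   # already memorised: treated as recognised
        label = ask_voice_label(obj)                   # S6: voice-assisted tagging
        augment_to_b2(obj.crop, label, n=20)           # S7: augmentation into B2
        B1.add(label, M2.extract_features(obj))        # S8: memorise the feature vector
        if B2.count(label) >= N:                       # S9
            M1 = retrain(M1, B2.dataset(label))        # S10: train and swap in a new classifier
            B2.delete(label)
    return M1                                          # S11: otherwise keep looping over frames
```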
The method builds on a conventional object recognition model and on image feature matching technology. Known objects are recognized accurately by the initial object recognition model; when a new object appears, its features are memorized by an online learning model and the initial object recognition model is updated in real time, so the model generalizes better and is more suitable for real-scene applications.
In some scenes most objects are relatively fixed, so recognition can be achieved by feature memory alone, and the initial object recognition model can be updated through continuous memory learning so that more kinds of objects can be recognized. Applying this network model to scenes that require object recognition makes the model more intelligent. Compared with conventional object recognition models it has higher application value, and it can promote the development of the object recognition field, which gives it important research significance.

Claims (7)

1. A method for constructing a new audio-visual collaborative learning target network model assisted by voice is characterized by comprising the following steps:
s1: building an original object classifier M1 for original object recognition and an object feature extraction model M2 for extracting feature vectors of objects;
s2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
s5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
s6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
s9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if yes, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: otherwise, repeating steps S3-S9 until the data set amount of the new object reaches the data set amount N required by the training.
2. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the method for constructing an original object classifier M1 for original object recognition comprises the following steps:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: creating a residual convolutional neural network ResNet to extract the image features features-maps of the images in the training image set images-input1, wherein the residual convolutional neural network ResNet consists of the convolutional layers conv1, the relu1 layers and the pooling layers pooling1;
a13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
3. The method for constructing the new target network model for the voice-assisted audio-visual collaborative learning according to claim 2, wherein the method for constructing the object feature extraction model M2 for extracting the feature vectors of the object comprises the following steps:
b11: preparing image data Data1 containing several object categories as the training data set images-input2;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2;
b13: pre-training a feature extraction network model con-model and loading the training data set images-input2, wherein the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC;
b14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
4. The method as claimed in claim 3, wherein the feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
5. The method of claim 2, wherein the residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; and the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
6. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication);
the function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
7. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication), where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
CN201911334785.5A 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning Pending CN111079849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911334785.5A CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911334785.5A CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Publications (1)

Publication Number Publication Date
CN111079849A true CN111079849A (en) 2020-04-28

Family

ID=70316831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911334785.5A Pending CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Country Status (1)

Country Link
CN (1) CN111079849A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506407A (en) * 2017-08-07 2017-12-22 深圳市大迈科技有限公司 A kind of document classification, the method and device called
CN108009591A (en) * 2017-12-14 2018-05-08 西南交通大学 A kind of contact network key component identification method based on deep learning
CN108875455A (en) * 2017-05-11 2018-11-23 Tcl集团股份有限公司 A kind of unsupervised face intelligence precise recognition method and system
CN109063594A (en) * 2018-07-13 2018-12-21 吉林大学 Remote sensing images fast target detection method based on YOLOv2

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875455A (en) * 2017-05-11 2018-11-23 Tcl集团股份有限公司 A kind of unsupervised face intelligence precise recognition method and system
CN107506407A (en) * 2017-08-07 2017-12-22 深圳市大迈科技有限公司 A kind of document classification, the method and device called
CN108009591A (en) * 2017-12-14 2018-05-08 西南交通大学 A kind of contact network key component identification method based on deep learning
CN109063594A (en) * 2018-07-13 2018-12-21 吉林大学 Remote sensing images fast target detection method based on YOLOv2

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIMING HE et al.: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
SHAOQING REN et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence *

Similar Documents

Publication Publication Date Title
Kao et al. Visual aesthetic quality assessment with a regression model
CN109740413A (en) Pedestrian recognition methods, device, computer equipment and computer storage medium again
US20170032222A1 (en) Cross-trained convolutional neural networks using multimodal images
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN111461212A (en) Compression method for point cloud target detection model
CN112347284B (en) Combined trademark image retrieval method
CN108133235B (en) Pedestrian detection method based on neural network multi-scale feature map
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN111368766A (en) Cattle face detection and identification method based on deep learning
CN111222487A (en) Video target behavior identification method and electronic equipment
CN108921850B (en) Image local feature extraction method based on image segmentation technology
CN109034121B (en) Commodity identification processing method, device, equipment and computer storage medium
CN112200031A (en) Network model training method and equipment for generating image corresponding word description
CN111340051A (en) Picture processing method and device and storage medium
CN113989604A (en) Tire DOT information identification method based on end-to-end deep learning
CN113963026A (en) Target tracking method and system based on non-local feature fusion and online updating
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Peng et al. Document image quality assessment using discriminative sparse representation
CN117115614B (en) Object identification method, device, equipment and storage medium for outdoor image
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
Timotheatos et al. Vision based horizon detection for UAV navigation
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Abdelaziz et al. Few-shot learning with saliency maps as additional visual information
Jiafa et al. A scene recognition algorithm based on deep residual network
CN116740413A (en) Deep sea biological target detection method based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-04-28)