CN111079849A - Method for constructing new target network model for voice-assisted audio-visual collaborative learning - Google Patents

Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Info

Publication number
CN111079849A
Authority
CN
China
Prior art keywords
image
feature
new
model
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911334785.5A
Other languages
Chinese (zh)
Inventor
苟先太
康立烨
钱照国
张葛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN201911334785.5A priority Critical patent/CN111079849A/en
Publication of CN111079849A publication Critical patent/CN111079849A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets, comprising steps S1 to S11. The method builds on a conventional object recognition model and on image feature matching technology: known objects are recognized accurately by an initial object recognition model, and when a new object appears its features are memorized by an online learning model and the initial object recognition model is updated in real time, so that the model generalizes better and is more suitable for real-scene applications.

Description

Method for constructing new target network model for voice-assisted audio-visual collaborative learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets.
Background
With the rapid development of computer vision, object recognition technology has been applied in many fields and brings great economic benefit. In recent years numerous object recognition network models have appeared and their recognition accuracy has kept improving, but they share a common drawback: an image data set must be prepared in advance, the model must be trained on that existing data set, and an object detector must be generated from it. In practical applications there are many kinds of objects, and much of the image data is either not collected or difficult to obtain. In some scenarios it is not even known in advance which categories of image data should be prepared, which makes conventional network models hard to apply to real scenes. Image feature matching technology can match two images and has strong application value when training data is insufficient; although its generalization ability is weak, it can be applied well in some specific scenarios.
A good object recognition model is similar to a human: it has both autonomous learning and guided learning abilities, can accurately recognize objects it has already learned, can memorize and learn new objects under human guidance, and continuously updates its knowledge reserve so that it becomes more intelligent. Against this background, the invention provides a voice-assisted audio-visual collaborative learning network model for new targets, which can learn new targets online. It has important application value in specific scenarios (such as home robots and inspection robots) and can promote the development of this field.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides a method for constructing a voice-assisted audio-visual collaborative learning network model for new targets, which solves the problem that existing network models cannot learn a new target online.
In order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows:
The voice-assisted method for constructing an audio-visual collaborative learning network model for new targets comprises the following steps:
s1: building an original object classifier M1 for original object recognition and an object feature extraction model M2 for extracting feature vectors of objects;
s2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
s5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
s6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
s9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if yes, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: otherwise, repeating steps S3-S9 until the data set amount of the new object reaches the data set amount N required by the training.
Further, the method for constructing the original object classifier M1 for original object recognition includes:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: creating a residual convolutional neural network ResNet to extract the image features features-maps of the images in the training image set images-input1, wherein the residual convolutional neural network ResNet consists of the convolutional layers conv1, the relu1 layers and the pooling layers pooling1;
a13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
Further, the method for building the object feature extraction model M2 for extracting the feature vector of the object includes:
b11: preparing image data Data1 containing several object categories as the training data set images-input2;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2;
b13: pre-training a feature extraction network model con-model and loading the training data set images-input2, wherein the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
B14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
Further, the feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
Further, the residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
Further, the feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication).
The function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
Further, the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication),
where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
The invention has the following beneficial effects: the method builds on a conventional object recognition model and on image feature matching technology. Known objects are recognized accurately by the initial object recognition model; when a new object appears, its features are memorized by an online learning model and the initial object recognition model is updated in real time, so the model generalizes better and is more suitable for real-scene applications.
In some scenes most objects are relatively fixed, so recognition can be achieved by feature memory alone, and the initial object recognition model can be updated through continuous memory learning so that more kinds of objects can be recognized. Applying this network model to scenes that require object recognition makes the model more intelligent. Compared with conventional object recognition models it has higher application value, and it can promote the development of the object recognition field, which gives it important research significance.
Drawings
FIG. 1 is a flow chart of a method for constructing a new target network model for voice-assisted audio-visual collaborative learning.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, as long as the changes fall within the spirit and scope of the invention as defined by the appended claims, all inventions and creations that make use of the inventive concept are protected.
As shown in fig. 1, the method for constructing a new target network model for audio-visual collaborative learning assisted by voice comprises the following steps:
s1: building an original object classifier M1 for identifying an original object and an object feature extraction model M2 for extracting feature vectors of the object;
the method for constructing the original object classifier M1 for original object recognition comprises the following steps:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: a residual convolutional neural network ResNet consisting of convolutional layer conv1, relu1 layers, and pooling layer pooling1 was created to extract the image feature features-maps of the images in the training image set images-input 1.
The residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
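For readers who want to experiment, the backbone described above (49 convolutional layers plus pooling, i.e. a ResNet-50-style network without its classification head) can be sketched as follows. This is a minimal illustration, not the patent's exact network; the use of torchvision's resnet50 and the input size are assumptions.

```python
# Minimal sketch of a ResNet-style backbone used as the feature extractor (assumption:
# torchvision's ResNet-50, which has 49 convolutional layers plus one FC layer).
import torch
import torch.nn as nn
from torchvision import models

class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)            # no pretrained weights downloaded
        # Keep conv1 ... layer4 and drop the average pooling and FC head.
        self.body = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        return self.body(x)                               # the feature maps "features-maps"

if __name__ == "__main__":
    backbone = ResNetBackbone()
    feats = backbone(torch.randn(1, 3, 224, 224))
    print(feats.shape)                                    # torch.Size([1, 2048, 7, 7])
```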
Step a12 includes:
a121: using image data Data1 containing several object categories as the training data set images-input2;
a122: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2.
A123: pre-training the feature extraction network model con-model and loading the training data set images-input2; the feature extraction network model con-model is composed of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
The feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
A124: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
A13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
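The pipeline of steps A12 to A15 (backbone, RPN, fixed-size feature regions, Softmax classification with a loss) corresponds to a Faster R-CNN-style detector. The sketch below uses torchvision's reference Faster R-CNN implementation as a stand-in for the classifier M1; the model choice, the class count and the input size are assumptions, not the patent's exact architecture.

```python
# Hedged sketch: a Faster R-CNN-style classifier M1 (backbone + RPN + RoI head with
# Softmax classification), using torchvision's reference implementation as an analogue.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_m1(num_classes: int):
    model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Swap in a box predictor sized for our own object categories (plus background).
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

if __name__ == "__main__":
    m1 = build_m1(num_classes=5)                          # e.g. 4 original classes + background
    m1.eval()
    detections = m1([torch.randn(3, 480, 640)])
    print(detections[0].keys())                           # boxes, labels, scores
```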
The method for building the object feature extraction model M2 for extracting the feature vector of the object comprises the following steps:
b11: generating the training data set images-input2 from image data Data1 containing several object categories;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2.
B13: pre-training the feature extraction network model con-model and loading the training data set images-input2; the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC.
The feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
B14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
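A minimal sketch of the con-model layer configuration described above (16 convolutional layers with 3x3 kernels, padding 1 and stride 1, 5 max-pooling layers of size 2x2 with stride 2, and three fully connected layers with dropout) is given below. The channel widths follow a VGG-style layout and, like the exact placement of the relu layers, are assumptions rather than values fixed by the patent.

```python
# Sketch of the con-model feature extraction network (assumed VGG-style channel widths).
import torch.nn as nn

def _block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # 2x2 max pooling, stride 2
    return layers

class ConModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 4), (256, 512, 4), (512, 512, 4)]
        self.features = nn.Sequential(*[l for c in cfg for l in _block(*c)])  # 16 conv layers
        self.classifier = nn.Sequential(                    # three FC layers, dropout on each
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes), nn.Dropout(0.5),
        )

    def forward(self, x):                                   # x: (N, 3, 224, 224)
        return self.classifier(self.features(x))
```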
S2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
The feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication).
The function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
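The two helper operations can be written down directly from the description above. How A, B and C are actually combined into S1 is given only as an image in the original publication, so the assembly shown below (zero-padding the narrower matrices and stacking them) is merely an illustrative assumption, as are the toy matrix sizes.

```python
# MatToVec and Pad as described in the text, plus a *hypothetical* assembly of S1.
import numpy as np

def mat_to_vec(t: np.ndarray) -> np.ndarray:
    """MatToVec(T): splice the rows of matrix T into one one-dimensional vector."""
    return t.reshape(-1)

def pad(n: int) -> np.ndarray:
    """Pad(n): zero-padding operation, a run of n zeros."""
    return np.zeros(n)

# Feature matrices taken after conv3-4, conv4-4 and conv5-4 (sizes n x m, i x j, p x q).
A = np.random.rand(8, 8)   # n = m = 8 (toy sizes)
B = np.random.rand(4, 4)   # i = n/2, j = m/2
C = np.random.rand(2, 2)   # p = i/2, q = j/2

# Hypothetical assembly of S1: pad the rows of B and C with zeros to the width of A
# and stack everything into one matrix (the patent's exact formula is not reproduced).
S1 = np.vstack([
    A,
    np.hstack([B, np.tile(pad(A.shape[1] - B.shape[1]), (B.shape[0], 1))]),
    np.hstack([C, np.tile(pad(A.shape[1] - C.shape[1]), (C.shape[0], 1))]),
])
R1 = mat_to_vec(S1)        # R1 = MatToVec(S1), one element of the feature vector set R
```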
S5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication),
where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
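Because the matching-degree formula and the expression for most-value appear only as images in the original publication, the following sketch is one plausible reading and should be treated entirely as an assumption: a weighted combination (with α + β + γ = 1) of similarities between three per-layer segments of two length-L feature vectors, with most-value taken as the maximum matching degree over all pairs drawn from R and Q.

```python
# Assumed reading of the matching degree and the highest confidence most-value.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def matching_degree(m, n, bounds, alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted similarity of two length-L vectors split at hypothetical layer boundaries."""
    a, b = bounds                       # end of the conv3-4 part, end of the conv4-4 part
    return (alpha * cosine(m[:a], n[:a])
            + beta * cosine(m[a:b], n[a:b])
            + gamma * cosine(m[b:], n[b:]))

def most_value(R, Q, bounds):
    """Highest confidence: the maximum matching degree over all pairs drawn from R and Q."""
    return max(matching_degree(m, n, bounds) for m in Q for n in R)
```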
S6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
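A small sketch of the augmentation step S7: it produces image-1 … image-n from the voice-tagged image and stores them in the new-object image library B2. The particular transforms (flip, crop, colour jitter), the count n and the directory layout are assumptions rather than choices fixed by the patent.

```python
# Generate n augmented copies of the voice-tagged image and save them into B2.
from pathlib import Path
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def augment_to_b2(image_path: str, label: str, b2_dir: str = "B2", n: int = 20):
    b2 = Path(b2_dir) / label
    b2.mkdir(parents=True, exist_ok=True)
    img = Image.open(image_path).convert("RGB")
    for k in range(1, n + 1):
        augment(img).save(b2 / f"image-{k}.jpg")          # augmented images image-1 .. image-n
```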
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
the feature vector feature is obtained without passing through the autonomous RPN network model RPN-model in the feature extraction model M2, because the new object is already tagged with features by speech assistance and no object region needs to be extracted again. And directly inputting the image with the characteristic label into the characteristic extraction network model con-model in the characteristic extraction model M2 to extract the characteristic vector feature.
S9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if so, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: if not, repeating the steps S3-S9 until the data set quantity of the new object reaches the data set quantity N required by the training.
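Putting steps S3 to S11 together, the online-learning loop can be summarized by the following pseudocode sketch. All helper names (recognize, extract_features, most_value, ask_voice_label, augment_to_b2, retrain) and the container interfaces for B1 and B2 are hypothetical placeholders standing in for the components M1, M2, B1 and B2 described above.

```python
# Hypothetical sketch of the overall online-learning loop (steps S3-S11).
def process_frame(image, M1, M2, B1, B2, base_value=0.8, N=200):
    unidentified = [obj for obj in M1.recognize(image) if obj.label is None]   # S3/S4
    if not unidentified:
        return M1
    for obj in unidentified:
        R = M2.extract_features(obj)                                           # S4
        if most_value(R, B1.vectors, bounds=(64, 80)) > base_value:            # S5
            continue                                   # already memorised: treated as recognised
        label = ask_voice_label(obj)                   # S6: voice-assisted tagging
        augment_to_b2(obj.crop, label, n=20)           # S7: augmentation into B2
        B1.add(label, M2.extract_features(obj))        # S8: memorise the feature vector
        if B2.count(label) >= N:                       # S9
            M1 = retrain(M1, B2.dataset(label))        # S10: train and swap in a new classifier
            B2.delete(label)
    return M1                                          # S11: otherwise keep looping over frames
```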
The method builds on a conventional object recognition model and on image feature matching technology. Known objects are recognized accurately by the initial object recognition model; when a new object appears, its features are memorized by an online learning model and the initial object recognition model is updated in real time, so the model generalizes better and is more suitable for real-scene applications.
In some scenes most objects are relatively fixed, so recognition can be achieved by feature memory alone, and the initial object recognition model can be updated through continuous memory learning so that more kinds of objects can be recognized. Applying this network model to scenes that require object recognition makes the model more intelligent. Compared with conventional object recognition models it has higher application value, and it can promote the development of the object recognition field, which gives it important research significance.

Claims (7)

1. A method for constructing a new audio-visual collaborative learning target network model assisted by voice is characterized by comprising the following steps:
s1: building an original object classifier M1 for original object recognition and an object feature extraction model M2 for extracting feature vectors of objects;
s2: creating an object feature vector repository B1 for holding feature vectors of new objects and a new object image repository B2 for holding image datasets of new objects;
s3: inputting a new image picture, and loading an original object classifier M1 to perform object recognition on the new image picture;
s4: if the new image picture contains no unidentified object, stopping the operation; if unidentified objects (object-1, …, object-m) exist, loading the object feature extraction model M2 to extract features from the unidentified objects (object-1, …, object-m), and matching each feature vector in the extracted feature vector set R against each feature vector in the feature vector library B1;
s5: if, during matching, there is an object whose highest confidence most-value is higher than the base confidence base-value, judging that the object is correctly recognized; otherwise judging that it is a new object;
s6: performing human-computer interaction with voice assistance, describing the dominant characteristics of the new object by voice, and attaching a voice tag to the new object to obtain a new image;
s7: performing image augmentation on the new image to obtain augmented images (image-1, image-2, …, image-n), and storing them in the new object image library B2;
s8: loading the object feature extraction model M2, extracting features of the new object in the new image, and storing the obtained feature vector feature in the feature vector library B1;
s9: traversing the new object image library B2, and judging whether the data set quantity of the new object reaches the data set quantity N required by training;
s10: if yes, merging the data set N of the new object with the data set of the original object classifier M1, training a new object classifier to replace the original object classifier M1 by using the merged data set, and deleting the image data set of the new object features in the new object image library B2;
s11: otherwise, repeating steps S3-S9 until the data set amount of the new object reaches the data set amount N required by the training.
2. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the method for constructing an original object classifier M1 for original object recognition comprises the following steps:
a11: generating a training image set images-input1 by using the image data set according to the actual application scene;
a12: creating a residual convolutional neural network ResNet to extract the image features features-maps of the images in the training image set images-input1, wherein the residual convolutional neural network ResNet consists of the convolutional layers conv1, the relu1 layers and the pooling layers pooling1;
a13: creating an RPN network to generate image candidate regions proposals: inputting the image features features-maps, judging through Softmax whether they belong to the foreground or the background, and correcting the candidate regions proposals to generate accurate candidate regions proposals1;
a14: generating fixed-size feature regions proposals-features-maps from the candidate regions proposals1 and the image features features-maps.
A15: fully connecting the fixed-size feature regions, classifying the objects with Softmax, calculating the Loss and correcting it, thereby achieving accurate classification of the original objects.
3. The method for constructing the new target network model for the voice-assisted audio-visual collaborative learning according to claim 2, wherein the method for constructing the object feature extraction model M2 for extracting the feature vectors of the object comprises the following steps:
b11: preparing image data Data1 containing several object categories as the training data set images-input2;
b12: loading the training data set images-input2, pre-training an autonomous RPN network model RPN-model, and outputting the object candidate regions proposals2;
b13: pre-training a feature extraction network model con-model and loading the training data set images-input2, wherein the feature extraction network model con-model consists of the convolutional layers conv2, the relu2 layers, the pooling layers pooling2 and the fully connected layers FC;
b14: correcting the object candidate regions proposals2 and then inputting them into the feature extraction network model con-model for feature extraction, obtaining the image features features-maps of each candidate region.
4. The method as claimed in claim 3, wherein the feature extraction network model con-model has 16 convolutional layers conv2, 15 relu2 layers and 5 pooling layers pooling2; the convolutional layers conv2 use multi-channel convolution with a 3x3 kernel, padding of 1 and a stride of 1; the pooling layers pooling2 use a 2x2 filter with a stride of 2 and max pooling; the fully connected layers FC are three layers, and a dropout mechanism is added to each of them.
5. The method of claim 2, wherein the residual convolutional neural network ResNet has 49 convolutional layers conv1, 49 relu1 layers and 2 pooling layers pooling1; the convolutional layers conv1 use multi-channel convolution and include one 7x7 convolution kernel, 32 1x1 convolution kernels and 16 3x3 convolution kernels; and the pooling layers pooling1 use a 3x3 max-pooling filter and a 2x2 average-pooling filter.
6. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the feature vector set R is extracted from the deep convolutional layers of the feature extraction network model con-model:
the feature matrix output by the 8th convolutional layer conv3-4 is A (of size n x m);
the feature matrix output by the 12th convolutional layer conv4-4 is B (of size i x j);
the feature matrix output by the 16th convolutional layer conv5-4 is C (of size p x q),
where i = n/2, j = m/2, p = i/2, q = j/2;
the combined feature matrix S1 is then formed from A, B and C (the combining formula is reproduced only as an image in the original publication);
the function MatToVec(T) splices the rows of a matrix into a one-dimensional vector, where the parameter T ∈ {A, B, C} is a matrix; the function Pad(n) is a zero-padding operation, where the parameter n is the number of zeros to pad; the feature vector R1 = MatToVec(S1), and the feature vector set R = (R1, R2, …, Rs); n and m denote the length and width of matrices A and B respectively, and p and q denote the length and width of matrix C.
7. The method for constructing a new target network model for speech-assisted audio-visual collaborative learning according to claim 1, wherein the highest confidence most-value is derived from the matching degree between a feature vector in the feature vector set R and a feature vector in the feature vector library B1 (the matching-degree formula and the expression for the highest confidence are reproduced only as images in the original publication), where α + β + γ = 1, M belongs to the feature vector set Q = (Q1, Q2, …, Qt) of the feature vector library B1, N belongs to the feature vector set R = (R1, R2, …, Rs), and all feature vectors in Q and R have size L.
CN201911334785.5A 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning Pending CN111079849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911334785.5A CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911334785.5A CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Publications (1)

Publication Number Publication Date
CN111079849A true CN111079849A (en) 2020-04-28

Family

ID=70316831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911334785.5A Pending CN111079849A (en) 2019-12-23 2019-12-23 Method for constructing new target network model for voice-assisted audio-visual collaborative learning

Country Status (1)

Country Link
CN (1) CN111079849A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506407A (en) * 2017-08-07 2017-12-22 深圳市大迈科技有限公司 A kind of document classification, the method and device called
CN108009591A (en) * 2017-12-14 2018-05-08 西南交通大学 A kind of contact network key component identification method based on deep learning
CN108875455A (en) * 2017-05-11 2018-11-23 Tcl集团股份有限公司 A kind of unsupervised face intelligence precise recognition method and system
CN109063594A (en) * 2018-07-13 2018-12-21 吉林大学 Remote sensing images fast target detection method based on YOLOv2

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875455A (en) * 2017-05-11 2018-11-23 Tcl集团股份有限公司 A kind of unsupervised face intelligence precise recognition method and system
CN107506407A (en) * 2017-08-07 2017-12-22 深圳市大迈科技有限公司 A kind of document classification, the method and device called
CN108009591A (en) * 2017-12-14 2018-05-08 西南交通大学 A kind of contact network key component identification method based on deep learning
CN109063594A (en) * 2018-07-13 2018-12-21 吉林大学 Remote sensing images fast target detection method based on YOLOv2

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIMING HE et al.: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition *
SHAOQING REN et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence *

Similar Documents

Publication Publication Date Title
Kao et al. Visual aesthetic quality assessment with a regression model
CN109740413A (en) Pedestrian recognition methods, device, computer equipment and computer storage medium again
US20170032222A1 (en) Cross-trained convolutional neural networks using multimodal images
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN111461212A (en) Compression method for point cloud target detection model
CN112347284B (en) Combined trademark image retrieval method
CN108133235B (en) Pedestrian detection method based on neural network multi-scale feature map
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN111368766A (en) Cattle face detection and identification method based on deep learning
CN111222487A (en) Video target behavior identification method and electronic equipment
CN108921850B (en) Image local feature extraction method based on image segmentation technology
CN109034121B (en) Commodity identification processing method, device, equipment and computer storage medium
CN112200031A (en) Network model training method and equipment for generating image corresponding word description
CN111340051A (en) Picture processing method and device and storage medium
CN113989604A (en) Tire DOT information identification method based on end-to-end deep learning
CN113963026A (en) Target tracking method and system based on non-local feature fusion and online updating
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Peng et al. Document image quality assessment using discriminative sparse representation
CN117115614B (en) Object identification method, device, equipment and storage medium for outdoor image
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
Timotheatos et al. Vision based horizon detection for UAV navigation
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Abdelaziz et al. Few-shot learning with saliency maps as additional visual information
Jiafa et al. A scene recognition algorithm based on deep residual network
CN116740413A (en) Deep sea biological target detection method based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-04-28)