Disclosure of Invention
The invention provides an object description generation method based on the fusion of tactile signals and visual images, aiming at overcoming the defects of the prior art.
The invention is realized by the following technical scheme:
An object description generation method based on machine vision and tactile perception, characterized by comprising the following steps:
S1, preprocessing the raw visual and tactile data;
S2, inputting the collected visual and tactile information into a two-dimensional convolutional neural network and a one-dimensional convolutional neural network, respectively, and concatenating the feature vectors output by the two networks to obtain a visual-tactile fusion feature vector;
S3, inputting the obtained visual-tactile fusion feature vector into two fully-connected network branches, wherein the first fully-connected network is used for identifying and classifying objects, and the second fully-connected network is used for identifying physical attributes of the objects;
and S4, embedding the classification results and the physical attributes obtained by the two fully-connected networks into object description sentences in the form of keywords.
Further, in order to better implement the present invention, in S1, the visual information is preprocessed by resizing the original high-resolution image to 300 × 300 pixels and randomly applying offsets of up to 30% to the brightness, contrast, and saturation of the picture to obtain the final input image.
Further, in order to better implement the present invention, in S1, the tactile information is preprocessed by cutting the data with MATLAB software and compressing the multidimensional data of different lengths so that all tactile samples have the same length.
Further, in order to better implement the present invention, in S2, the visual information and the tactile information are input in pairs: the visual information is input into the two-dimensional convolutional neural network, and the tactile information is input into the one-dimensional convolutional neural network; three one-dimensional convolutional layers are used to process the tactile information, with the ReLU function as the activation function; the DenseNet169 model is used for visual information processing.
Further, in order to better implement the present invention, in S3, the supervised labels used by the two fully-connected networks take the form of labels in a standard multi-class task and a standard multi-label task; the neural network with multi-branch output has two branches and therefore two loss functions; the cross-entropy loss function is used in the multi-class task, and the loss function used in the multi-label task is the multi-label classification loss function MultiLabelSoftMarginLoss() provided by the PyTorch neural network framework.
Further, in order to better implement the present invention, in S4, the specific method for converting the classification result and the physical attributes into keywords is as follows: the object category keywords are sorted to form a list of n elements, and the index value of each object category keyword is used as the label of the corresponding object, each object having only one label; the output of the multi-class task is n probability values, and the corresponding object category keyword is found from the index of the largest probability value; label generation in the multi-label task is similar to the multi-class task: the m physical attribute keywords are first sorted to form a list of m elements, and the multi-label vector consists of m elements corresponding respectively to the m physical attribute keywords; to obtain the physical attribute keywords from the output of the multi-label classification network, the indices whose predicted value is 1 are obtained, and the corresponding attributes are then retrieved from the physical attribute keyword list according to these indices, completing the extraction of the physical attribute keywords.
The invention has the beneficial effects that:
the object description generation method based on machine vision and tactile perception provided by the invention constructs a multi-branch network model capable of simultaneously predicting object category keywords and physical attribute keywords, and then forms a description sentence of the object from the predicted keywords. The method effectively improves the robot's ability to perceive and describe its surroundings, making the robot more intelligent in human-computer interaction.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Fig. 1-7 illustrate a specific embodiment of the present invention, an object description generation method based on visual and tactile perception. As shown in fig. 1, this embodiment proposes a multi-branch neural network structure with multi-modal input and multi-level output, which takes machine vision and tactile sensing as two modal inputs: the visual input is fed into a two-dimensional convolutional neural network, and the tactile input is fed into a one-dimensional convolutional neural network. The feature vectors output by the two-dimensional and one-dimensional convolutional neural networks are then concatenated to obtain a visual-tactile fusion feature vector. Finally, the visual-tactile fusion feature vector is input into two fully-connected network branches: the first fully-connected network outputs the object category predicted from the fusion feature vector, and the second predicts the physical attributes of the object. In addition, the embodiment provides an object description generation method, which converts the classification result and the physical attributes output by the multi-branch network structure into keywords and then embeds the keywords into a description sentence template.
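A minimal PyTorch sketch of this multi-branch structure is given below, assuming paired image and tactile tensors as input; the feature lengths (1664 from DenseNet169, 1978 from the tactile branch, 3642 after fusion) and the 53-class / 24-attribute outputs follow the embodiment, while the single linear layer used for each fully-connected head is an illustrative simplification rather than the exact branch design.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisuoTactileNet(nn.Module):
    """Sketch of the multi-branch network: a 2D CNN for vision, a 1D CNN for
    touch, fusion by concatenation, and two fully-connected output heads."""

    def __init__(self, num_classes=53, num_attributes=24):
        super().__init__()
        # Visual branch: DenseNet169 backbone without its classifier (1664-d features).
        densenet = models.densenet169(weights=None)  # pretrained weights could be used instead
        self.visual_branch = nn.Sequential(
            densenet.features,
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Tactile branch: three 1D convolution layers (parameters as in Table 1).
        # A 46-channel, 2000-point input yields 46 x 43 = 1978 features.
        self.tactile_branch = nn.Sequential(
            nn.Conv1d(46, 32, kernel_size=7, stride=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=3), nn.ReLU(),
            nn.Conv1d(64, 46, kernel_size=5, stride=3), nn.ReLU(),
            nn.Flatten(),
        )
        # Two heads on the 1664 + 1978 = 3642-dimensional fused vector.
        self.class_head = nn.Linear(3642, num_classes)          # multi-class branch
        self.attribute_head = nn.Linear(3642, num_attributes)   # multi-label branch

    def forward(self, image, touch):
        v = self.visual_branch(image)     # (B, 1664)
        t = self.tactile_branch(touch)    # (B, 1978)
        fused = torch.cat([v, t], dim=1)  # (B, 3642) visual-tactile fusion vector
        return self.class_head(fused), self.attribute_head(fused)
```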
The specific implementation process of this embodiment is as follows:
1. Data set
the method of this example was trained and tested on the PHAC-2 dataset, published by the university of Pennsylvania, containing visual and tactile data for 53 objects, wherein each object's visual data contained 8 photographs, which were collected by placing the objects on an aluminum disk that was photographed once for every 45 degrees of rotation. The haptic data set consists of two pressure values, micro-vibrations, and temperature values, the haptic data being from haptic data of squeezing, pinching, slow sliding, and fast sliding for each object. The data set also contains 24 tactile adjectives to describe physical properties of the object, including softness, hardness, temperature, viscosity, elasticity, etc. Each object in the data set is assigned several tactile adjectives, and to exclude contingencies, the adjectives of each object are determined collectively by 36 individuals.
The method proposed in this embodiment requires the data set to be divided into a training set and a test set; one visual sample and one tactile sample are extracted from each object as the test set. To ensure fairness, during test-set selection a number a between 1 and 8 is randomly generated by the computer for each object, and the a-th image and the a-th tactile sample of that object are taken.
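A minimal sketch of this per-object selection is shown below, assuming each object's 8 images and 8 tactile recordings are held in two lists (the names `images` and `touches` are hypothetical).

```python
import random

def split_object(images, touches):
    """Pick one paired visual/tactile sample of an object for the test set.

    `images` and `touches` are assumed to be lists of 8 samples each for one
    object; the remaining 7 pairs stay in the training set.
    """
    a = random.randint(1, 8)  # random index between 1 and 8 (inclusive)
    test_pair = (images[a - 1], touches[a - 1])
    train_pairs = [(img, tac)
                   for i, (img, tac) in enumerate(zip(images, touches), start=1)
                   if i != a]
    return train_pairs, test_pair
```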
To reduce the number of network parameters, the image data are resized to 300 × 300 pictures. Because lighting causes the greatest interference with the robot's visual information, random offsets of up to 30% are applied to the brightness, contrast, and saturation of each picture to improve the robustness of the model.
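One possible realization of this preprocessing with torchvision is sketched below; the choice of `Resize` and `ColorJitter` is an assumption, while the 300 × 300 size and the 30% random offsets follow the text.

```python
from torchvision import transforms

# Resize to 300 x 300 and randomly jitter brightness, contrast, and saturation
# by up to 30%, then convert to a tensor for the 2D CNN.
image_transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])
```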
Since the tactile data in the PHAC-2 data set are 88-dimensional sequences that are long and of unequal lengths, the data need to be compressed. Observation shows that the two tactile actions "slow sliding" and "fast sliding" in the data set are each about 2000 data points long; this part of the data is small in volume and its features are obvious. MATLAB software is then used to cut the data, taking the magnitude of data change as the cutting criterion: the data are read starting from the end, and when the absolute value of the slope of the pressure value exceeds 1, the change is considered large; that point is taken as the cutting reference, and a segment of 2000 data points is read from it. To further reduce the data volume, only the important pressure values and micro-vibration signals are extracted as the tactile data, finally yielding 46-dimensional tactile data with a length of 2000 data points.
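The embodiment performs this cutting in MATLAB; the NumPy sketch below only illustrates the described rule under one reading of it (scan the pressure channel from the end, treat a slope exceeding 1 in absolute value as a large change, and keep a 2000-point window ending at that point) and is not the original script.

```python
import numpy as np

def cut_tactile(sequence, pressure_channel=0, length=2000):
    """Cut one (channels, T) tactile recording down to `length` data points.

    Scans the chosen pressure channel from the end of the recording; the first
    point whose slope exceeds 1 in absolute value is taken as the cutting
    reference, and the 2000-point window ending there is kept.
    """
    pressure = sequence[pressure_channel]
    slope = np.diff(pressure)
    end = len(pressure) - 1
    for i in range(len(slope) - 1, -1, -1):   # read from the end backwards
        if abs(slope[i]) > 1.0:               # large change detected
            end = i + 1
            break
    start = max(0, end - length + 1)
    segment = sequence[:, start:end + 1]
    # Pad at the front with the first value if fewer than `length` points remain.
    if segment.shape[1] < length:
        pad = np.repeat(segment[:, :1], length - segment.shape[1], axis=1)
        segment = np.concatenate([pad, segment], axis=1)
    return segment
```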
2. Model introduction
in the present embodiment, the visual data and the tactile data corresponding to each object are input in pairs: the visual data are input into the two-dimensional convolution model and the tactile data into the one-dimensional convolution model, and the learning rate is set to 0.00002.
The processed tactile data consist of 46 one-dimensional signals, so, in keeping with the characteristics of one-dimensional signals, a one-dimensional convolutional neural network is used to extract their features. Three one-dimensional convolutional layers are used in this embodiment, with the ReLU function as the activation function; the specific parameters of each layer are listed in Table 1, and a shape check of these parameters is sketched after the table:
Table 1: One-dimensional convolutional neural network parameters
| Layer | Input channels | Output channels | Convolution kernel size | Convolution stride |
| 1     | 46             | 32              | 7                       | 5                  |
| 2     | 32             | 64              | 5                       | 3                  |
| 3     | 64             | 46              | 5                       | 3                  |
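As a quick consistency check, the sketch below builds the three layers exactly as listed in Table 1 and verifies that a preprocessed 46-channel, 2000-point tactile sample yields the 1978-dimensional feature vector referred to in the fusion step.

```python
import torch
import torch.nn as nn

# The three 1D convolution layers of Table 1, each followed by ReLU.
tactile_cnn = nn.Sequential(
    nn.Conv1d(46, 32, kernel_size=7, stride=5), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=5, stride=3), nn.ReLU(),
    nn.Conv1d(64, 46, kernel_size=5, stride=3), nn.ReLU(),
)

sample = torch.randn(1, 46, 2000)          # one preprocessed tactile sample
features = tactile_cnn(sample).flatten(1)  # 46 channels x 43 time steps
print(features.shape)                      # torch.Size([1, 1978])
```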
The processed visual image is a 300 × 300 × 3 three-channel color image, and the images are processed with the mature DenseNet169 model from the computer vision field.
Visual and tactile features are extracted with the two-dimensional and one-dimensional convolutions respectively, yielding feature vectors of lengths 1664 and 1978; the two feature vectors are concatenated to obtain a visual-tactile fusion feature vector of length 3642, which is then input into two fully-connected neural networks for classification. The two fully-connected networks differ in that the first is used for the multi-classification task, i.e., after the visual and tactile information of an object in the test set is entered into the model, the first fully-connected network predicts which of the 53 objects it is. The second fully-connected network is used for the multi-label classification task; the difference is that the multi-classification task identifies which of several objects the input belongs to, while the multi-label classification task identifies which of several attributes the object possesses.
The supervised labels used by both fully-connected networks take the form of labels in a standard multi-classification task and a standard multi-label task. It should be noted that such a multi-branch output neural network has two branches and therefore two loss functions. In this embodiment, the multi-classification task uses the cross-entropy loss function (Equation 1), and the multi-label classification task uses the multi-label classification loss function MultiLabelSoftMarginLoss (Equation 2) provided by the PyTorch neural network framework; its output is thresholded at 0: a predicted value greater than 0 is taken as 1, and a predicted value less than 0 is taken as 0. The optimization goal during training is to minimize the total loss (Equation 3), the sum of the two loss functions.
loss(x1, class) = -x1[class] + log(Σ_j exp(x1[j]))    (Equation 1)
wherein: x1 represents the prediction output of the first fully-connected network, class represents the index of the label class, and x1[j] denotes the j-th value of x1.
loss(x2, y2) = -(1/C) · Σ_i [ y2[i] · log(σ(x2[i])) + (1 - y2[i]) · log(1 - σ(x2[i])) ]    (Equation 2)
wherein: σ(z) = 1/(1 + exp(-z)) is the sigmoid function, x2 represents the output of the second fully-connected network, y2 represents the multi-label target, x2[i] and y2[i] denote the i-th values of x2 and y2, y2[i] ∈ {0, 1}, i ∈ {0, …, x2.nElement() - 1}, and C = x2.nElement() is the number of output elements; Equation 2 is the MultiLabelSoftMarginLoss as defined in PyTorch.
Loss = loss(x1, class) + loss(x2, y2)    (Equation 3)
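A sketch of this two-branch loss in PyTorch follows; the cross-entropy loss, the MultiLabelSoftMarginLoss, the zero threshold, and the learning rate of 0.00002 follow the embodiment, while the Adam optimizer and the variable names are assumptions.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()           # Equation 1: multi-class branch
ml_loss = nn.MultiLabelSoftMarginLoss()   # Equation 2: multi-label branch

def total_loss(class_logits, class_target, attr_logits, attr_target):
    """Equation 3: sum of the two branch losses.

    class_target: integer class indices, shape (B,)
    attr_target:  0/1 attribute labels as floats, shape (B, 24)
    """
    return ce_loss(class_logits, class_target) + ml_loss(attr_logits, attr_target)

def attribute_predictions(attr_logits):
    """Threshold the multi-label outputs at 0: >0 -> 1, otherwise 0."""
    return (attr_logits > 0).long()

# Example usage with the model sketched earlier (names are assumptions):
# model = VisuoTactileNet()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.00002)
# class_logits, attr_logits = model(images, touches)
# loss = total_loss(class_logits, labels, attr_logits, attributes)
# loss.backward(); optimizer.step()
```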
3. Conversion to keywords
The labeling process in the multi-classification task entails sorting the object category keywords into a list of 53 elements and then using the index values of the object category keywords as the labels of the objects (as shown in fig. 4), with only one label per object. The labels in the multi-classification task are the numbers 0 to 52, and these 53 numbers correspond one-to-one to the 53 object category keywords. Our goal is to convert the numerical output of the multi-classification task into the corresponding object category keyword. The output of the multi-classification task is 53 probability values, and according to the correspondence in fig. 4, the object category keyword is found from the index of the largest probability value. For example, if the 0th output probability value is the largest, the corresponding object category keyword is "aluminum", and if the 51st output probability value is the largest, the corresponding object category keyword is "yellow felt". Therefore, to obtain the category keyword corresponding to the multi-class output, the index of the maximum of the 53 probabilities is obtained, and this index is used to retrieve the object category keyword at the corresponding position in the keyword list.
Label generation in the multi-label classification task is similar. In this embodiment, the 24 physical attribute keywords are first sorted to form a list of 24 elements, and the multi-label vector consists of 24 elements corresponding to the 24 physical attribute keywords. Referring to fig. 5, the labels are formed from the numbers 0 and 1, and each position in the label corresponds to one attribute: for example, if the number at the n-th position is 1, the object has the attribute at the n-th position of the attribute list, and if the (n+1)-th position is 0, the object does not have the attribute at the (n+1)-th position of the attribute list. Therefore, to obtain the physical attribute keywords from the multi-label classification network output, the indices whose predicted value is 1 are obtained, and the corresponding attributes are then retrieved from the physical attribute keyword list according to these indices, completing the extraction of the physical attribute keywords.
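Both keyword look-ups described above can be sketched as follows; the placeholder keyword lists stand in for the 53 sorted object category keywords and the 24 physical attribute keywords, which are not reproduced here.

```python
import torch

# Placeholders: the real lists hold the 53 sorted object category keywords
# (e.g. "aluminum", ..., "yellow felt") and the 24 physical attribute keywords.
category_keywords = [f"category_{i}" for i in range(53)]
attribute_keywords = [f"attribute_{i}" for i in range(24)]

def category_keyword(class_logits):
    """Multi-class branch: the index of the largest of the 53 outputs
    selects the object category keyword (logits for one sample, shape (53,))."""
    index = torch.argmax(class_logits).item()
    return category_keywords[index]

def attribute_keyword_list(attr_logits):
    """Multi-label branch: outputs thresholded at 0; every index predicted as 1
    selects the attribute keyword at the same position (shape (24,))."""
    indices = torch.nonzero(attr_logits > 0).flatten().tolist()
    return [attribute_keywords[i] for i in indices]
```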
4. Generation of the descriptive sentence
after the object type key words and the physical attribute key words are obtained, simple object description sentences can be formed. Wherein the category keywords can determine which category the object is, and the physical attribute keywords are used to describe what the object gives. The input of the visual and tactile information of each object in the test set into the multi-branch network model proposed in this embodiment predicts the object category keyword and the physical attribute keyword. And then filling the obtained object category keywords and the obtained physical attribute keywords into a fixed sentence description template to form the object description sentence. For example: this is a plastic box whose surface is smooth, resilient and somewhat hard. Wherein "plastic box" is an object category keyword, and "smooth", "elastic", "somewhat hard" is a physical attribute keyword.
5. Results and analysis
through testing on the international PHAC-2 data set, after 150 rounds of training on the training set, the prediction accuracy of the network model of the embodiment on the object category keywords reaches 100%, and the prediction accuracy on the physical attributes reaches 97.8%, which indicates that the model of the embodiment can effectively form object description sentences.
Fig. 6 shows the result of predicting the physical attributes of the 53 objects in the test set with the multi-branch network model provided in this embodiment; the images and tactile data in the test set are not included in the training set. Since the distribution of physical attributes among the 53 objects is not uniform, different attributes do not occur the same number of times in the whole data set. Therefore, the AUC value is used as the evaluation criterion of the prediction results; the AUC lies between 0 and 1, and the closer it is to 1, the higher the model accuracy, so it can be regarded as the prediction accuracy. As can be seen from the figure, the AUC values of the predictions for the 24 attributes are all above 0.9, with an average of 0.978.
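The per-attribute AUC evaluation could be computed with scikit-learn as sketched below; the array names and shapes are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def attribute_auc(true_labels, predicted_scores):
    """Compute one AUC value per attribute column and their mean.

    `true_labels` and `predicted_scores` are (num_objects, 24) arrays of 0/1
    ground-truth labels and network outputs for the test set.
    """
    aucs = [roc_auc_score(true_labels[:, i], predicted_scores[:, i])
            for i in range(true_labels.shape[1])]
    return aucs, float(np.mean(aucs))
```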
Fig. 7 shows the result of predicting the categories of the 53 objects in the test set with the multi-branch network model provided in this embodiment; the results are presented as a confusion matrix, whose ordinate is the true object category and whose abscissa is the category predicted by the multi-branch network. If the true value and the predicted value of an object are equal, the corresponding point lies on the diagonal of the matrix; if they differ, the point appears off the diagonal. As can be seen from the figure, the multi-branch network model of this embodiment correctly predicts the categories of all 53 objects, reaching an accuracy of 100%.
In summary, the object category keyword prediction and the physical attribute keyword prediction of the multi-branch network model provided by this embodiment reach accuracies of 100% and 97.8%, respectively. The descriptive sentences formed from the object category keywords and physical attribute keywords therefore also have high credibility.
Finally, the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting, and other modifications or equivalent substitutions made by the technical solutions of the present invention by those of ordinary skill in the art should be covered within the scope of the claims of the present invention as long as they do not depart from the spirit and scope of the technical solutions of the present invention.