Disclosure of Invention
The invention provides an object description generation method based on the fusion of tactile signals and visual images, aiming at overcoming the defects of the prior art.
The invention is realized by the following technical scheme:
An object description generation method based on machine vision and tactile perception, characterized by comprising the following steps:
S1, preprocessing the raw visual and tactile data;
S2, inputting the collected visual and tactile information into a two-dimensional convolutional neural network and a one-dimensional convolutional neural network, respectively, and concatenating the feature vectors output by the two networks to obtain a visual-tactile fusion feature vector;
S3, inputting the obtained visual-tactile fusion feature vector into two fully-connected network branches, wherein the first fully-connected network is used for identifying and classifying objects, and the second fully-connected network is used for identifying physical attributes of the objects;
and S4, embedding the classification results and the physical attributes obtained by the two fully-connected networks into object description sentences in the form of keywords.
Further, in order to better implement the present invention, in S1, the visual information is preprocessed by resizing the original high-resolution image to 300 × 300 pixels and randomly applying offsets of up to 30% to the brightness, contrast, and saturation of the picture to obtain the final input image.
Further, in order to better implement the present invention, in S1, the tactile information is preprocessed by cutting the data with MATLAB software and compressing the multidimensional data of different lengths so that all tactile samples have the same length.
Further, in order to better implement the present invention, in S2, the visual information and the tactile information are input in pairs: the visual information is input into the two-dimensional convolutional neural network, and the tactile information is input into the one-dimensional convolutional neural network; three one-dimensional convolutional layers are used to process the tactile information, with the ReLU function as the activation function; the DenseNet169 model is used for visual information processing.
Further, in order to better implement the present invention, in S3, the supervised labels used by the two fully-connected networks take the form of labels in a standard multi-class task and a standard multi-label task; the neural network with multi-branch output has two branches and therefore two loss functions; the cross-entropy loss function is used in the multi-class task, and the loss function used in the multi-label task is the multi-label classification loss function MultiLabelSoftMarginLoss() provided by the PyTorch neural network framework.
Further, in order to better implement the present invention, in S4, the specific method for converting the classification result and the physical attributes into keywords is as follows: the object category keywords are sorted to form a list of n elements, and the index value of each object category keyword is used as the label of the corresponding object, each object having only one label; the output of the multi-class task is n probability values, and the corresponding object category keyword is found from the index of the largest probability value; label generation in the multi-label task is similar to the multi-class task: the m physical attribute keywords are first sorted to form a list of m elements, and the multi-label vector consists of m elements corresponding respectively to the m physical attribute keywords; to obtain the physical attribute keywords from the output of the multi-label classification network, the indices whose predicted value is 1 are obtained, and the corresponding attributes are then retrieved from the physical attribute keyword list according to these indices, completing the extraction of the physical attribute keywords.
The invention has the beneficial effects that:
the object description generation method based on machine vision and tactile perception provided by the invention constructs a multi-branch network model capable of simultaneously predicting object category keywords and physical attribute keywords, and then forms a description sentence of the object from the predicted keywords. The method effectively improves the robot's ability to perceive and describe its surroundings, making the robot more intelligent in human-computer interaction.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
Fig. 1-7 illustrate a specific embodiment of the present invention, an object description generation method based on visual and tactile perception. As shown in fig. 1, this embodiment proposes a multi-branch neural network structure with multi-modal input and multi-level output, which takes machine vision and tactile sensing as two modal inputs: the visual input is fed into a two-dimensional convolutional neural network, and the tactile input is fed into a one-dimensional convolutional neural network. The feature vectors output by the two-dimensional and one-dimensional convolutional neural networks are then concatenated to obtain a visual-tactile fusion feature vector. Finally, the visual-tactile fusion feature vector is input into two fully-connected network branches: the first fully-connected network outputs the object category predicted from the fusion feature vector, and the second predicts the physical attributes of the object. In addition, the embodiment provides an object description generation method, which converts the classification result and the physical attributes output by the multi-branch network structure into keywords and then embeds the keywords into a description sentence template.
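A minimal PyTorch sketch of this multi-branch structure is given below, assuming paired image and tactile tensors as input; the feature lengths (1664 from DenseNet169, 1978 from the tactile branch, 3642 after fusion) and the 53-class / 24-attribute outputs follow the embodiment, while the single linear layer used for each fully-connected head is an illustrative simplification rather than the exact branch design.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisuoTactileNet(nn.Module):
    """Sketch of the multi-branch network: a 2D CNN for vision, a 1D CNN for
    touch, fusion by concatenation, and two fully-connected output heads."""

    def __init__(self, num_classes=53, num_attributes=24):
        super().__init__()
        # Visual branch: DenseNet169 backbone without its classifier (1664-d features).
        densenet = models.densenet169(weights=None)  # pretrained weights could be used instead
        self.visual_branch = nn.Sequential(
            densenet.features,
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Tactile branch: three 1D convolution layers (parameters as in Table 1).
        # A 46-channel, 2000-point input yields 46 x 43 = 1978 features.
        self.tactile_branch = nn.Sequential(
            nn.Conv1d(46, 32, kernel_size=7, stride=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=3), nn.ReLU(),
            nn.Conv1d(64, 46, kernel_size=5, stride=3), nn.ReLU(),
            nn.Flatten(),
        )
        # Two heads on the 1664 + 1978 = 3642-dimensional fused vector.
        self.class_head = nn.Linear(3642, num_classes)          # multi-class branch
        self.attribute_head = nn.Linear(3642, num_attributes)   # multi-label branch

    def forward(self, image, touch):
        v = self.visual_branch(image)     # (B, 1664)
        t = self.tactile_branch(touch)    # (B, 1978)
        fused = torch.cat([v, t], dim=1)  # (B, 3642) visual-tactile fusion vector
        return self.class_head(fused), self.attribute_head(fused)
```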
The specific implementation process of this embodiment is as follows:
1. Data set
the method of this example was trained and tested on the PHAC-2 dataset, published by the university of Pennsylvania, containing visual and tactile data for 53 objects, wherein each object's visual data contained 8 photographs, which were collected by placing the objects on an aluminum disk that was photographed once for every 45 degrees of rotation. The haptic data set consists of two pressure values, micro-vibrations, and temperature values, the haptic data being from haptic data of squeezing, pinching, slow sliding, and fast sliding for each object. The data set also contains 24 tactile adjectives to describe physical properties of the object, including softness, hardness, temperature, viscosity, elasticity, etc. Each object in the data set is assigned several tactile adjectives, and to exclude contingencies, the adjectives of each object are determined collectively by 36 individuals.
The method proposed in this embodiment requires the data set to be divided into a training set and a test set; one visual sample and one tactile sample are extracted from each object as the test set. To ensure fairness, during test-set selection a number a between 1 and 8 is randomly generated by the computer for each object, and the a-th image and the a-th tactile sample of that object are taken.
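A minimal sketch of this per-object selection is shown below, assuming each object's 8 images and 8 tactile recordings are held in two lists (the names `images` and `touches` are hypothetical).

```python
import random

def split_object(images, touches):
    """Pick one paired visual/tactile sample of an object for the test set.

    `images` and `touches` are assumed to be lists of 8 samples each for one
    object; the remaining 7 pairs stay in the training set.
    """
    a = random.randint(1, 8)  # random index between 1 and 8 (inclusive)
    test_pair = (images[a - 1], touches[a - 1])
    train_pairs = [(img, tac)
                   for i, (img, tac) in enumerate(zip(images, touches), start=1)
                   if i != a]
    return train_pairs, test_pair
```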
To reduce the number of network parameters, the image data are resized to 300 × 300 pictures. Because lighting causes the greatest interference with the robot's visual information, random offsets of up to 30% are applied to the brightness, contrast, and saturation of each picture to improve the robustness of the model.
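One possible realization of this preprocessing with torchvision is sketched below; the choice of `Resize` and `ColorJitter` is an assumption, while the 300 × 300 size and the 30% random offsets follow the text.

```python
from torchvision import transforms

# Resize to 300 x 300 and randomly jitter brightness, contrast, and saturation
# by up to 30%, then convert to a tensor for the 2D CNN.
image_transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])
```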
Since the tactile data in the PHAC-2 data set are 88-dimensional sequences that are long and of unequal lengths, the data need to be compressed. Observation shows that the two tactile actions "slow sliding" and "fast sliding" in the data set are each about 2000 data points long; this part of the data is small in volume and its features are obvious. MATLAB software is then used to cut the data, taking the magnitude of data change as the cutting criterion: the data are read starting from the end, and when the absolute value of the slope of the pressure value exceeds 1, the change is considered large; that point is taken as the cutting reference, and a segment of 2000 data points is read from it. To further reduce the data volume, only the important pressure values and micro-vibration signals are extracted as the tactile data, finally yielding 46-dimensional tactile data with a length of 2000 data points.
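The embodiment performs this cutting in MATLAB; the NumPy sketch below only illustrates the described rule under one reading of it (scan the pressure channel from the end, treat a slope exceeding 1 in absolute value as a large change, and keep a 2000-point window ending at that point) and is not the original script.

```python
import numpy as np

def cut_tactile(sequence, pressure_channel=0, length=2000):
    """Cut one (channels, T) tactile recording down to `length` data points.

    Scans the chosen pressure channel from the end of the recording; the first
    point whose slope exceeds 1 in absolute value is taken as the cutting
    reference, and the 2000-point window ending there is kept.
    """
    pressure = sequence[pressure_channel]
    slope = np.diff(pressure)
    end = len(pressure) - 1
    for i in range(len(slope) - 1, -1, -1):   # read from the end backwards
        if abs(slope[i]) > 1.0:               # large change detected
            end = i + 1
            break
    start = max(0, end - length + 1)
    segment = sequence[:, start:end + 1]
    # Pad at the front with the first value if fewer than `length` points remain.
    if segment.shape[1] < length:
        pad = np.repeat(segment[:, :1], length - segment.shape[1], axis=1)
        segment = np.concatenate([pad, segment], axis=1)
    return segment
```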
2. Model introduction
in the present embodiment, the visual data and the tactile data corresponding to each object are input in pairs: the visual data are input into the two-dimensional convolution model and the tactile data into the one-dimensional convolution model, and the learning rate is set to 0.00002.
The processed tactile data consist of 46 one-dimensional signals, so, in keeping with the characteristics of one-dimensional signals, a one-dimensional convolutional neural network is used to extract their features. Three one-dimensional convolutional layers are used in this embodiment, with the ReLU function as the activation function; the specific parameters of each layer are listed in Table 1, and a shape check of these parameters is sketched after the table:
Table 1: One-dimensional convolutional neural network parameters
| Layer | Input channels | Output channels | Convolution kernel size | Convolution stride |
| 1     | 46             | 32              | 7                       | 5                  |
| 2     | 32             | 64              | 5                       | 3                  |
| 3     | 64             | 46              | 5                       | 3                  |
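As a quick consistency check, the sketch below builds the three layers exactly as listed in Table 1 and verifies that a preprocessed 46-channel, 2000-point tactile sample yields the 1978-dimensional feature vector referred to in the fusion step.

```python
import torch
import torch.nn as nn

# The three 1D convolution layers of Table 1, each followed by ReLU.
tactile_cnn = nn.Sequential(
    nn.Conv1d(46, 32, kernel_size=7, stride=5), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=5, stride=3), nn.ReLU(),
    nn.Conv1d(64, 46, kernel_size=5, stride=3), nn.ReLU(),
)

sample = torch.randn(1, 46, 2000)          # one preprocessed tactile sample
features = tactile_cnn(sample).flatten(1)  # 46 channels x 43 time steps
print(features.shape)                      # torch.Size([1, 1978])
```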
The processed visual image is a 300 × 300 × 3 three-channel color image, and the images are processed with the mature DenseNet169 model from the computer vision field.
Visual and tactile features are extracted with the two-dimensional and one-dimensional convolutions respectively, yielding feature vectors of lengths 1664 and 1978; the two feature vectors are concatenated to obtain a visual-tactile fusion feature vector of length 3642, which is then input into two fully-connected neural networks for classification. The two fully-connected networks differ in that the first is used for the multi-classification task, i.e., after the visual and tactile information of an object in the test set is entered into the model, the first fully-connected network predicts which of the 53 objects it is. The second fully-connected network is used for the multi-label classification task; the difference is that the multi-classification task identifies which of several objects the input belongs to, while the multi-label classification task identifies which of several attributes the object possesses.
The supervised labels used by both fully-connected networks take the form of labels in a standard multi-classification task and a standard multi-label task. It should be noted that such a multi-branch output neural network has two branches and therefore two loss functions. In this embodiment, the multi-classification task uses the cross-entropy loss function (Equation 1), and the multi-label classification task uses the multi-label classification loss function MultiLabelSoftMarginLoss (Equation 2) provided by the PyTorch neural network framework; its output is thresholded at 0: a predicted value greater than 0 is taken as 1, and a predicted value less than 0 is taken as 0. The optimization goal during training is to minimize the total loss (Equation 3), the sum of the two loss functions.
loss(x1, class) = -x1[class] + log(Σ_j exp(x1[j]))    (Equation 1)
wherein: x1 represents the prediction output of the first fully-connected network, class represents the index of the label class, and x1[j] denotes the j-th value of x1.
loss(x2, y2) = -(1/C) · Σ_i [ y2[i] · log(σ(x2[i])) + (1 - y2[i]) · log(1 - σ(x2[i])) ]    (Equation 2)
wherein: σ(z) = 1/(1 + exp(-z)) is the sigmoid function, x2 represents the output of the second fully-connected network, y2 represents the multi-label target, x2[i] and y2[i] denote the i-th values of x2 and y2, y2[i] ∈ {0, 1}, i ∈ {0, …, x2.nElement() - 1}, and C = x2.nElement() is the number of output elements; Equation 2 is the MultiLabelSoftMarginLoss as defined in PyTorch.
Loss = loss(x1, class) + loss(x2, y2)    (Equation 3)
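A sketch of this two-branch loss in PyTorch follows; the cross-entropy loss, the MultiLabelSoftMarginLoss, the zero threshold, and the learning rate of 0.00002 follow the embodiment, while the Adam optimizer and the variable names are assumptions.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()           # Equation 1: multi-class branch
ml_loss = nn.MultiLabelSoftMarginLoss()   # Equation 2: multi-label branch

def total_loss(class_logits, class_target, attr_logits, attr_target):
    """Equation 3: sum of the two branch losses.

    class_target: integer class indices, shape (B,)
    attr_target:  0/1 attribute labels as floats, shape (B, 24)
    """
    return ce_loss(class_logits, class_target) + ml_loss(attr_logits, attr_target)

def attribute_predictions(attr_logits):
    """Threshold the multi-label outputs at 0: >0 -> 1, otherwise 0."""
    return (attr_logits > 0).long()

# Example usage with the model sketched earlier (names are assumptions):
# model = VisuoTactileNet()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.00002)
# class_logits, attr_logits = model(images, touches)
# loss = total_loss(class_logits, labels, attr_logits, attributes)
# loss.backward(); optimizer.step()
```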
3. Conversion to keywords
The labeling process in the multi-classification task entails sorting the object category keywords into a list of 53 elements and then using the index values of the object category keywords as the labels of the objects (as shown in fig. 4), with only one label per object. The labels in the multi-classification task are the numbers 0 to 52, and these 53 numbers correspond one-to-one to the 53 object category keywords. Our goal is to convert the numerical output of the multi-classification task into the corresponding object category keyword. The output of the multi-classification task is 53 probability values, and according to the correspondence in fig. 4, the object category keyword is found from the index of the largest probability value. For example, if the 0th output probability value is the largest, the corresponding object category keyword is "aluminum", and if the 51st output probability value is the largest, the corresponding object category keyword is "yellow felt". Therefore, to obtain the category keyword corresponding to the multi-class output, the index of the maximum of the 53 probabilities is obtained, and this index is used to retrieve the object category keyword at the corresponding position in the keyword list.
Label generation in the multi-label classification task is similar. In this embodiment, the 24 physical attribute keywords are first sorted to form a list of 24 elements, and the multi-label vector consists of 24 elements corresponding to the 24 physical attribute keywords. Referring to fig. 5, the labels are formed from the numbers 0 and 1, and each position in the label corresponds to one attribute: for example, if the number at the n-th position is 1, the object has the attribute at the n-th position of the attribute list, and if the (n+1)-th position is 0, the object does not have the attribute at the (n+1)-th position of the attribute list. Therefore, to obtain the physical attribute keywords from the multi-label classification network output, the indices whose predicted value is 1 are obtained, and the corresponding attributes are then retrieved from the physical attribute keyword list according to these indices, completing the extraction of the physical attribute keywords.
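Both keyword look-ups described above can be sketched as follows; the placeholder keyword lists stand in for the 53 sorted object category keywords and the 24 physical attribute keywords, which are not reproduced here.

```python
import torch

# Placeholders: the real lists hold the 53 sorted object category keywords
# (e.g. "aluminum", ..., "yellow felt") and the 24 physical attribute keywords.
category_keywords = [f"category_{i}" for i in range(53)]
attribute_keywords = [f"attribute_{i}" for i in range(24)]

def category_keyword(class_logits):
    """Multi-class branch: the index of the largest of the 53 outputs
    selects the object category keyword (logits for one sample, shape (53,))."""
    index = torch.argmax(class_logits).item()
    return category_keywords[index]

def attribute_keyword_list(attr_logits):
    """Multi-label branch: outputs thresholded at 0; every index predicted as 1
    selects the attribute keyword at the same position (shape (24,))."""
    indices = torch.nonzero(attr_logits > 0).flatten().tolist()
    return [attribute_keywords[i] for i in indices]
```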
4. Generation of the descriptive sentence
after the object type key words and the physical attribute key words are obtained, simple object description sentences can be formed. Wherein the category keywords can determine which category the object is, and the physical attribute keywords are used to describe what the object gives. The input of the visual and tactile information of each object in the test set into the multi-branch network model proposed in this embodiment predicts the object category keyword and the physical attribute keyword. And then filling the obtained object category keywords and the obtained physical attribute keywords into a fixed sentence description template to form the object description sentence. For example: this is a plastic box whose surface is smooth, resilient and somewhat hard. Wherein "plastic box" is an object category keyword, and "smooth", "elastic", "somewhat hard" is a physical attribute keyword.
5. Results and analysis
through testing on the international PHAC-2 data set, after 150 rounds of training on the training set, the prediction accuracy of the network model of the embodiment on the object category keywords reaches 100%, and the prediction accuracy on the physical attributes reaches 97.8%, which indicates that the model of the embodiment can effectively form object description sentences.
Fig. 6 shows the result of predicting the physical attributes of the 53 objects in the test set with the multi-branch network model provided in this embodiment; the images and tactile data in the test set are not included in the training set. Since the distribution of physical attributes among the 53 objects is not uniform, different attributes do not occur the same number of times in the whole data set. Therefore, the AUC value is used as the evaluation criterion of the prediction results; the AUC lies between 0 and 1, and the closer it is to 1, the higher the model accuracy, so it can be regarded as the prediction accuracy. As can be seen from the figure, the AUC values of the predictions for the 24 attributes are all above 0.9, with an average of 0.978.
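The per-attribute AUC evaluation could be computed with scikit-learn as sketched below; the array names and shapes are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def attribute_auc(true_labels, predicted_scores):
    """Compute one AUC value per attribute column and their mean.

    `true_labels` and `predicted_scores` are (num_objects, 24) arrays of 0/1
    ground-truth labels and network outputs for the test set.
    """
    aucs = [roc_auc_score(true_labels[:, i], predicted_scores[:, i])
            for i in range(true_labels.shape[1])]
    return aucs, float(np.mean(aucs))
```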
Fig. 7 shows the result of predicting the categories of the 53 objects in the test set with the multi-branch network model provided in this embodiment; the results are presented as a confusion matrix, whose ordinate is the true object category and whose abscissa is the category predicted by the multi-branch network. If the true value and the predicted value of an object are equal, the corresponding point lies on the diagonal of the matrix; if they differ, the point appears off the diagonal. As can be seen from the figure, the multi-branch network model of this embodiment correctly predicts the categories of all 53 objects, reaching an accuracy of 100%.
In summary, the object category keyword prediction and the physical attribute keyword prediction of the multi-branch network model provided by this embodiment reach accuracies of 100% and 97.8%, respectively. The descriptive sentences formed from the object category keywords and physical attribute keywords therefore also have high credibility.
Finally, the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting, and other modifications or equivalent substitutions made by the technical solutions of the present invention by those of ordinary skill in the art should be covered within the scope of the claims of the present invention as long as they do not depart from the spirit and scope of the technical solutions of the present invention.