CN109858565B - Home indoor scene recognition method based on deep learning and integrating global features and local article information - Google Patents

Home indoor scene recognition method based on deep learning and integrating global features and local article information Download PDF

Info

Publication number
CN109858565B
CN109858565B
Authority
CN
China
Prior art keywords
scene
num
picture
max
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910151241.9A
Other languages
Chinese (zh)
Other versions
CN109858565A (en)
Inventor
蒋倩
朱博
王彬
高翔
郑有祺
王翼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910151241.9A priority Critical patent/CN109858565B/en
Publication of CN109858565A publication Critical patent/CN109858565A/en
Application granted granted Critical
Publication of CN109858565B publication Critical patent/CN109858565B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a family indoor scene recognition method based on deep learning and integrating global features and local article information. The method comprises the following steps: building a training set and a testing set of the family indoor scene picture, and sending the training set into three convolutional neural networks of Alexnet, Googlnet and VGG for training and testing respectively to obtain scene characteristics; giving corresponding weights to the three types of scene features, and taking the weighted average as a global feature; training by utilizing an SSD convolutional neural network and obtaining local features of common articles in a family indoor scene; fusing global and local article characteristics by adopting a matrix splicing mode; processing the fusion result by a clustering algorithm to generate a scene classification center vector; and judging and outputting the scene category of the picture to be detected by taking the scene classification center vector as a classification standard. By using the method, the home service robot can automatically recognize scene semantics contained in the environment, and the intelligent level of the robot is improved.

Description

Home indoor scene recognition method based on deep learning and integrating global features and local article information
Technical Field
The invention relates to the field of scene recognition, in particular to a family indoor scene recognition method based on deep learning and integrating global features and local article information.
Background
In the field of robotics, how a robot identifies its current environment is an extremely important problem in computer vision. Scene recognition research for home service robots helps the robot obtain real-time pose information about the home scene it is in, and is key to building a map of the current environment and completing subsequent work. Current home service robots have a limited level of intelligence and cannot accurately and quickly judge their working environment.
Applying convolutional neural network models from deep learning to scene recognition for the home service robot allows hidden feature data to be learned automatically from large amounts of image data; the features correspond one-to-one with the labels, so effective extraction of image features is achieved. Meanwhile, using the articles in a scene as basic features for identification matches people's cognitive logic about the environment. By combining the global features of the scene with local article information and feeding them to the convolutional neural networks, the robot acquires judgment experience through learning and can automatically judge its current working environment.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a deep learning-based home indoor scene recognition method that fuses global features and local article information. It addresses the problems that current home service robots have a low level of intelligence, cannot respond correctly and promptly to the working environment, and have poor scene recognition capability, so that the service robot can automatically classify and recognize home scenes.
The deep learning-based home indoor scene recognition method fusing global features and local article information comprises the following steps:
step 1, constructing a training set and a test set of pictures of family indoor scenes, sending the training set simultaneously into the three convolutional neural networks Alexnet, Googlnet and VGG for separate training to generate the corresponding network models, calling these models to identify the training set, and outputting the confidence that each picture belongs to each scene category as the three types of scene features of the training set;
step 2, assigning specific weights to the three types of scene features obtained in the step 1 to perform weighted average, and taking the result as a global feature matrix of a training data set;
step 3, framing the articles in the pictures of the training set by using a picture labeling tool, labeling article labels, sending the generated labeling file into an SSD convolutional neural network for training to generate an article detection model, calling the model to identify the training set, and outputting the article labels of various types appearing in each picture and the confidence thereof as an article local feature matrix of the training set;
step 4, fusing the global and local article features: horizontally splicing the global feature matrix obtained in step 2 with the article local feature matrix obtained in step 3 to generate a comprehensive feature matrix, in which each row vector corresponds to the comprehensive feature of one picture in the training set and is divided into two parts according to the number of scene categories and the number of articles, the first half corresponding to the global feature of the picture and the second half to its local article features;
Step 5, classifying the training set according to scene types by using a clustering algorithm, randomly taking the comprehensive features of a certain picture under each scene as initial vectors, respectively calculating the vector similarity between each feature in the comprehensive features and each feature in the central vector group, updating the initial vectors according to a certain rule according to the calculation result, iterating to a preset number of rounds, obtaining central vectors representing the comprehensive features of each scene, and forming a scene classification central vector group;
and 6, after the picture to be detected is correspondingly processed, obtaining comprehensive vectors, respectively calculating Euclidean distances between the comprehensive vectors and each vector in the scene classification center vector group, and outputting a scene classification label corresponding to the scene classification center vector with the minimum distance as an identification result.
Further, in step 1, the three types of scene features are obtained as follows:
step 1-1, dividing the home indoor scenes into Num_scene categories, such as bathroom, bedroom and dining room, naming the jth scene category type_j to facilitate subsequent calculation, and dividing the color pictures of the various viewing angles under each scene category, in a certain proportion, into a training data set of Num_train pictures and a test data set of Num_test pictures, wherein Num_scene∈N*, Num_test∈N*, Num_train∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*};
Step 1-2, adding a scene category label set for the training set and the test set:
List_train = {list_train_1, list_train_2, ..., list_train_Num_train}
List_test = {list_test_1, list_test_2, ..., list_test_Num_test}
Num_train∈N*, Num_test∈N*
step 1-3, processing the training set into a corresponding data format according to the requirements of Alexnet, Googlnet and VGG convolutional neural networks, then simultaneously sending the training set into the Alexnet, Googlnet and VGG convolutional neural networks, and respectively training and generating network models, namely Model _ Alexnet, Model _ Googlnet and Model _ VGG;
step 1-4, calling the generated network models to identify the training set, wherein the confidences that the ith picture is judged as the jth scene category are respectively P_Alexnet_i_j, P_Googlnet_i_j and P_VGG_i_j, and these confidences form the scene confidence vectors of the picture:
Vector_Alexnet_i = (P_Alexnet_i_1, P_Alexnet_i_2, ..., P_Alexnet_i_Num_scene)
Vector_Googlnet_i = (P_Googlnet_i_1, P_Googlnet_i_2, ..., P_Googlnet_i_Num_scene)
Vector_VGG_i = (P_VGG_i_1, P_VGG_i_2, ..., P_VGG_i_Num_scene)
the matrixes formed by the three types of confidence coefficient vectors are respectively used as scene characteristic matrixes:
Matrix_Alexnet = (Vector_Alexnet_1; Vector_Alexnet_2; ...; Vector_Alexnet_Num_train)
Matrix_Googlnet = (Vector_Googlnet_1; Vector_Googlnet_2; ...; Vector_Googlnet_Num_train)
Matrix_VGG = (Vector_VGG_1; Vector_VGG_2; ...; Vector_VGG_Num_train)
each of size Num_train × Num_scene, with one scene confidence vector per row;
Num_train∈N*, Num_scene∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}.
Further, in step 2, the weighted averaging of the three types of scene features specifically comprises the following steps:
step 2-1, calling the network models obtained in step 1-3 to detect the pictures of the test set and obtain the confidence of each of the Num_scene scene categories for every picture, taking the scene category with the largest confidence as the judgment result and comparing it with the true label of the picture, the identification being counted as correct if they are the same; the numbers of correct identifications accumulated for Alexnet, Googlnet and VGG are recorded as Num_Alexnet, Num_Googlnet and Num_VGG, where Num_Alexnet∈N*, Num_Googlnet∈N*, Num_VGG∈N*;
Step 2-2, assigning weights Weight_Alexnet, Weight_Googlnet and Weight_VGG to the scene feature matrices Matrix_Alexnet, Matrix_Googlnet and Matrix_VGG respectively, wherein
Weight_Alexnet = Num_Alexnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_Googlnet = Num_Googlnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_VGG = Num_VGG / (Num_Alexnet + Num_Googlnet + Num_VGG)
the confidence level that the ith picture in the training set is judged as the jth scene class after the weighted average can be expressed as:
P_Global_i_j = Weight_Alexnet × P_Alexnet_i_j + Weight_Googlnet × P_Googlnet_i_j + Weight_VGG × P_VGG_i_j
and obtaining a Global feature Matrix _ Global by using the new confidence coefficient:
Matrix_Global = (Vector_Global_1; Vector_Global_2; ...; Vector_Global_Num_train), where Vector_Global_i = (P_Global_i_1, P_Global_i_2, ..., P_Global_i_Num_scene)
further, in the step 3, the obtaining of the local features of the common articles in the family indoor scene specifically includes the following steps:
step 3-1, selecting the common articles in home scenes and setting the maximum number of article categories and the maximum number of articles, Max_category and Max_num respectively; framing the articles appearing in the training-set pictures with the picture labeling tool and recording the labels, numbers and positions of the articles in the pictures to obtain the article annotations; Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, Max_num∈N*;
Step 3-2, setting the maximum number Max_num_k of each article category, Max_num_k∈N*, and training on the article annotations with the SSD convolutional neural network to generate the article detection model Model_SSD; k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*};
Step 3-3, calling the Model_SSD generated in step 3-2 to identify the training set, the confidence that the r_k-th class-k article is recognized in the ith picture being expressed as:
P_object_i_k_r_k
The confidence vector of class-k articles recognized in the ith picture can be expressed as:
Vector_object_i_k = (P_object_i_k_1, P_object_i_k_2, ..., P_object_i_k_Max_num_k)
The confidences in each vector are arranged from largest to smallest; if the class-k articles in the picture do not reach the maximum number Max_num_k, all detected articles are kept, and if the number of articles exceeds the limit, only the Max_num_k articles with the highest confidence are retained;
The article local feature matrix Matrix_object is composed of the article confidence vectors:
Matrix_object has one row per training picture, its ith row being the concatenation (Vector_object_i_1, Vector_object_i_2, ..., Vector_object_i_Max_category);
Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*}
Further, step 4 fuses the global and local article features, specifically:
horizontally splicing the Global feature matrix Matrix_Global obtained in step 2 with the article local feature matrix Matrix_object obtained in step 3 to generate the comprehensive feature matrix Matrix_combination:
Matrix_combination = (Matrix_Global  Matrix_object)
The ith row vector of the comprehensive feature matrix, Vector_combination_i, represents the comprehensive feature of the ith picture in the training set; the first half of the row vector corresponds to the global feature of the picture and the second half to its local article features:
Vector_combination_i = (P_Global_i_1, ..., P_Global_i_Num_scene, Vector_object_i_1, ..., Vector_object_i_Max_category)
Further, in step 5, the custom scene classification criterion is built with the following steps:
step 5-1, dividing the training data set into Num_scene parts by scene, the number of pictures corresponding to scene category type_j being Num_j, and partitioning the comprehensive feature matrix by scene category to obtain the comprehensive features corresponding to each scene category; in the comprehensive features corresponding to the sub-data set whose scene category is type_j, the confidence that the i_type_j-th picture is identified as scene category j and the confidence that the r_k-th class-k article is recognized in it can be expressed respectively as
P_Global_i_type_j_j and P_object_i_type_j_k_r_k
The confidence vector of class-k articles recognized in the picture can be expressed as:
Vector_object_i_type_j_k = (P_object_i_type_j_k_1, ..., P_object_i_type_j_k_Max_num_k)
The comprehensive feature matrix Matrix_combination_type_j corresponding to the sub-data set whose scene category is type_j can then be expressed as:
Matrix_combination_type_j = (Vector_combination_1; Vector_combination_2; ...; Vector_combination_Num_j), its rows being the comprehensive feature vectors of the Num_j pictures whose scene category is type_j;
i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-2, from each of the Num_scene sub comprehensive feature matrices Matrix_combination_type_j obtained in step 5-1, randomly selecting one row vector; in the present invention the first comprehensive feature under each scene category is exemplarily selected as the initial center vector, so the center vector representing scene type_j can be expressed as:
Vector_center_type_j (initialized to the first row of Matrix_combination_type_j)
and forming a central vector group Matrix _ center by the central vectors corresponding to the scenes:
Matrix_center = (Vector_center_type_1; Vector_center_type_2; ...; Vector_center_type_Num_scene)
Num_scene∈N*, i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-3, calculating the Euclidean distance between each of the comprehensive features and each vector of the center vector group, obtaining Num_scene distances per picture, and taking the scene category corresponding to the minimum value as the label produced by scene classification, giving the set List_detect:
List_detect = {list_detect_1, list_detect_2, ..., list_detect_Num_train};
comparing the List _ detect with the original scene tag List _ train of the picture, and updating the central vector according to the following rules:
if list_detect_i = list_train_i = j, then Vector_center_type_j is updated from Vector_combination_i using the updating coefficient γ according to the first update formula (given as an image in the original);
if list_detect_i ≠ list_train_i = j, then Vector_center_type_j is updated according to the second update formula (likewise given as an image);
wherein i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}, and γ is the updating coefficient;
step 5-4, setting the number of iterative updates to Max_iteration, repeating steps 5-2 and 5-3 for Max_iteration iterations, and taking the final result Matrix_center_Max_iteration as the scene classification criterion;
Max_iteration∈N*, i_iteration={i_iteration∈[1,Max_iteration]∧i_iteration∈N*}
Further, in step 6, the scene classification result for the picture to be detected is obtained with the following steps:
step 6-1, processing the picture to be recognized through steps 1 to 4 in sequence to obtain its comprehensive feature;
step 6-2, calculating the set of Euclidean distances between the comprehensive feature obtained in step 6-1 and each vector of Matrix_center_Max_iteration obtained in step 5-4:
Distance = {d_1, d_2, ..., d_j, ..., d_Num_scene}, Num_scene∈N*
and outputting the scene category label corresponding to the minimum of the Num_scene distances as the recognition result.
The method takes into account the influence of the articles in a scene on the scene category and introduces deep learning convolutional neural networks, so that the home service robot learns home scenes autonomously, acquires recognition experience and judges the scene automatically. A home service robot with this environment-cognition ability can switch its working mode according to the working environment, select its work content and complete tasks such as human-robot interaction, meeting the demand for human-robot integration in home scenes. The method addresses the problems that current home service robots have a low level of intelligence, cannot respond correctly and promptly to the working environment and have poor scene recognition capability, so that the service robot can automatically classify and recognize home scenes.
Drawings
FIG. 1 is a system block diagram of the present invention.
Detailed Description
The technical solution of the invention is explained in further detail below with reference to the accompanying drawings.
The deep learning-based home indoor scene recognition method fusing global features and local article information comprises the following steps:
step 1, constructing a training set and a test set of pictures of family indoor scenes, sending the training set simultaneously into the three convolutional neural networks Alexnet, Googlnet and VGG for separate training to generate the corresponding network models, calling these models to identify the training set, and outputting the confidence that each picture belongs to each scene category as the three types of scene features of the training set.
In step 1, the three types of scene features are obtained as follows:
step 1-1, dividing the home indoor scenes into Num_scene categories, such as bathroom, bedroom and dining room, naming the jth scene category type_j to facilitate subsequent calculation, and dividing the color pictures of the various viewing angles under each scene category, in a certain proportion, into a training data set of Num_train pictures and a test data set of Num_test pictures, wherein Num_scene∈N*, Num_test∈N*, Num_train∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}.
Step 1-2, adding a scene category label set for the training set and the test set:
List_train = {list_train_1, list_train_2, ..., list_train_Num_train}
List_test = {list_test_1, list_test_2, ..., list_test_Num_test}
Num_train∈N*, Num_test∈N*
And 1-3, processing the training set into a corresponding data format according to the requirements of Alexnet, Googlnet and VGG convolutional neural networks, simultaneously sending the training set into the Alexnet, Googlnet and VGG convolutional neural networks, and respectively training and generating network models including Model _ Alexnet, Model _ Googlnet and Model _ VGG.
Step 1-4, calling the generated network models to identify the training set, wherein the confidences that the ith picture is judged as the jth scene category are respectively P_Alexnet_i_j, P_Googlnet_i_j and P_VGG_i_j, and these confidences form the scene confidence vectors of the picture:
Vector_Alexnet_i = (P_Alexnet_i_1, P_Alexnet_i_2, ..., P_Alexnet_i_Num_scene)
Vector_Googlnet_i = (P_Googlnet_i_1, P_Googlnet_i_2, ..., P_Googlnet_i_Num_scene)
Vector_VGG_i = (P_VGG_i_1, P_VGG_i_2, ..., P_VGG_i_Num_scene)
the matrixes formed by the three types of confidence coefficient vectors are respectively used as scene characteristic matrixes:
Matrix_Alexnet = (Vector_Alexnet_1; Vector_Alexnet_2; ...; Vector_Alexnet_Num_train)
Matrix_Googlnet = (Vector_Googlnet_1; Vector_Googlnet_2; ...; Vector_Googlnet_Num_train)
Matrix_VGG = (Vector_VGG_1; Vector_VGG_2; ...; Vector_VGG_Num_train)
each of size Num_train × Num_scene, with one scene confidence vector per row;
Num_train∈N*, Num_scene∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}.
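As a rough illustration of steps 1-3 and 1-4, the sketch below (Python/NumPy, not part of the patent) stacks the per-picture scene confidence vectors returned by the three trained models into the scene feature matrices; the `predict_fn` callable is a hypothetical placeholder for whatever inference interface Model_Alexnet, Model_Googlnet and Model_VGG expose.

```python
import numpy as np

def build_scene_feature_matrix(pictures, predict_fn, num_scene):
    """Stack per-picture scene confidence vectors into a Num_train x Num_scene matrix."""
    rows = []
    for picture in pictures:
        confidences = np.asarray(predict_fn(picture), dtype=float)  # length-Num_scene confidences
        assert confidences.shape == (num_scene,)
        rows.append(confidences)
    return np.vstack(rows)

# Hypothetical usage (predict_alexnet etc. wrap the trained models and return
# one confidence per scene category for a picture):
# Matrix_Alexnet  = build_scene_feature_matrix(train_pictures, predict_alexnet,  Num_scene)
# Matrix_Googlnet = build_scene_feature_matrix(train_pictures, predict_googlnet, Num_scene)
# Matrix_VGG      = build_scene_feature_matrix(train_pictures, predict_vgg,      Num_scene)
```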
and 2, assigning specific weights to the three types of scene features obtained in the step 1, and performing weighted average, wherein the result is used as a global feature matrix of the training data set.
In step 2, the weighted averaging of the three types of scene features specifically comprises the following steps:
step 2-1, calling the network models obtained in step 1-3 to detect the pictures of the test set and obtain the confidence of each of the Num_scene scene categories for every picture, taking the scene category with the largest confidence as the judgment result and comparing it with the true label of the picture, the identification being counted as correct if they are the same. The numbers of correct identifications accumulated for Alexnet, Googlnet and VGG are recorded as Num_Alexnet, Num_Googlnet and Num_VGG, where Num_Alexnet∈N*, Num_Googlnet∈N*, Num_VGG∈N*.
Step 2-2, assigning weights Weight_Alexnet, Weight_Googlnet and Weight_VGG to the scene feature matrices Matrix_Alexnet, Matrix_Googlnet and Matrix_VGG respectively, wherein
Weight_Alexnet = Num_Alexnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_Googlnet = Num_Googlnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_VGG = Num_VGG / (Num_Alexnet + Num_Googlnet + Num_VGG)
the confidence that the ith picture in the training set is judged as the jth scene category after the weighted average can be represented as:
P_Global_i_j = Weight_Alexnet × P_Alexnet_i_j + Weight_Googlnet × P_Googlnet_i_j + Weight_VGG × P_VGG_i_j
and obtaining a Global feature Matrix _ Global by using the new confidence coefficient:
Matrix_Global = (Vector_Global_1; Vector_Global_2; ...; Vector_Global_Num_train), where Vector_Global_i = (P_Global_i_1, P_Global_i_2, ..., P_Global_i_Num_scene)
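The accuracy counting of step 2-1 and the weighted fusion of step 2-2 could be sketched as follows; taking the weights proportional to each network's correct-identification count follows the reconstruction above and is an assumption, since the original gives the weight formulas only as images.

```python
import numpy as np

def count_correct(test_confidences, test_labels):
    """Step 2-1: count test pictures whose highest-confidence scene equals the true label."""
    predicted = np.argmax(test_confidences, axis=1)          # most confident scene per picture
    return int(np.sum(predicted == np.asarray(test_labels)))

def fuse_global_features(m_alexnet, m_googlnet, m_vgg, n_alexnet, n_googlnet, n_vgg):
    """Step 2-2 (assumed weighting): weighted average of the three scene feature matrices,
    with weights proportional to each network's number of correct test-set identifications."""
    total = n_alexnet + n_googlnet + n_vgg
    return (n_alexnet * m_alexnet + n_googlnet * m_googlnet + n_vgg * m_vgg) / total

# Matrix_Global = fuse_global_features(Matrix_Alexnet, Matrix_Googlnet, Matrix_VGG,
#                                      Num_Alexnet, Num_Googlnet, Num_VGG)
```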
and 3, framing the articles in the images of the training set by using the image labeling tool, labeling article labels, sending the generated labeling file into an SSD convolutional neural network for training to generate an article detection model, calling the model to identify the training set, and outputting the article labels of various types appearing in each image and the confidence coefficients thereof as an article local feature matrix of the training set.
In the step 3, the obtaining of the local features of the common articles in the family indoor scene specifically includes the following steps:
step 3-1, selecting the common articles in home scenes and setting the maximum number of article categories and the maximum number of articles, Max_category and Max_num respectively; framing the articles appearing in the training-set pictures with the picture labeling tool and recording the labels, numbers and positions of the articles in the pictures to obtain the article annotations. Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, Max_num∈N*.
Step 3-2, setting the maximum number Max_num_k of each article category, Max_num_k∈N*, and training on the article annotations with the SSD convolutional neural network to generate the article detection model Model_SSD. k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*}
Step 3-3, calling the Model_SSD generated in step 3-2 to identify the training set, the confidence that the r_k-th class-k article is recognized in the ith picture being expressed as:
P_object_i_k_r_k
The confidence vector of class-k articles recognized in the ith picture can be expressed as:
Vector_object_i_k = (P_object_i_k_1, P_object_i_k_2, ..., P_object_i_k_Max_num_k)
The confidences in each vector are arranged from largest to smallest; if the class-k articles in the picture do not reach the maximum number Max_num_k, all detected articles are kept, and if the number of articles exceeds the limit, only the Max_num_k articles with the highest confidence are retained.
The article local feature matrix Matrix_object is composed of the article confidence vectors:
Matrix_object has one row per training picture, its ith row being the concatenation (Vector_object_i_1, Vector_object_i_2, ..., Vector_object_i_Max_category);
Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*}
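A possible way to assemble one row of Matrix_object from SSD-style detections (step 3-3) is sketched below; zero-filling the slots of undetected articles is an assumption made here so that every row has a fixed length, and the `(class_index, confidence)` detection format is illustrative rather than the patent's.

```python
import numpy as np

def build_item_feature_row(detections, max_category, max_num_per_class):
    """Assemble one picture's local article feature row (step 3-3, sketch).

    detections: list of (class_index, confidence) pairs from an SSD-style detector,
                with class_index in [0, max_category).
    max_num_per_class: sequence giving Max_num_k for each class k.
    """
    row = []
    for k in range(max_category):
        confs = sorted((c for cls, c in detections if cls == k), reverse=True)
        confs = confs[:max_num_per_class[k]]                     # keep the top Max_num_k detections
        confs += [0.0] * (max_num_per_class[k] - len(confs))     # assumed zero padding for fixed length
        row.extend(confs)
    return np.asarray(row, dtype=float)

# Matrix_object = np.vstack([build_item_feature_row(det, Max_category, Max_num)
#                            for det in per_picture_detections])
```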
step 4, fusing the global and local article features: horizontally splicing the global feature matrix obtained in step 2 with the article local feature matrix obtained in step 3 to generate a comprehensive feature matrix, in which each row vector corresponds to the comprehensive feature of one picture in the training set and is divided into two parts according to the number of scene categories and the number of articles, the first half corresponding to the global feature of the picture and the second half to its local article features.
In step 4, the global and local article features are fused, specifically:
horizontally splicing the Global feature matrix Matrix_Global obtained in step 2 with the article local feature matrix Matrix_object obtained in step 3 to generate the comprehensive feature matrix Matrix_combination:
Matrix_combination = (Matrix_Global  Matrix_object)
The ith row vector of the comprehensive feature matrix, Vector_combination_i, represents the comprehensive feature of the ith picture in the training set; the first half of the row vector corresponds to the global feature of the picture and the second half to its local article features:
Vector_combination_i = (P_Global_i_1, ..., P_Global_i_Num_scene, Vector_object_i_1, ..., Vector_object_i_Max_category)
i={i∈[1,Num_train]∧i∈N * },Max_category∈N *
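The horizontal splicing of step 4 amounts to a plain matrix concatenation; a minimal NumPy sketch:

```python
import numpy as np

def fuse_features(matrix_global, matrix_object):
    """Step 4: horizontally splice the global and local article feature matrices.
    Both must have one row per training picture (Num_train rows)."""
    assert matrix_global.shape[0] == matrix_object.shape[0]
    return np.hstack([matrix_global, matrix_object])  # Num_train x (Num_scene + article slots)

# Matrix_combination = fuse_features(Matrix_Global, Matrix_object)
# Row i: first Num_scene entries = global feature, remaining entries = local article features.
```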
and 5, classifying the training set according to scene types by using a clustering algorithm, randomly taking the comprehensive features of a certain picture under each type of scene as initial vectors, respectively calculating the vector similarity between each feature in the comprehensive features and each feature in the central vector group, updating the initial vectors according to a certain rule according to the calculation result, iterating to a preset number of rounds, obtaining central vectors representing the comprehensive features of each scene, and forming the scene classification central vector group.
In step 5, the custom scene classification criterion is built with the following steps:
step 5-1, dividing the training data set into Num_scene parts by scene, the number of pictures corresponding to scene category type_j being Num_j, and partitioning the comprehensive feature matrix by scene category to obtain the comprehensive features corresponding to each scene category. In the comprehensive features corresponding to the sub-data set whose scene category is type_j, the confidence that the i_type_j-th picture is identified as scene category j and the confidence that the r_k-th class-k article is recognized in it can be expressed respectively as
P_Global_i_type_j_j and P_object_i_type_j_k_r_k
The confidence vector of class-k articles recognized in the picture can be expressed as:
Vector_object_i_type_j_k = (P_object_i_type_j_k_1, ..., P_object_i_type_j_k_Max_num_k)
The comprehensive feature matrix Matrix_combination_type_j corresponding to the sub-data set whose scene category is type_j can then be expressed as:
Matrix_combination_type_j = (Vector_combination_1; Vector_combination_2; ...; Vector_combination_Num_j), its rows being the comprehensive feature vectors of the Num_j pictures whose scene category is type_j;
i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-2, from each of the Num_scene sub comprehensive feature matrices Matrix_combination_type_j obtained in step 5-1, randomly selecting one row vector; in the present invention the first comprehensive feature under each scene category is exemplarily selected as the initial center vector, so the center vector representing scene type_j can be expressed as:
Vector_center_type_j (initialized to the first row of Matrix_combination_type_j)
and forming a central vector group Matrix _ center by the central vectors corresponding to the scenes:
Matrix_center = (Vector_center_type_1; Vector_center_type_2; ...; Vector_center_type_Num_scene)
Num_scene∈N*, i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-3, calculating the Euclidean distance between each of the comprehensive features and each vector of the center vector group, obtaining Num_scene distances per picture, and taking the scene category corresponding to the minimum value as the label produced by scene classification, giving the set List_detect:
List_detect = {list_detect_1, list_detect_2, ..., list_detect_Num_train}
comparing the List _ detect with the original scene tag List _ train of the picture, and updating the central vector according to the following rules:
if list_detect_i = list_train_i = j, then Vector_center_type_j is updated from Vector_combination_i using the updating coefficient γ according to the first update formula (given as an image in the original);
if list_detect_i ≠ list_train_i = j, then Vector_center_type_j is updated according to the second update formula (likewise given as an image);
wherein i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}, and γ is the updating coefficient.
Step 5-4, setting the number of iterative updates to Max_iteration, repeating steps 5-2 and 5-3 for Max_iteration iterations, and taking the final result Matrix_center_Max_iteration as the scene classification criterion.
Max_iteration∈N*, i_iteration={i_iteration∈[1,Max_iteration]∧i_iteration∈N*}
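Steps 5-2 to 5-4 can be sketched as the LVQ-style routine below; since the patent shows the center-update formulas only as images, the toward/away update direction, the choice of which center to move when the labels disagree, and the fixed learning rate `gamma` are assumptions, not the patent's exact rule.

```python
import numpy as np

def refine_centers(matrix_combination, train_labels, num_scene, gamma=0.1, max_iteration=10):
    """Steps 5-2 to 5-4 (sketch): build and iteratively refine scene classification centers.

    train_labels: integer scene indices in [0, num_scene) for each training picture.
    The update rule below is an LVQ-style assumption, not the original formula.
    """
    x_all = np.asarray(matrix_combination, dtype=float)
    labels = np.asarray(train_labels)
    # Step 5-2: initial centers = first comprehensive feature of each scene category.
    centers = np.vstack([x_all[labels == j][0] for j in range(num_scene)]).astype(float)
    for _ in range(max_iteration):
        for i, x in enumerate(x_all):
            distances = np.linalg.norm(centers - x, axis=1)  # Euclidean distances (step 5-3)
            j = int(np.argmin(distances))                    # detected label list_detect_i
            if j == labels[i]:
                centers[j] += gamma * (x - centers[j])       # assumed: pull matching center toward sample
            else:
                centers[j] -= gamma * (x - centers[j])       # assumed: push wrongly-winning center away
    return centers  # plays the role of Matrix_center_Max_iteration

# Matrix_center = refine_centers(Matrix_combination, train_scene_labels, Num_scene,
#                                gamma=0.1, max_iteration=Max_iteration)
```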
And 6, after the picture to be detected is correspondingly processed, obtaining comprehensive vectors, respectively calculating Euclidean distances between the comprehensive vectors and each vector in the scene classification center vector group, and outputting a scene classification label corresponding to the scene classification center vector with the minimum distance as an identification result.
In step 6, the scene classification result for the picture to be detected is obtained with the following steps:
Step 6-1, processing the picture to be recognized through steps 1 to 4 in sequence to obtain its comprehensive feature.
Step 6-2, calculating the set of Euclidean distances between the comprehensive feature obtained in step 6-1 and each vector of Matrix_center_Max_iteration obtained in step 5-4:
Distance = {d_1, d_2, ..., d_j, ..., d_Num_scene}, Num_scene∈N*
The scene category label corresponding to the minimum of the Num_scene distances is output as the recognition result.
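Step 6 reduces to a nearest-center lookup under Euclidean distance; a minimal sketch, assuming integer-indexed rows of the center matrix and a list mapping row index to scene label:

```python
import numpy as np

def classify_scene(comprehensive_vector, matrix_center, scene_labels):
    """Step 6: return the scene label of the nearest center by Euclidean distance.

    comprehensive_vector: fused global+article feature of the picture (from steps 1-4).
    scene_labels: label of the scene category for each row of matrix_center.
    """
    distances = np.linalg.norm(matrix_center - np.asarray(comprehensive_vector, dtype=float), axis=1)
    return scene_labels[int(np.argmin(distances))]

# Hypothetical usage:
# predicted = classify_scene(vector_combination_test, Matrix_center, ["bathroom", "bedroom", "dining room"])
```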
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment, but equivalent modifications or changes made by those skilled in the art according to the present disclosure should be included in the scope of the present invention as set forth in the appended claims.

Claims (4)

1. The method for identifying the family indoor scene based on the deep learning and integrating the global features and the local article information is characterized in that: the method comprises the following steps:
step 1, constructing a training set and a test set of pictures of family indoor scenes, sending the training set simultaneously into the three convolutional neural networks Alexnet, Googlnet and VGG for separate training to generate the corresponding network models, calling these models to identify the training set, and outputting the confidence that each picture belongs to each scene category as the three types of scene features of the training set;
step 2, weighting and averaging the three types of scene features obtained in the step 1, and taking the result as a global feature matrix of a training data set;
step 3, framing the articles in the pictures of the training set by using a picture labeling tool, labeling article labels, sending the generated labeling file into an SSD convolutional neural network for training to generate an article detection model, calling the model to identify the training set, and outputting the article labels of various types appearing in each picture and the confidence thereof as an article local feature matrix of the training set;
step 4, fusing the global and local article features: horizontally splicing the global feature matrix obtained in step 2 with the article local feature matrix obtained in step 3 to generate a comprehensive feature matrix, in which each row vector corresponds to the comprehensive feature of one picture in the training set and is divided into two parts according to the number of scene categories and the number of articles, the first half corresponding to the global feature of the picture and the second half to its local article features;
In step 4, the global and local article features are fused, specifically:
horizontally splicing the Global feature matrix Matrix_Global obtained in step 2 with the article local feature matrix Matrix_object obtained in step 3 to generate the comprehensive feature matrix Matrix_combination:
Matrix_combination = (Matrix_Global  Matrix_object)
The ith row vector of the comprehensive feature matrix, Vector_combination_i, represents the comprehensive feature of the ith picture in the training set; the first half of the row vector corresponds to the global feature of the picture and the second half to its local article features:
Vector_combination_i = (P_Global_i_1, ..., P_Global_i_Num_scene, Vector_object_i_1, ..., Vector_object_i_Max_category)
i={i∈[1,Num_train]∧i∈N*}, Max_category∈N*
step 5, classifying the training set according to scene types by using a clustering algorithm, randomly taking the comprehensive features of a certain picture under each type of scene as initial vectors, respectively calculating the vector similarity between each feature in the comprehensive features and each feature in the central vector group, updating the initial vectors according to the calculation results, iterating to preset turns to obtain central vectors representing the comprehensive features of each scene, and forming a scene classification central vector group;
in step 5, the user-defined scene classification criterion is built with the following steps:
step 5-1, dividing the training data set into Num_scene parts by scene, the number of pictures corresponding to scene category type_j being Num_j, and partitioning the comprehensive feature matrix by scene category to obtain the comprehensive features corresponding to each scene category; in the comprehensive features corresponding to the sub-data set whose scene category is type_j, the confidence that the i_type_j-th picture is identified as scene category j and the confidence that the r_k-th class-k article is recognized in it can be expressed respectively as
P_Global_i_type_j_j and P_object_i_type_j_k_r_k
The confidence vector of class-k articles recognized in the picture can be expressed as:
Vector_object_i_type_j_k = (P_object_i_type_j_k_1, ..., P_object_i_type_j_k_Max_num_k)
The comprehensive feature matrix Matrix_combination_type_j corresponding to the sub-data set whose scene category is type_j can then be expressed as:
Matrix_combination_type_j = (Vector_combination_1; Vector_combination_2; ...; Vector_combination_Num_j), its rows being the comprehensive feature vectors of the Num_j pictures whose scene category is type_j;
i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-2, from each of the Num_scene sub comprehensive feature matrices Matrix_combination_type_j obtained in step 5-1, randomly selecting one row vector; the first comprehensive feature under each scene category is selected as the initial center vector, so the center vector representing scene type_j can be expressed as:
Vector_center_type_j (initialized to the first row of Matrix_combination_type_j)
and forming a central vector group Matrix _ center by the central vectors corresponding to the scenes:
Matrix_center = (Vector_center_type_1; Vector_center_type_2; ...; Vector_center_type_Num_scene)
Num_scene∈N*, i_type_j={i_type_j∈[1,Num_j]∧i_type_j∈N*}, Max_category∈N*
step 5-3, calculating the Euclidean distance between each of the comprehensive features and each vector of the center vector group, obtaining Num_scene distances per picture, and taking the scene category corresponding to the minimum value as the label produced by scene classification, giving the set List_detect:
List_detect = {list_detect_1, list_detect_2, ..., list_detect_Num_train};
comparing the List _ detect with the original scene tag List _ train of the picture, and updating the central vector according to the following rules:
if list_detect_i = list_train_i = j, then Vector_center_type_j is updated from Vector_combination_i using the updating coefficient γ according to the first update formula (given as an image in the original);
if list_detect_i ≠ list_train_i = j, then Vector_center_type_j is updated according to the second update formula (likewise given as an image);
wherein i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}, and γ is the updating coefficient;
step 5-4, setting the number of iterative updates to Max_iteration, repeating steps 5-2 and 5-3 for Max_iteration iterations, and taking the final result Matrix_center_Max_iteration as the scene classification criterion;
Max_iteration∈N*, i_iteration={i_iteration∈[1,Max_iteration]∧i_iteration∈N*};
step 6, after the picture to be detected is correspondingly processed to obtain its comprehensive vector, calculating the Euclidean distance between the comprehensive vector and each vector of the scene classification center vector group, and outputting the scene category label corresponding to the scene classification center vector with the minimum distance as the identification result;
in step 6, the scene classification result for a test picture is obtained with the following steps:
step 6-1, processing the picture to be recognized through steps 1 to 4 in sequence to obtain its comprehensive feature;
step 6-2, calculating the set of Euclidean distances between the comprehensive feature obtained in step 6-1 and each vector of Matrix_center_Max_iteration obtained in step 5-4:
Distance = {d_1, d_2, ..., d_j, ..., d_Num_scene}, Num_scene∈N*
and outputting the scene category label corresponding to the minimum of the Num_scene distances as the recognition result.
2. The deep learning based home indoor scene recognition method fusing global features and local item information according to claim 1, characterized in that: in the step 1, three types of scene characteristics are obtained, and the specific steps are as follows:
Step 1-1, dividing the home indoor scenes into Num_scene categories in total, naming the jth scene category type_j to facilitate subsequent calculation, and dividing the color pictures of the various viewing angles under each scene category, retrieved from the web, into a training data set of Num_train pictures in total and a test data set of Num_test pictures in total, wherein Num_scene∈N*, Num_test∈N*, Num_train∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*};
Step 1-2, adding a scene category label set for the training set and the test set:
List_train = {list_train_1, list_train_2, ..., list_train_Num_train}
List_test = {list_test_1, list_test_2, ..., list_test_Num_test}
Num_train∈N*, Num_test∈N*
step 1-3, processing the training set into a corresponding data format according to the requirements of Alexnet, Googlnet and VGG convolutional neural networks, simultaneously sending the training set into the Alexnet, Googlnet and VGG convolutional neural networks, and respectively training and generating network models of Model _ Alexnet, Model _ Googlnet and Model _ VGG;
step 1-4, calling the generated network models to identify the training set, wherein the confidences that the ith picture is judged as the jth scene category are respectively P_Alexnet_i_j, P_Googlnet_i_j and P_VGG_i_j, and these confidences form the scene confidence vectors of the picture:
Vector_Alexnet_i = (P_Alexnet_i_1, P_Alexnet_i_2, ..., P_Alexnet_i_Num_scene)
Vector_Googlnet_i = (P_Googlnet_i_1, P_Googlnet_i_2, ..., P_Googlnet_i_Num_scene)
Vector_VGG_i = (P_VGG_i_1, P_VGG_i_2, ..., P_VGG_i_Num_scene)
the matrixes formed by the three types of confidence coefficient vectors are respectively used as scene characteristic matrixes:
Matrix_Alexnet = (Vector_Alexnet_1; Vector_Alexnet_2; ...; Vector_Alexnet_Num_train)
Matrix_Googlnet = (Vector_Googlnet_1; Vector_Googlnet_2; ...; Vector_Googlnet_Num_train)
Matrix_VGG = (Vector_VGG_1; Vector_VGG_2; ...; Vector_VGG_Num_train)
each of size Num_train × Num_scene, with one scene confidence vector per row;
Num_train∈N*, Num_scene∈N*, i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}.
3. the deep learning based home indoor scene recognition method fusing global features and local item information according to claim 1, characterized in that: in the step 2, the step 2 of weighted averaging the three types of scene features specifically includes the following steps:
Step 2-1, calling the network models obtained in step 1-3 to detect the pictures of the test set and obtain the confidence of each of the Num_scene scene categories for every picture, taking the scene category with the largest confidence as the judgment result and comparing it with the true label of the picture, the identification being counted as correct if they are the same; the numbers of correct identifications accumulated for Alexnet, Googlnet and VGG are recorded as Num_Alexnet, Num_Googlnet and Num_VGG, where Num_Alexnet∈N*, Num_Googlnet∈N*, Num_VGG∈N*;
Step 2-2, assigning weights Weight_Alexnet, Weight_Googlnet and Weight_VGG to the scene feature matrices Matrix_Alexnet, Matrix_Googlnet and Matrix_VGG respectively, wherein
Weight_Alexnet = Num_Alexnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_Googlnet = Num_Googlnet / (Num_Alexnet + Num_Googlnet + Num_VGG)
Weight_VGG = Num_VGG / (Num_Alexnet + Num_Googlnet + Num_VGG)
the confidence level that the ith picture in the training set is judged as the jth scene class after the weighted average can be expressed as:
P_Global_i_j = Weight_Alexnet × P_Alexnet_i_j + Weight_Googlnet × P_Googlnet_i_j + Weight_VGG × P_VGG_i_j
and obtaining a Global feature Matrix _ Global by using the new confidence coefficient:
Matrix_Global = (Vector_Global_1; Vector_Global_2; ...; Vector_Global_Num_train), where Vector_Global_i = (P_Global_i_1, P_Global_i_2, ..., P_Global_i_Num_scene)
i={i∈[1,Num_train]∧i∈N*}, j={j∈[1,Num_scene]∧j∈N*}.
4. the deep learning-based home indoor scene recognition method integrating global features and local article information according to claim 1, wherein: in the step 3, local features of common articles in a family indoor scene are obtained, and the method specifically comprises the following steps:
step 3-1, selecting the common articles in home scenes and setting the maximum number of article categories and the maximum number of articles, Max_category and Max_num respectively; framing the articles appearing in the training-set pictures with the picture labeling tool and recording the labels, numbers and positions of the articles in the pictures to obtain the article annotations; Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, Max_num∈N*;
Step 3-2, setting the maximum number Max_num_k of each article category, Max_num_k∈N*, and training on the article annotations with the SSD convolutional neural network to generate the article detection model Model_SSD; k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*};
Step 3-3, calling the Model_SSD generated in step 3-2 to identify the training set, the confidence that the r_k-th class-k article is recognized in the ith picture being expressed as:
P_object_i_k_r_k
The confidence vector of class-k articles recognized in the ith picture can be expressed as:
Vector_object_i_k = (P_object_i_k_1, P_object_i_k_2, ..., P_object_i_k_Max_num_k)
The confidences in each vector are arranged from largest to smallest; if the class-k articles in the picture do not reach the maximum number Max_num_k, all detected articles are kept, and if the number of articles exceeds the limit, only the Max_num_k articles with the highest confidence are retained;
The article local feature matrix Matrix_object is composed of the article confidence vectors:
Matrix_object has one row per training picture, its ith row being the concatenation (Vector_object_i_1, Vector_object_i_2, ..., Vector_object_i_Max_category);
i={i∈[1,Num_train]∧i∈N*}, Max_category∈N*, k={k∈[1,Max_category]∧k∈N*}, r_k={r_k∈[1,Max_num_k]∧r_k∈N*}.
CN201910151241.9A 2019-02-28 2019-02-28 Home indoor scene recognition method based on deep learning and integrating global features and local article information Active CN109858565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910151241.9A CN109858565B (en) 2019-02-28 2019-02-28 Home indoor scene recognition method based on deep learning and integrating global features and local article information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910151241.9A CN109858565B (en) 2019-02-28 2019-02-28 Home indoor scene recognition method based on deep learning and integrating global features and local article information

Publications (2)

Publication Number Publication Date
CN109858565A CN109858565A (en) 2019-06-07
CN109858565B true CN109858565B (en) 2022-08-12

Family

ID=66899355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910151241.9A Active CN109858565B (en) 2019-02-28 2019-02-28 Home indoor scene recognition method based on deep learning and integrating global features and local article information

Country Status (1)

Country Link
CN (1) CN109858565B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751218B (en) * 2019-10-22 2023-01-06 Oppo广东移动通信有限公司 Image classification method, image classification device and terminal equipment
CN112633064B (en) * 2020-11-19 2023-12-15 深圳银星智能集团股份有限公司 Scene recognition method and electronic equipment
CN112632378B (en) * 2020-12-21 2021-08-24 广东省信息网络有限公司 Information processing method based on big data and artificial intelligence and data server
CN113177133B (en) * 2021-04-23 2024-03-29 深圳依时货拉拉科技有限公司 Image retrieval method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165682A (en) * 2018-08-10 2019-01-08 中国地质大学(武汉) A kind of remote sensing images scene classification method merging depth characteristic and significant characteristics
CN109255364A (en) * 2018-07-12 2019-01-22 杭州电子科技大学 A kind of scene recognition method generating confrontation network based on depth convolution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679189B (en) * 2012-09-14 2017-02-01 华为技术有限公司 Method and device for recognizing scene

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255364A (en) * 2018-07-12 2019-01-22 杭州电子科技大学 A kind of scene recognition method generating confrontation network based on depth convolution
CN109165682A (en) * 2018-08-10 2019-01-08 中国地质大学(武汉) A kind of remote sensing images scene classification method merging depth characteristic and significant characteristics

Also Published As

Publication number Publication date
CN109858565A (en) 2019-06-07

Similar Documents

Publication Publication Date Title
CN109858565B (en) Home indoor scene recognition method based on deep learning and integrating global features and local article information
CN108416394B (en) Multi-target detection model building method based on convolutional neural networks
CN106897670B (en) Express violence sorting identification method based on computer vision
CN104573669B (en) Image object detection method
Brdiczka et al. Learning situation models in a smart home
CN109583315B (en) Multichannel rapid human body posture recognition method for intelligent video monitoring
CN110991435A (en) Express waybill key information positioning method and device based on deep learning
CN104732208A (en) Video human action reorganization method based on sparse subspace clustering
CN112819065B (en) Unsupervised pedestrian sample mining method and unsupervised pedestrian sample mining system based on multi-clustering information
CN110807434A (en) Pedestrian re-identification system and method based on combination of human body analysis and coarse and fine particle sizes
CN111882586B (en) Multi-actor target tracking method oriented to theater environment
CN113239916B (en) Expression recognition and classroom state evaluation method, device and medium
CN107085729B (en) Bayesian inference-based personnel detection result correction method
CN109509228A (en) Method for positioning one or more candidate digital images
CN110458022A (en) It is a kind of based on domain adapt to can autonomous learning object detection method
CN113065516A (en) Unsupervised pedestrian re-identification system and method based on sample separation
CN113723277B (en) Learning intention monitoring method and system integrated with multi-mode visual information
CN113221721A (en) Image recognition method, device, equipment and medium
CN111695484A (en) Method for classifying gesture postures
CN115050044B (en) Cross-modal pedestrian re-identification method based on MLP-Mixer
CN107392102A (en) Based on the family of local image characteristics and multi-instance learning group photo and non-family safe group photo sorting technique
CN113223018A (en) Fine-grained image analysis processing method
CN110717544A (en) Pedestrian attribute analysis method and system under vertical fisheye lens
CN115294441B (en) Robot scene recognition and analysis method integrating three characteristics by attention
CN115223103B (en) High-altitude parabolic detection method based on digital image processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant