CN110096986B - Intelligent museum exhibit guiding method based on image recognition and text fusion - Google Patents

Intelligent museum exhibit guiding method based on image recognition and text fusion Download PDF

Info

Publication number
CN110096986B
CN110096986B · CN201910333050.4A
Authority
CN
China
Prior art keywords
point
network
classification
sentence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910333050.4A
Other languages
Chinese (zh)
Other versions
CN110096986A (en)
Inventor
王斌
杨晓春
张斯婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201910333050.4A priority Critical patent/CN110096986B/en
Publication of CN110096986A publication Critical patent/CN110096986A/en
Application granted granted Critical
Publication of CN110096986B publication Critical patent/CN110096986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 - Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 - Services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes

Abstract

The invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, and relates to the technical field of image recognition and text fusion. The method comprises the following steps. Step 1: collecting exhibit images to obtain an exhibit image set. Step 2: establishing a recognition model based on a convolutional neural network structure; training the initial recognition model with a picture X to obtain a loss function L(X), training the parameters of the initial model according to the loss function to obtain the recognition model, and obtaining the recognition result of picture X. Step 3: crawling and collecting relevant information using the recognition result as a keyword to obtain an information data set. Step 4: extracting a summary T from the collected information data set. Step 5: performing information fusion on the summary T obtained in step 4. The method can improve the visiting experience of visitors, reduce the daily operating cost of the museum and reduce labor cost.

Description

Intelligent museum exhibit guiding method based on image recognition and text fusion
Technical Field
The invention relates to the technical field of image recognition and text fusion, in particular to an intelligent museum exhibit guiding method based on image recognition and text fusion.
Background
Museums are an important vehicle for people to learn history, understand the development of specialized fields and enrich their spiritual and cultural life. In a traditional museum the guiding service is mainly provided by human guides, which offers little freedom and a single form of presentation: visitors cannot efficiently filter for the knowledge they want to learn, may even have to pay extra fees, and local congestion in the exhibition hall can sometimes result. Existing museum guiding technology is mainly based on RFID (radio frequency identification): visitors passively receive voice information about nearby exhibits while touring, so the experience is poor; moreover, a large number of wireless transmitters must be installed in the museum as electronic tags for the exhibits, the installation cost is high, and the approach is constrained by the museum venue. In the mobile-phone APP guiding systems based on two-dimensional codes that have emerged in recent years, the relevant information of an exhibit is retrieved by scanning its QR-code label, which has the drawback that a label must be attached to every exhibit. More importantly, the descriptions returned by these methods are obtained by querying a database, and the information in a database is usually updated slowly, so the information lags behind.
Disclosure of Invention
The technical problem to be solved by the invention is, in view of the defects of the prior art, to provide an intelligent museum exhibit guiding method based on image recognition and text fusion; the method can improve the visiting experience of visitors, reduce the daily operating cost of the museum and reduce labor cost.
In order to solve the above technical problem, the invention adopts the following technical solution:
the invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, which comprises the following steps:
step 1: collecting images of the exhibits, adjusting all the images into a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set are images with correct classification labels;
step 2: establishing a recognition model based on a convolutional neural network structure; taking parameters of a feature extraction layer and a classification layer in the VGG network structure model as parameters in an initial identification model based on a convolutional neural network structure; training an initial recognition model based on a convolutional neural network structure by using the picture X in the exhibit picture set in the step 1 to obtain a loss function L (X), training and recognizing parameters in the initial model according to the loss function to obtain a recognition model based on the convolutional neural network structure, and obtaining a recognition result of the picture X;
and step 3: crawling collection of relevant information is carried out according to the identification result obtained in the step 2 as a keyword; performing keyword segmentation on the recognition result according to a crawler technology, and collecting information in hundred degrees according to segmented keywords to obtain an information data set;
and 4, step 4: extracting the abstract T from the acquired information data set; generating an abstract by using an extraction mode, removing redundant information in the information data set, and extracting the main content of the information data set; arranging the sentences in the information data set in step 3 into a sentence sequence A1、A2、…、AnForming a document D, wherein n is the total number of sentences, and m sentences in the document D are selected according to the probability to form a summary T;
the probability selection method comprises the following steps: the extraction type abstract generating model comprises a sentence encoder, a document encoder and a sentence extractor; wherein, the sentence encoder uses word2vec to obtain 200-dimensional vector of each word and uses convolution and pooling operation to obtain encoding vector of the sentence; using the LSTM network in the document encoder, the sentence extractor part takes the keywords as extra information to participate in the process of scoring sentences, with the ultimate goal of getting a higher score for the sentences containing the keywords, i.e.:
[Formula images GDA0003488345400000021 to GDA0003488345400000023: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image GDA0003488345400000024 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary.
Step 5: performing information fusion on the summary T obtained in step 4;
The sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
The specific steps of the step 2 are as follows:
step 2.1: initializing a feature extraction layer and classification layer parameters in a classification sub-network by using a VGG network structure model, inputting pictures in a training set, searching a response value of the full-connection layer to an input image, and selecting an area with the highest value as a square of a suggested attention area;
the classification subnetwork f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
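A minimal PyTorch sketch of this part is given below, under stated assumptions: the feature extractor mirrors the three convolution stages with two convolutional layers and one max-pooling layer each, and the classification sub-network f is a single fully connected layer followed by softmax; the channel widths and other hyper-parameters are illustrative, not prescribed by the method.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """w_c * X: three convolution stages, each with two conv layers and one
    max-pooling layer (channel widths are illustrative assumptions)."""
    def __init__(self):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
        self.stages = nn.Sequential(stage(3, 64), stage(64, 128), stage(128, 256))

    def forward(self, x):
        return self.stages(x)        # feature map shared by sub-networks f and g

class ClassificationSubnet(nn.Module):
    """Classification sub-network f: one fully connected layer plus softmax."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        logits = self.fc(torch.flatten(feat, 1))
        return torch.softmax(logits, dim=1)   # P(X): per-class probabilities
```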
attention is advised that subnetwork g consists of two fully-connected layers, of which the last fully-connected layer has 3 output channels, each corresponding to tx,ty,tl
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
the process of cutting and extracting the attention area from the original image comprises the following steps:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
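The attention proposal sub-network g and the masking operation X_att = X ⊙ M(t_x, t_y, t_l) can be sketched as follows; the sigmoid-normalized outputs and the hard box mask are assumptions, since the text above only specifies two fully connected layers with three outputs and a mask.

```python
import torch
import torch.nn as nn

class AttentionProposal(nn.Module):
    """Attention proposal sub-network g: two fully connected layers, the last
    of which has 3 outputs (t_x, t_y, t_l); the hidden width and the sigmoid
    normalization to [0, 1] are assumptions."""
    def __init__(self, feat_dim, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, feat):
        return self.fc(torch.flatten(feat, 1))   # (B, 3): center and half side length

def crop_attention(x, t):
    """X_att = X ⊙ M(t_x, t_y, t_l) with a hard box mask; t has shape (B, 3)
    holding (t_x, t_y, t_l) in coordinates normalized to [0, 1]."""
    b, c, h, w = x.shape
    t_x, t_y, t_l = (t[:, i].view(b, 1, 1, 1) for i in range(3))
    ys = torch.arange(h, device=x.device).view(1, 1, h, 1) / h
    xs = torch.arange(w, device=x.device).view(1, 1, 1, w) / w
    mask = ((xs - t_x).abs() <= t_l) & ((ys - t_y).abs() <= t_l)
    return x * mask.float()
```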
step 2.2: training the whole network in an alternating mode, keeping parameters of the attention area suggestion sub-network unchanged, and scaling each attention area according to a bilinear interpolation method; each attention area suggested output area is enlarged to the same size pixels as the input image X;
the scaling method comprises the following steps:
first, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
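A sketch of the bilinear scaling follows: bilinear_pixel evaluates the three interpolation formulas above directly for one point of a single-channel image, while upscale_region shows how the cropped attention area would be enlarged to the input resolution in practice; the helper names and the 224-pixel target are assumptions.

```python
import torch
import torch.nn.functional as F

def bilinear_pixel(img, x, y):
    """Evaluate the interpolation formulas above for one point p = (x, y):
    horizontal interpolation to R1 and R2, then vertical interpolation to p.
    img is an (H, W) tensor; x and y are float coordinates."""
    x1, y1 = int(x), int(y)
    x2, y2 = min(x1 + 1, img.shape[1] - 1), min(y1 + 1, img.shape[0] - 1)
    if x2 == x1 or y2 == y1:                       # on the border: nothing to blend
        return img[y1, x1]
    q11, q21 = img[y1, x1], img[y1, x2]            # Q11 = (x1, y1), Q21 = (x2, y1)
    q12, q22 = img[y2, x1], img[y2, x2]            # Q12 = (x1, y2), Q22 = (x2, y2)
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21   # R1 = (x, y1)
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22   # R2 = (x, y2)
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2   # Q(p)

def upscale_region(x, t, out_size=224):
    """Crop the square attention area described by t = (t_x, t_y, t_l)
    (normalized center and half side length) and enlarge it to the input
    resolution with bilinear interpolation (out_size is an assumed value)."""
    b, c, h, w = x.shape
    out = []
    for i in range(b):
        cx, cy = float(t[i, 0]) * w, float(t[i, 1]) * h
        half = float(t[i, 2]) * min(h, w)
        x1 = max(min(int(cx - half), w - 1), 0)
        x2 = max(min(int(cx + half), w - 1), x1)
        y1 = max(min(int(cy - half), h - 1), 0)
        y2 = max(min(int(cy + half), h - 1), y1)
        region = x[i:i + 1, :, y1:y2 + 1, x1:x2 + 1]
        out.append(F.interpolate(region, size=(out_size, out_size),
                                 mode="bilinear", align_corners=True))
    return torch.cat(out, dim=0)
```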
step 2.3: optimizing the classification loss of each scale according to the zoomed attention area, wherein the learning of each scale comprises a classification sub-network and an attention area suggestion sub-network; the parameters of the convolutional layer and the classification layer are fixed and a loss function is used for optimizing the attention area suggestion sub-network;
the loss function l (x) is composed of two parts, classification loss and ranking loss, and is expressed according to the loss function l (x):
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin denotes the threshold.
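A sketch of how the combined classification and ranking losses could be computed over the cascaded scales is shown below, assuming each scale returns its softmax probability vector P(X); the margin value and the averaging over the batch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(probs_per_scale, labels, margin=0.05):
    """probs_per_scale: list of (B, num_classes) softmax outputs, one per scale.
    Classification loss: negative log-likelihood of the true class at every
    scale. Ranking loss: hinge term pushing the finer scale s+1 to assign the
    true class a higher probability than scale s."""
    cls_loss = sum(F.nll_loss(torch.log(p + 1e-8), labels) for p in probs_per_scale)
    rank_loss = 0.0
    for p_s, p_next in zip(probs_per_scale, probs_per_scale[1:]):
        po_s = p_s.gather(1, labels.unsqueeze(1)).squeeze(1)        # P_o^(s)
        po_next = p_next.gather(1, labels.unsqueeze(1)).squeeze(1)  # P_o^(s+1)
        rank_loss = rank_loss + torch.clamp(po_s - po_next + margin, min=0).mean()
    return cls_loss + rank_loss
```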
The beneficial effects of the above technical solution are as follows. With the intelligent museum exhibit guiding method based on image recognition and text fusion provided by the invention, a visitor photographs an exhibit and the category name of the exhibit is recognized accurately; keywords are then extracted from the exhibit name to collect relevant information, and an up-to-date introduction is returned to the visitor through automatic summarization and text fusion. The technical solution learns the characteristics of museum exhibits with a convolutional neural network and builds a neural network framework that returns, with high accuracy, the corresponding category name for an image taken by the visitor in real time; relevant information is then collected according to the recognition result, and the information is extracted and integrated, thereby realizing an intelligent guiding function for museum exhibits. Cyclic learning over different scales identifies fine-grained, similar image categories accurately; the subsequent online information collection and processing provide the latest introduction information; and adding the keyword information to the loss function during information processing makes the processing result more accurate.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of summary generation according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an identification and interpretation result of an exhibit provided in the embodiment of the present invention, wherein a is a schematic diagram of a test interface; b is a picture uploading schematic diagram; c is a schematic diagram of an image recognition display interface; d is a detailed information interface schematic diagram of the text.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 1, the method of the present embodiment is as follows.
The invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, which comprises the following steps:
step 1: collecting images of the exhibits, adjusting all the images into a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set are images with correct classification labels;
step 2: establishing a recognition model based on a convolutional neural network structure; taking parameters of a feature extraction layer and a classification layer in the VGG network structure model as parameters in an initial identification model based on a convolutional neural network structure; training an initial recognition model based on a convolutional neural network structure by using the picture X in the exhibit picture set in the step 1 to obtain a loss function L (X), training and recognizing parameters in the initial model according to the loss function to obtain a recognition model based on the convolutional neural network structure, and obtaining a recognition result of the picture X; as shown in fig. 2, a schematic diagram of a three-stage cascade network, which may be composed of multiple stages, and in this embodiment, a 3-stage cascade network is adopted;
step 2.1: initializing a feature extraction layer and classification layer parameters in a classification sub-network by using a VGG (visual Geometry group) network structure model, inputting pictures in a training set, searching a response value of a full-connection layer to an input image, and selecting an area with the highest value as a square of a suggested attention area;
the classification subnetwork f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
attention suggests that subnetwork g consists of two fully connected layersThe number of the output channels of the last full connection layer is 3, and the output channels respectively correspond to tx,ty,tl
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
the process of cutting and extracting the attention area from the original image comprises the following steps:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
step 2.2: training the whole network in an alternating mode, keeping parameters of the attention area suggestion sub-network unchanged, and scaling each attention area according to a bilinear interpolation method; each attention area suggested output area is enlarged to the same size pixels as the input image X;
the scaling method comprises the following steps:
first, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
step 2.3: optimizing the classification loss of each scale according to the zoomed attention area, wherein the learning of each scale comprises a classification sub-network and an attention area suggestion sub-network; the parameters of the convolutional layer and the classification layer are fixed and a loss function is used for optimizing the attention area suggestion sub-network;
the loss function l (x) is composed of two parts, classification loss and ranking loss, and is expressed according to the loss function l (x):
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class; the upper limit of the scale s is variable;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin represents a threshold;
and step 3: crawling collection of relevant information is carried out according to the identification result obtained in the step 2 as a keyword; performing keyword segmentation on the recognition result according to a crawler technology, and collecting information in hundred degrees according to segmented keywords to obtain an information data set;
and 4, step 4: extracting the abstract T from the acquired information data set, as shown in FIG. 3; generating an abstract by using an extraction mode, removing redundant information in the information data set, and extracting the main content of the information data set; arranging the sentences in the information data set in step 3 into a sentence sequence A1、A2、…、AnAnd (4) forming a document D, wherein n is the total number of sentences, and m sentences in the document D are selected according to the probability to form a summary T.
The probability-based selection is as follows: the extractive summary generation model comprises a sentence encoder, a document encoder and a sentence extractor; the sentence encoder uses word2vec to obtain a 200-dimensional vector for each word and applies convolution and pooling operations to obtain the encoding vector of each sentence; the document encoder uses an LSTM network; the sentence extractor takes the keywords as extra information when scoring sentences, the final goal being that sentences containing the keywords obtain higher scores, i.e.:
[Formula images GDA0003488345400000071 to GDA0003488345400000073: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image GDA0003488345400000074 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary.
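Because the scoring formulas themselves appear only as images, the PyTorch sketch below shows one plausible reading of the extractor: a convolution-and-pooling sentence encoder over 200-dimensional word2vec vectors, an LSTM document encoder, and a single-hidden-layer scorer that mixes the LSTM state h_μ with a keyword summary h'_μ; the exact way the two are combined is an assumption.

```python
import torch
import torch.nn as nn

class KeywordAwareExtractor(nn.Module):
    """Sentence encoder (conv + max pooling over 200-d word2vec vectors),
    LSTM document encoder, and a keyword-aware scorer. The combination
    u_0 * tanh(W_e h_mu + W'_e h'_mu) below is an assumed reading of the
    formulas that are only given as images above."""
    def __init__(self, emb_dim=200, sent_dim=100, hidden=200):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, sent_dim, kernel_size=3, padding=1)
        self.doc_lstm = nn.LSTM(sent_dim, hidden, batch_first=True)
        self.W_e = nn.Linear(hidden, hidden, bias=False)
        self.W_e_prime = nn.Linear(emb_dim, hidden, bias=False)
        self.u_0 = nn.Linear(hidden, 1, bias=False)

    def encode_sentence(self, words):                 # words: (num_words, 200)
        x = words.t().unsqueeze(0)                    # -> (1, 200, num_words)
        return torch.relu(self.conv(x)).max(dim=2).values.squeeze(0)   # pooling

    def forward(self, sentences, keyword_vecs):
        # sentences: list of (num_words_i, 200) tensors; keyword_vecs: (b, 200)
        sent_mat = torch.stack([self.encode_sentence(s) for s in sentences])
        h, _ = self.doc_lstm(sent_mat.unsqueeze(0))   # h_mu for every sentence
        h_prime = self.W_e_prime(keyword_vecs.mean(dim=0))   # keyword summary h'_mu
        scores = self.u_0(torch.tanh(self.W_e(h.squeeze(0)) + h_prime)).squeeze(1)
        return torch.sigmoid(scores)                  # selection probability per sentence
```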
Step 5: performing information fusion on the summary T obtained in step 4;
The sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
Fig. 4 shows the recognition and interpretation results obtained with the method. The APP is opened first, entering the application interface shown in a; clicking the 'take picture' or 'upload picture' button uploads a picture, as shown in b; a brief textual description of the image is then returned, and the detailed information on the related textual topics can be viewed by clicking, as shown in c and d.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (2)

1. An intelligent museum exhibit guiding method based on image recognition and text fusion, characterized in that the method comprises the following steps:
Step 1: collecting images of the exhibits, adjusting all the images to a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set carry correct classification labels;
Step 2: establishing a recognition model based on a convolutional neural network structure; taking the parameters of the feature extraction layers and the classification layer of a VGG network model as the initial parameters of the recognition model; training the initial recognition model with the pictures X in the exhibit image set of step 1 to obtain a loss function L(X), training the parameters of the initial model according to the loss function to obtain the recognition model based on the convolutional neural network structure, and obtaining the recognition result of picture X;
Step 3: crawling and collecting relevant information using the recognition result obtained in step 2 as a keyword; segmenting the recognition result into keywords, and using a web crawler to collect information from Baidu according to the segmented keywords to obtain an information data set;
Step 4: extracting a summary T from the collected information data set; generating the summary in an extractive manner, removing redundant information from the information data set and retaining its main content; arranging the sentences of the information data set of step 3 into a sentence sequence A_1, A_2, …, A_n to form a document D, wherein n is the total number of sentences, and selecting m sentences of document D according to probability to form the summary T;
wherein the probability-based selection is as follows: the extractive summary generation model comprises a sentence encoder, a document encoder and a sentence extractor; the sentence encoder uses word2vec to obtain a 200-dimensional vector for each word and applies convolution and pooling operations to obtain the encoding vector of each sentence; the document encoder uses an LSTM network; the sentence extractor takes the keywords as extra information when scoring sentences, the final goal being that sentences containing the keywords obtain higher scores, i.e.:
[Formula images FDA0003488345390000011 to FDA0003488345390000013: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image FDA0003488345390000014 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary;
Step 5: performing information fusion on the summary T obtained in step 4;
wherein the sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
2. The intelligent museum exhibit guiding method based on image recognition and text fusion as claimed in claim 1, characterized in that the specific steps of step 2 are as follows:
Step 2.1: initializing the parameters of the feature extraction layers and the classification layer of the classification sub-network with a VGG network model, inputting the pictures of the training set, examining the response values of the fully connected layer to the input image, and selecting the square area with the highest response as the proposed attention area;
The classification sub-network f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
The attention proposal sub-network g consists of two fully connected layers; the last fully connected layer has 3 output channels, corresponding to t_x, t_y, t_l:
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
The attention area is cropped from the original image as:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
Step 2.2: training the whole network in an alternating manner, keeping the parameters of the attention area proposal sub-network unchanged, and scaling each attention area by bilinear interpolation so that each proposed attention area is enlarged to the same pixel size as the input image X;
The scaling method is as follows:
First, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
Step 2.3: optimizing the classification loss at each scale using the scaled attention area, wherein the learning at each scale involves a classification sub-network and an attention area proposal sub-network; the parameters of the convolutional layers and the classification layer are fixed and the loss function is used to optimize the attention area proposal sub-network;
The loss function L(X) is composed of two parts, a classification loss and a ranking loss:
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin denotes the threshold.
CN201910333050.4A 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion Active CN110096986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910333050.4A CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910333050.4A CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Publications (2)

Publication Number Publication Date
CN110096986A CN110096986A (en) 2019-08-06
CN110096986B true CN110096986B (en) 2022-04-12

Family

ID=67445681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910333050.4A Active CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Country Status (1)

Country Link
CN (1) CN110096986B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104238884A (en) * 2014-09-12 2014-12-24 北京诺亚星云科技有限责任公司 Dynamic information presentation and user interaction system and equipment based on digital panorama
CN207198847U (en) * 2017-06-27 2018-04-06 徐桐 A kind of system of the architecture information guide to visitors based on image comparison identification
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109145105A (en) * 2018-07-26 2019-01-04 福州大学 A kind of text snippet model generation algorithm of fuse information selection and semantic association

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018071403A1 (en) * 2016-10-10 2018-04-19 Insurance Services Office, Inc. Systems and methods for optical character recognition for low-resolution documents
US10679011B2 (en) * 2017-05-10 2020-06-09 Oracle International Corporation Enabling chatbots by detecting and supporting argumentation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104238884A (en) * 2014-09-12 2014-12-24 北京诺亚星云科技有限责任公司 Dynamic information presentation and user interaction system and equipment based on digital panorama
CN207198847U (en) * 2017-06-27 2018-04-06 徐桐 A kind of system of the architecture information guide to visitors based on image comparison identification
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109145105A (en) * 2018-07-26 2019-01-04 福州大学 A kind of text snippet model generation algorithm of fuse information selection and semantic association

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An Adaptive Hierarchical Compositional Model for Phrase Embedding; Bing Li et al.; ResearchGate; 2018-07-24; pp. 4144-4151 *
An Efficient Method for High Quality and Cohesive Topical Phrase Mining; Xiaochun Yang et al.; IEEE; 2019-01-31; pp. 120-137 *
Smart city guide based on mobile augmented reality; Zhang Yunchao et al.; Journal of Computer Research and Development; 2014-12-31; pp. 302-310 *

Also Published As

Publication number Publication date
CN110096986A (en) 2019-08-06

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN109635171B (en) Fusion reasoning system and method for news program intelligent tags
CN107766371B (en) Text information classification method and device
CN110363194B (en) NLP-based intelligent examination paper reading method, device, equipment and storage medium
CN112528963A (en) Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN
JP2022023770A (en) Method and device for recognizing letter, electronic apparatus, computer readable storage medium and computer program
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN113609305B (en) Method and system for constructing regional knowledge map of film and television works based on BERT
CN113297955B (en) Sign language word recognition method based on multi-mode hierarchical information fusion
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113064995A (en) Text multi-label classification method and system based on deep learning of images
CN111967267B (en) XLNET-based news text region extraction method and system
CN112541347A (en) Machine reading understanding method based on pre-training model
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN109492168A (en) A kind of visualization tour interest recommendation information generation method based on tourism photo
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN115115740A (en) Thinking guide graph recognition method, device, equipment, medium and program product
CN112148874A (en) Intention identification method and system capable of automatically adding potential intention of user
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110096986B (en) Intelligent museum exhibit guiding method based on image recognition and text fusion
CN107656760A (en) Data processing method and device, electronic equipment
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Murali et al. Remote sensing image captioning via multilevel attention-based visual question answering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant