CN110096986B - Intelligent museum exhibit guiding method based on image recognition and text fusion - Google Patents

Intelligent museum exhibit guiding method based on image recognition and text fusion Download PDF

Info

Publication number
CN110096986B
CN110096986B · CN201910333050.4A
Authority
CN
China
Prior art keywords
point
network
classification
sentence
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910333050.4A
Other languages
Chinese (zh)
Other versions
CN110096986A (en)
Inventor
王斌
杨晓春
张斯婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201910333050.4A priority Critical patent/CN110096986B/en
Publication of CN110096986A publication Critical patent/CN110096986A/en
Application granted granted Critical
Publication of CN110096986B publication Critical patent/CN110096986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 - Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 - Services
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes

Abstract

The invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, and relates to the technical field of image recognition and text fusion. The method comprises the following steps. Step 1: collecting exhibit images to obtain an exhibit image set. Step 2: establishing a recognition model based on a convolutional neural network structure; training the initial recognition model with a picture X to obtain a loss function L(X), training the parameters of the initial model according to the loss function to obtain the recognition model, and obtaining the recognition result of picture X. Step 3: crawling and collecting relevant information using the recognition result as a keyword to obtain an information data set. Step 4: extracting a summary T from the collected information data set. Step 5: performing information fusion on the summary T obtained in step 4. The method can improve the visiting experience of visitors, reduce the daily operating cost of the museum and reduce labor cost.

Description

Intelligent museum exhibit guiding method based on image recognition and text fusion
Technical Field
The invention relates to the technical field of image recognition and text fusion, in particular to an intelligent museum exhibit guiding method based on image recognition and text fusion.
Background
Museums are an important vehicle for people to learn history, understand the development of specialized fields and enrich their spiritual and cultural life. In a traditional museum the guiding service is mainly provided by human guides, which offers little freedom and a single form of presentation: visitors cannot efficiently filter for the knowledge they want to learn, may even have to pay extra fees, and local congestion in the exhibition hall can sometimes result. Existing museum guiding technology is mainly based on RFID (radio frequency identification): visitors passively receive voice information about nearby exhibits while touring, so the experience is poor; moreover, a large number of wireless transmitters must be installed in the museum as electronic tags for the exhibits, the installation cost is high, and the approach is constrained by the museum venue. In the mobile-phone APP guiding systems based on two-dimensional codes that have emerged in recent years, the relevant information of an exhibit is retrieved by scanning its QR-code label, which has the drawback that a label must be attached to every exhibit. More importantly, the descriptions returned by these methods are obtained by querying a database, and the information in a database is usually updated slowly, so the information lags behind.
Disclosure of Invention
The technical problem to be solved by the invention is, in view of the defects of the prior art, to provide an intelligent museum exhibit guiding method based on image recognition and text fusion; the method can improve the visiting experience of visitors, reduce the daily operating cost of the museum and reduce labor cost.
In order to solve the above technical problem, the invention adopts the following technical solution:
the invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, which comprises the following steps:
step 1: collecting images of the exhibits, adjusting all the images into a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set are images with correct classification labels;
step 2: establishing a recognition model based on a convolutional neural network structure; taking parameters of a feature extraction layer and a classification layer in the VGG network structure model as parameters in an initial identification model based on a convolutional neural network structure; training an initial recognition model based on a convolutional neural network structure by using the picture X in the exhibit picture set in the step 1 to obtain a loss function L (X), training and recognizing parameters in the initial model according to the loss function to obtain a recognition model based on the convolutional neural network structure, and obtaining a recognition result of the picture X;
and step 3: crawling collection of relevant information is carried out according to the identification result obtained in the step 2 as a keyword; performing keyword segmentation on the recognition result according to a crawler technology, and collecting information in hundred degrees according to segmented keywords to obtain an information data set;
and 4, step 4: extracting the abstract T from the acquired information data set; generating an abstract by using an extraction mode, removing redundant information in the information data set, and extracting the main content of the information data set; arranging the sentences in the information data set in step 3 into a sentence sequence A1、A2、…、AnForming a document D, wherein n is the total number of sentences, and m sentences in the document D are selected according to the probability to form a summary T;
the probability selection method comprises the following steps: the extraction type abstract generating model comprises a sentence encoder, a document encoder and a sentence extractor; wherein, the sentence encoder uses word2vec to obtain 200-dimensional vector of each word and uses convolution and pooling operation to obtain encoding vector of the sentence; using the LSTM network in the document encoder, the sentence extractor part takes the keywords as extra information to participate in the process of scoring sentences, with the ultimate goal of getting a higher score for the sentences containing the keywords, i.e.:
[Formula images GDA0003488345400000021 to GDA0003488345400000023: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image GDA0003488345400000024 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary.
Step 5: performing information fusion on the summary T obtained in step 4;
The sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
The specific steps of the step 2 are as follows:
step 2.1: initializing a feature extraction layer and classification layer parameters in a classification sub-network by using a VGG network structure model, inputting pictures in a training set, searching a response value of the full-connection layer to an input image, and selecting an area with the highest value as a square of a suggested attention area;
the classification subnetwork f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
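A minimal PyTorch sketch of this part is given below, under stated assumptions: the feature extractor mirrors the three convolution stages with two convolutional layers and one max-pooling layer each, and the classification sub-network f is a single fully connected layer followed by softmax; the channel widths and other hyper-parameters are illustrative, not prescribed by the method.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """w_c * X: three convolution stages, each with two conv layers and one
    max-pooling layer (channel widths are illustrative assumptions)."""
    def __init__(self):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
        self.stages = nn.Sequential(stage(3, 64), stage(64, 128), stage(128, 256))

    def forward(self, x):
        return self.stages(x)        # feature map shared by sub-networks f and g

class ClassificationSubnet(nn.Module):
    """Classification sub-network f: one fully connected layer plus softmax."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        logits = self.fc(torch.flatten(feat, 1))
        return torch.softmax(logits, dim=1)   # P(X): per-class probabilities
```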
attention is advised that subnetwork g consists of two fully-connected layers, of which the last fully-connected layer has 3 output channels, each corresponding to tx,ty,tl
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
the process of cutting and extracting the attention area from the original image comprises the following steps:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
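The attention proposal sub-network g and the masking operation X_att = X ⊙ M(t_x, t_y, t_l) can be sketched as follows; the sigmoid-normalized outputs and the hard box mask are assumptions, since the text above only specifies two fully connected layers with three outputs and a mask.

```python
import torch
import torch.nn as nn

class AttentionProposal(nn.Module):
    """Attention proposal sub-network g: two fully connected layers, the last
    of which has 3 outputs (t_x, t_y, t_l); the hidden width and the sigmoid
    normalization to [0, 1] are assumptions."""
    def __init__(self, feat_dim, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, feat):
        return self.fc(torch.flatten(feat, 1))   # (B, 3): center and half side length

def crop_attention(x, t):
    """X_att = X ⊙ M(t_x, t_y, t_l) with a hard box mask; t has shape (B, 3)
    holding (t_x, t_y, t_l) in coordinates normalized to [0, 1]."""
    b, c, h, w = x.shape
    t_x, t_y, t_l = (t[:, i].view(b, 1, 1, 1) for i in range(3))
    ys = torch.arange(h, device=x.device).view(1, 1, h, 1) / h
    xs = torch.arange(w, device=x.device).view(1, 1, 1, w) / w
    mask = ((xs - t_x).abs() <= t_l) & ((ys - t_y).abs() <= t_l)
    return x * mask.float()
```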
step 2.2: training the whole network in an alternating mode, keeping parameters of the attention area suggestion sub-network unchanged, and scaling each attention area according to a bilinear interpolation method; each attention area suggested output area is enlarged to the same size pixels as the input image X;
the scaling method comprises the following steps:
first, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
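A sketch of the bilinear scaling follows: bilinear_pixel evaluates the three interpolation formulas above directly for one point of a single-channel image, while upscale_region shows how the cropped attention area would be enlarged to the input resolution in practice; the helper names and the 224-pixel target are assumptions.

```python
import torch
import torch.nn.functional as F

def bilinear_pixel(img, x, y):
    """Evaluate the interpolation formulas above for one point p = (x, y):
    horizontal interpolation to R1 and R2, then vertical interpolation to p.
    img is an (H, W) tensor; x and y are float coordinates."""
    x1, y1 = int(x), int(y)
    x2, y2 = min(x1 + 1, img.shape[1] - 1), min(y1 + 1, img.shape[0] - 1)
    if x2 == x1 or y2 == y1:                       # on the border: nothing to blend
        return img[y1, x1]
    q11, q21 = img[y1, x1], img[y1, x2]            # Q11 = (x1, y1), Q21 = (x2, y1)
    q12, q22 = img[y2, x1], img[y2, x2]            # Q12 = (x1, y2), Q22 = (x2, y2)
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21   # R1 = (x, y1)
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22   # R2 = (x, y2)
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2   # Q(p)

def upscale_region(x, t, out_size=224):
    """Crop the square attention area described by t = (t_x, t_y, t_l)
    (normalized center and half side length) and enlarge it to the input
    resolution with bilinear interpolation (out_size is an assumed value)."""
    b, c, h, w = x.shape
    out = []
    for i in range(b):
        cx, cy = float(t[i, 0]) * w, float(t[i, 1]) * h
        half = float(t[i, 2]) * min(h, w)
        x1 = max(min(int(cx - half), w - 1), 0)
        x2 = max(min(int(cx + half), w - 1), x1)
        y1 = max(min(int(cy - half), h - 1), 0)
        y2 = max(min(int(cy + half), h - 1), y1)
        region = x[i:i + 1, :, y1:y2 + 1, x1:x2 + 1]
        out.append(F.interpolate(region, size=(out_size, out_size),
                                 mode="bilinear", align_corners=True))
    return torch.cat(out, dim=0)
```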
step 2.3: optimizing the classification loss of each scale according to the zoomed attention area, wherein the learning of each scale comprises a classification sub-network and an attention area suggestion sub-network; the parameters of the convolutional layer and the classification layer are fixed and a loss function is used for optimizing the attention area suggestion sub-network;
the loss function l (x) is composed of two parts, classification loss and ranking loss, and is expressed according to the loss function l (x):
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin denotes the threshold.
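A sketch of how the combined classification and ranking losses could be computed over the cascaded scales is shown below, assuming each scale returns its softmax probability vector P(X); the margin value and the averaging over the batch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(probs_per_scale, labels, margin=0.05):
    """probs_per_scale: list of (B, num_classes) softmax outputs, one per scale.
    Classification loss: negative log-likelihood of the true class at every
    scale. Ranking loss: hinge term pushing the finer scale s+1 to assign the
    true class a higher probability than scale s."""
    cls_loss = sum(F.nll_loss(torch.log(p + 1e-8), labels) for p in probs_per_scale)
    rank_loss = 0.0
    for p_s, p_next in zip(probs_per_scale, probs_per_scale[1:]):
        po_s = p_s.gather(1, labels.unsqueeze(1)).squeeze(1)        # P_o^(s)
        po_next = p_next.gather(1, labels.unsqueeze(1)).squeeze(1)  # P_o^(s+1)
        rank_loss = rank_loss + torch.clamp(po_s - po_next + margin, min=0).mean()
    return cls_loss + rank_loss
```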
The beneficial effects of the above technical solution are as follows. With the intelligent museum exhibit guiding method based on image recognition and text fusion provided by the invention, a visitor photographs an exhibit and the category name of the exhibit is recognized accurately; keywords are then extracted from the exhibit name to collect relevant information, and an up-to-date introduction is returned to the visitor through automatic summarization and text fusion. The technical solution learns the characteristics of museum exhibits with a convolutional neural network and builds a neural network framework that returns, with high accuracy, the corresponding category name for an image taken by the visitor in real time; relevant information is then collected according to the recognition result, and the information is extracted and integrated, thereby realizing an intelligent guiding function for museum exhibits. Cyclic learning over different scales identifies fine-grained, similar image categories accurately; the subsequent online information collection and processing provide the latest introduction information; and adding the keyword information to the loss function during information processing makes the processing result more accurate.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of an image recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of summary generation according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an identification and interpretation result of an exhibit provided in the embodiment of the present invention, wherein a is a schematic diagram of a test interface; b is a picture uploading schematic diagram; c is a schematic diagram of an image recognition display interface; d is a detailed information interface schematic diagram of the text.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 1, the method of the present embodiment is as follows.
The invention provides an intelligent museum exhibit guiding method based on image recognition and text fusion, which comprises the following steps:
step 1: collecting images of the exhibits, adjusting all the images into a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set are images with correct classification labels;
step 2: establishing a recognition model based on a convolutional neural network structure; taking parameters of a feature extraction layer and a classification layer in the VGG network structure model as parameters in an initial identification model based on a convolutional neural network structure; training an initial recognition model based on a convolutional neural network structure by using the picture X in the exhibit picture set in the step 1 to obtain a loss function L (X), training and recognizing parameters in the initial model according to the loss function to obtain a recognition model based on the convolutional neural network structure, and obtaining a recognition result of the picture X; as shown in fig. 2, a schematic diagram of a three-stage cascade network, which may be composed of multiple stages, and in this embodiment, a 3-stage cascade network is adopted;
step 2.1: initializing a feature extraction layer and classification layer parameters in a classification sub-network by using a VGG (visual Geometry group) network structure model, inputting pictures in a training set, searching a response value of a full-connection layer to an input image, and selecting an area with the highest value as a square of a suggested attention area;
the classification subnetwork f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
attention suggests that subnetwork g consists of two fully connected layersThe number of the output channels of the last full connection layer is 3, and the output channels respectively correspond to tx,ty,tl
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
the process of cutting and extracting the attention area from the original image comprises the following steps:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
step 2.2: training the whole network in an alternating mode, keeping parameters of the attention area suggestion sub-network unchanged, and scaling each attention area according to a bilinear interpolation method; each attention area suggested output area is enlarged to the same size pixels as the input image X;
the scaling method comprises the following steps:
first, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
step 2.3: optimizing the classification loss of each scale according to the zoomed attention area, wherein the learning of each scale comprises a classification sub-network and an attention area suggestion sub-network; the parameters of the convolutional layer and the classification layer are fixed and a loss function is used for optimizing the attention area suggestion sub-network;
the loss function l (x) is composed of two parts, classification loss and ranking loss, and is expressed according to the loss function l (x):
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class; the upper limit of the scale s is variable;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin represents a threshold;
and step 3: crawling collection of relevant information is carried out according to the identification result obtained in the step 2 as a keyword; performing keyword segmentation on the recognition result according to a crawler technology, and collecting information in hundred degrees according to segmented keywords to obtain an information data set;
and 4, step 4: extracting the abstract T from the acquired information data set, as shown in FIG. 3; generating an abstract by using an extraction mode, removing redundant information in the information data set, and extracting the main content of the information data set; arranging the sentences in the information data set in step 3 into a sentence sequence A1、A2、…、AnAnd (4) forming a document D, wherein n is the total number of sentences, and m sentences in the document D are selected according to the probability to form a summary T.
The probability-based selection is as follows: the extractive summary generation model comprises a sentence encoder, a document encoder and a sentence extractor; the sentence encoder uses word2vec to obtain a 200-dimensional vector for each word and applies convolution and pooling operations to obtain the encoding vector of each sentence; the document encoder uses an LSTM network; the sentence extractor takes the keywords as extra information when scoring sentences, the final goal being that sentences containing the keywords obtain higher scores, i.e.:
[Formula images GDA0003488345400000071 to GDA0003488345400000073: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image GDA0003488345400000074 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary.
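Because the scoring formulas themselves appear only as images, the PyTorch sketch below shows one plausible reading of the extractor: a convolution-and-pooling sentence encoder over 200-dimensional word2vec vectors, an LSTM document encoder, and a single-hidden-layer scorer that mixes the LSTM state h_μ with a keyword summary h'_μ; the exact way the two are combined is an assumption.

```python
import torch
import torch.nn as nn

class KeywordAwareExtractor(nn.Module):
    """Sentence encoder (conv + max pooling over 200-d word2vec vectors),
    LSTM document encoder, and a keyword-aware scorer. The combination
    u_0 * tanh(W_e h_mu + W'_e h'_mu) below is an assumed reading of the
    formulas that are only given as images above."""
    def __init__(self, emb_dim=200, sent_dim=100, hidden=200):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, sent_dim, kernel_size=3, padding=1)
        self.doc_lstm = nn.LSTM(sent_dim, hidden, batch_first=True)
        self.W_e = nn.Linear(hidden, hidden, bias=False)
        self.W_e_prime = nn.Linear(emb_dim, hidden, bias=False)
        self.u_0 = nn.Linear(hidden, 1, bias=False)

    def encode_sentence(self, words):                 # words: (num_words, 200)
        x = words.t().unsqueeze(0)                    # -> (1, 200, num_words)
        return torch.relu(self.conv(x)).max(dim=2).values.squeeze(0)   # pooling

    def forward(self, sentences, keyword_vecs):
        # sentences: list of (num_words_i, 200) tensors; keyword_vecs: (b, 200)
        sent_mat = torch.stack([self.encode_sentence(s) for s in sentences])
        h, _ = self.doc_lstm(sent_mat.unsqueeze(0))   # h_mu for every sentence
        h_prime = self.W_e_prime(keyword_vecs.mean(dim=0))   # keyword summary h'_mu
        scores = self.u_0(torch.tanh(self.W_e(h.squeeze(0)) + h_prime)).squeeze(1)
        return torch.sigmoid(scores)                  # selection probability per sentence
```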
Step 5: performing information fusion on the summary T obtained in step 4;
The sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
Fig. 4 shows the recognition and interpretation results obtained with the method. The APP is opened first, entering the application interface shown in a; clicking the 'take picture' or 'upload picture' button uploads a picture, as shown in b; a brief textual description of the image is then returned, and the detailed information on the related textual topics can be viewed by clicking, as shown in c and d.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (2)

1. An intelligent museum exhibit guiding method based on image recognition and text fusion, characterized in that the method comprises the following steps:
Step 1: collecting images of the exhibits, adjusting all the images to a uniform size, and performing data enhancement processing on the images to obtain an exhibit image set, wherein the images in the exhibit image set carry correct classification labels;
Step 2: establishing a recognition model based on a convolutional neural network structure; taking the parameters of the feature extraction layers and the classification layer of a VGG network model as the initial parameters of the recognition model; training the initial recognition model with the pictures X in the exhibit image set of step 1 to obtain a loss function L(X), training the parameters of the initial model according to the loss function to obtain the recognition model based on the convolutional neural network structure, and obtaining the recognition result of picture X;
Step 3: crawling and collecting relevant information using the recognition result obtained in step 2 as a keyword; segmenting the recognition result into keywords, and using a web crawler to collect information from Baidu according to the segmented keywords to obtain an information data set;
Step 4: extracting a summary T from the collected information data set; generating the summary in an extractive manner, removing redundant information from the information data set and retaining its main content; arranging the sentences of the information data set of step 3 into a sentence sequence A_1, A_2, …, A_n to form a document D, wherein n is the total number of sentences, and selecting m sentences of document D according to probability to form the summary T;
wherein the probability-based selection is as follows: the extractive summary generation model comprises a sentence encoder, a document encoder and a sentence extractor; the sentence encoder uses word2vec to obtain a 200-dimensional vector for each word and applies convolution and pooling operations to obtain the encoding vector of each sentence; the document encoder uses an LSTM network; the sentence extractor takes the keywords as extra information when scoring sentences, the final goal being that sentences containing the keywords obtain higher scores, i.e.:
[Formula images FDA0003488345390000011 to FDA0003488345390000013: the sentence-scoring equations are given as images in the original and are not reproduced in the text.]
where u_0, W_e, W'_e are the parameters of a single-hidden-layer neural network; r denotes the probability that A_n is selected into the summary; the symbol shown in image FDA0003488345390000014 denotes sentence A_n at time μ; h_μ is the intermediate state of the LSTM network at time μ; h'_μ denotes the weighted sum of the keywords; b is the number of keywords; K is the probability that each sentence is selected into the summary T, and the larger it is, the more likely sentence A_n appears in T; c_i are the keywords: the more keywords a sentence contains, the larger h'_μ becomes, and the more likely the sentence is ultimately selected into the summary;
Step 5: performing information fusion on the summary T obtained in step 4;
wherein the sentences in the summary T are fused into a logically coherent passage: using a method that combines paragraph templates with description logic, the description logic is defined and matched against predefined paragraph description templates so that the sentences in T are fused into a complete and meaningful paragraph; the paragraph templates are organized either chronologically or by place and person.
2. The intelligent museum exhibit guiding method based on image recognition and text fusion as claimed in claim 1, characterized in that the specific steps of step 2 are as follows:
Step 2.1: initializing the parameters of the feature extraction layers and the classification layer of the classification sub-network with a VGG network model, inputting the pictures of the training set, examining the response values of the fully connected layer to the input image, and selecting the square area with the highest response as the proposed attention area;
The classification sub-network f consists of a fully connected layer and a softmax layer:
P(X) = f(w_c * X)
where X is the input image, P(X) is the output vector of the classification sub-network, each dimension of which gives the probability that the input image belongs to the class represented by that dimension, and w_c is the parameter matrix of the image feature extraction process; w_c * X comprises three convolution stages, each containing two convolutional layers and one max-pooling layer, and the feature maps obtained from feature extraction are fed into the classification sub-network f and the attention proposal sub-network g respectively;
The attention proposal sub-network g consists of two fully connected layers; the last fully connected layer has 3 output channels, corresponding to t_x, t_y, t_l:
[t_x, t_y, t_l] = g(w_c * X)
where t_x and t_y are the abscissa and ordinate of the center of the output attention area, and t_l is 1/2 of the side length of the output attention area;
The attention area is cropped from the original image as:
X_att = X ⊙ M(t_x, t_y, t_l)
where X_att is the attention area, X is the original input image, and M(t_x, t_y, t_l) is the mask generated from t_x, t_y, t_l;
Step 2.2: training the whole network in an alternating manner, keeping the parameters of the attention area proposal sub-network unchanged, and scaling each attention area by bilinear interpolation so that each proposed attention area is enlarged to the same pixel size as the input image X;
The scaling method is as follows:
First, horizontal interpolation is performed:
Q(R_1) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_11) + ((x - x_1)/(x_2 - x_1)) · Q(Q_21), where R_1 = (x, y_1);
Q(R_2) = ((x_2 - x)/(x_2 - x_1)) · Q(Q_12) + ((x - x_1)/(x_2 - x_1)) · Q(Q_22), where R_2 = (x, y_2);
Then, vertical interpolation is performed to obtain the pixel value of the point p:
Q(p) = ((y_2 - y)/(y_2 - y_1)) · Q(R_1) + ((y - y_1)/(y_2 - y_1)) · Q(R_2)
where p(x, y) means that the point p to be interpolated has coordinates (x, y); Q(p) is the pixel value of the point p to be interpolated; Q_11(x_1, y_1), Q_12(x_1, y_2), Q_21(x_2, y_1) and Q_22(x_2, y_2) are the four points surrounding p, with (x_1, y_1), (x_1, y_2), (x_2, y_1) and (x_2, y_2) being the coordinates of Q_11, Q_12, Q_21 and Q_22 respectively; R_1 is the intersection of the vertical line through p with the segment formed by Q_11 and Q_21, with coordinates (x, y_1); R_2 is the intersection of the vertical line through p with the segment formed by Q_12 and Q_22, with coordinates (x, y_2);
Step 2.3: optimizing the classification loss at each scale using the scaled attention area, wherein the learning at each scale involves a classification sub-network and an attention area proposal sub-network; the parameters of the convolutional layers and the classification layer are fixed and the loss function is used to optimize the attention area proposal sub-network;
The loss function L(X) is composed of two parts, a classification loss and a ranking loss:
L(X) = Σ_s { L_cls(Y^(s), Y*) + L_rank(P_o^(s), P_o^(s+1)) }
where Y* denotes the true class of the image, Y^(s) denotes the classification result at scale s, and P_o^(s) denotes the probability that the classification sub-network at scale s assigns to the true class;
The ranking loss L_rank(P_o^(s), P_o^(s+1)) is:
L_rank(P_o^(s), P_o^(s+1)) = max(0, P_o^(s) - P_o^(s+1) + margin)
where margin denotes the threshold.
CN201910333050.4A 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion Active CN110096986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910333050.4A CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910333050.4A CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Publications (2)

Publication Number Publication Date
CN110096986A CN110096986A (en) 2019-08-06
CN110096986B true CN110096986B (en) 2022-04-12

Family

ID=67445681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910333050.4A Active CN110096986B (en) 2019-04-24 2019-04-24 Intelligent museum exhibit guiding method based on image recognition and text fusion

Country Status (1)

Country Link
CN (1) CN110096986B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104238884A (en) * 2014-09-12 2014-12-24 北京诺亚星云科技有限责任公司 Dynamic information presentation and user interaction system and equipment based on digital panorama
CN207198847U (en) * 2017-06-27 2018-04-06 徐桐 A kind of system of the architecture information guide to visitors based on image comparison identification
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109145105A (en) * 2018-07-26 2019-01-04 福州大学 A kind of text snippet model generation algorithm of fuse information selection and semantic association

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018071403A1 (en) * 2016-10-10 2018-04-19 Insurance Services Office, Inc. Systems and methods for optical character recognition for low-resolution documents
US10679011B2 (en) * 2017-05-10 2020-06-09 Oracle International Corporation Enabling chatbots by detecting and supporting argumentation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104238884A (en) * 2014-09-12 2014-12-24 北京诺亚星云科技有限责任公司 Dynamic information presentation and user interaction system and equipment based on digital panorama
CN207198847U (en) * 2017-06-27 2018-04-06 徐桐 A kind of system of the architecture information guide to visitors based on image comparison identification
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109145105A (en) * 2018-07-26 2019-01-04 福州大学 A kind of text snippet model generation algorithm of fuse information selection and semantic association

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An Adaptive Hierarchical Compositional Model for Phrase Embedding; Bing Li et al.; ResearchGate; 2018-07-24; pp. 4144-4151 *
An Efficient Method for High Quality and Cohesive Topical Phrase Mining; Xiaochun Yang et al.; IEEE; 2019-01-31; pp. 120-137 *
Smart city guide based on mobile augmented reality; Zhang Yunchao et al.; Journal of Computer Research and Development; 2014-12-31; pp. 302-310 *

Also Published As

Publication number Publication date
CN110096986A (en) 2019-08-06

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN109635171B (en) Fusion reasoning system and method for news program intelligent tags
CN107766371B (en) Text information classification method and device
CN110363194B (en) NLP-based intelligent examination paper reading method, device, equipment and storage medium
CN112528963A (en) Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN
JP2022023770A (en) Method and device for recognizing letter, electronic apparatus, computer readable storage medium and computer program
CN109874053A (en) The short video recommendation method with user's dynamic interest is understood based on video content
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN113609305B (en) Method and system for constructing regional knowledge map of film and television works based on BERT
CN113297955B (en) Sign language word recognition method based on multi-mode hierarchical information fusion
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113064995A (en) Text multi-label classification method and system based on deep learning of images
CN111967267B (en) XLNET-based news text region extraction method and system
CN112541347A (en) Machine reading understanding method based on pre-training model
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN109492168A (en) A kind of visualization tour interest recommendation information generation method based on tourism photo
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN115115740A (en) Thinking guide graph recognition method, device, equipment, medium and program product
CN112148874A (en) Intention identification method and system capable of automatically adding potential intention of user
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110096986B (en) Intelligent museum exhibit guiding method based on image recognition and text fusion
CN107656760A (en) Data processing method and device, electronic equipment
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Murali et al. Remote sensing image captioning via multilevel attention-based visual question answering

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant