CN110378342B - Method and device for recognizing words based on convolutional neural network - Google Patents

Method and device for recognizing words based on convolutional neural network

Info

Publication number
CN110378342B
CN110378342B (application CN201910677804.8A)
Authority
CN
China
Prior art keywords
feature map
feature
word
semantic space
map
Prior art date
Legal status
Active
Application number
CN201910677804.8A
Other languages
Chinese (zh)
Other versions
CN110378342A (en)
Inventor
张韵东
黄发亮
刘小涛
Current Assignee
Vimicro Corp
Original Assignee
Vimicro Corp
Priority date
Filing date
Publication date
Application filed by Vimicro Corp
Priority to CN201910677804.8A
Publication of CN110378342A
Application granted
Publication of CN110378342B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for recognizing words based on a convolutional neural network. The method includes: performing feature extraction on an original image by using a convolutional neural network model to output a first feature map; slicing the first feature map in a height dimension to obtain a plurality of second feature maps; performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; slicing the third feature map in a width dimension to obtain a plurality of fourth feature maps; performing convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; mapping the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and solving an optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm, so that the convolutional neural network learns contextual spatial sequences and fully explores the semantic relationships of the context in the image.

Description

Method and device for recognizing words based on convolutional neural network
Technical Field
The invention relates to the technical field of convolutional neural networks, in particular to a method and a device for recognizing words based on a convolutional neural network.
Background
In existing methods for recognizing words of arbitrary length, spatial information is used in deep neural networks in two ways. One way uses a Long Short-Term Memory (LSTM) variant to explore contextual semantic information, but such networks are hard to train and computationally expensive. The other way uses a recurrent neural network (RNN) to pass information row by row and column by column, but each point on the feature map can only receive information from its nearest neighbours in the same row or column, so richer spatial hierarchies cannot be explored.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a method and a device for recognizing words based on a convolutional neural network, which can effectively overcome the drawbacks of the prior art when learning spatial semantic relationships, namely a large number of parameters, high time consumption, and difficult training, and which can explore richer spatial hierarchies so that sequence features are classified more accurately.
In a first aspect, an embodiment of the present invention provides a method for recognizing words based on a convolutional neural network, including: performing feature extraction on an original image by using a convolutional neural network model to output a first feature map; slicing the first feature map in a height dimension to obtain a plurality of second feature maps; performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; slicing the third feature map in a width dimension to obtain a plurality of fourth feature maps; performing convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; mapping the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and solving an optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm.
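A minimal sketch of how these seven steps could be composed, assuming a PyTorch implementation; the class, module, and variable names here are illustrative assumptions rather than names used by the embodiment.

```python
import torch
import torch.nn as nn

class WordRecognizer(nn.Module):
    """Illustrative skeleton of the seven steps; all module and variable names are assumptions."""

    def __init__(self, backbone, height_fusion, width_fusion, channels, num_classes):
        super().__init__()
        self.backbone = backbone            # convolutional feature extractor -> first feature map
        self.height_fusion = height_fusion  # top-to-bottom and bottom-to-top convolution and addition
        self.width_fusion = width_fusion    # left-to-right and right-to-left convolution and addition
        self.classifier = nn.Linear(channels, num_classes)  # full connection into the word classes

    def forward(self, image):                       # image: (B, 3, H0, W0)
        f1 = self.backbone(image)                   # first feature map, (B, C, H, W1)
        f3 = self.height_fusion(f1)                 # third feature map, fused along the height slices
        f5 = self.width_fusion(f3)                  # fifth feature map, fused along the width slices
        pooled = f5.mean(dim=2)                     # average pooling over height -> (B, C, W1)
        logits = self.classifier(pooled.permute(0, 2, 1))  # word semantic space feature map, (B, W1, N)
        return logits    # decoded downstream with a time sequence classification algorithm
```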
In an embodiment of the present invention, the first feature map has a size of C×H×W1, where C is the number of channels, H is the height, and W1 is the width. Slicing the first feature map in the height dimension to obtain a plurality of second feature maps includes: slicing the first feature map in the height dimension into H single-slice second feature maps, i.e., second feature map 1, second feature map 2, ..., second feature map H. Performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map includes: taking second feature map 1 as input, and performing convolution and addition operations on the H single-slice second feature maps from top to bottom to obtain new second feature maps 1 through H; and taking the new second feature map H as input, and performing convolution and addition operations on the new second feature maps 1 through H from bottom to top to obtain the third feature map.
In an embodiment of the present invention, after the fifth feature map is mapped into the word similarity probability space in an average pooling and full connection manner to obtain the first word semantic space feature map, the method further includes: performing Softmax calculation on the first word semantic space feature map to obtain a second word semantic space feature map. In this case, solving the optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm includes: solving an optimal word sequence corresponding to the second word semantic space feature map by using the time sequence classification algorithm.
In an embodiment of the present invention, the first feature map has a size of C×H×W1, where C is the number of channels, H is the height, and W1 is the width. Mapping the fifth feature map into the word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map includes: performing average pooling on the fifth feature map in the height dimension to obtain an average-pooled fifth feature map of size C×1×W1; and mapping the average-pooled fifth feature map into the word similarity probability space in a fully connected manner to obtain the first word semantic space feature map of size W2×N, where W2 is the width of the feature map output after the average-pooled fifth feature map is mapped into the word similarity probability space, and N is the number of word classes.
In one embodiment of the present invention, the time sequence classification algorithm includes a connectionist temporal classification algorithm or a framewise classification algorithm.
In an embodiment of the present invention, when the time sequence classification algorithm is a connectionist temporal classification algorithm, solving the optimal word sequence corresponding to the first word semantic space feature map by using the time sequence classification algorithm includes: performing guided training on the first word semantic space feature map by using the connectionist temporal classification algorithm; and decoding and solving the optimal word sequence corresponding to the first word semantic space feature map by using the best path in the connectionist temporal classification algorithm.
In one embodiment of the invention, the convolutional neural network model comprises an AlexNet model or a VGG model.
In an embodiment of the present invention, the VGG model includes a VGG11 model, a VGG13 model, a VGG16 model or a VGG19 model.
In a second aspect, an embodiment of the present invention provides an apparatus for recognizing words based on a convolutional neural network, including: an extraction module, configured to perform feature extraction on an original image by using a convolutional neural network model to output a first feature map; a first slicing module, configured to slice the first feature map in the height dimension to obtain a plurality of second feature maps; a first convolution and addition operation module, configured to perform convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; a second slicing module, configured to slice the third feature map in the width dimension to obtain a plurality of fourth feature maps; a second convolution and addition operation module, configured to perform convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; a first mapping module, configured to map the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and a solving module, configured to solve an optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm.
In a third aspect of the embodiments of the present invention, the embodiments of the present invention provide a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement a method for identifying words based on a convolutional neural network as in any one of the first aspects of the embodiments of the present invention.
According to the technical scheme provided by the embodiments of the present invention, a convolutional neural network model is used to perform feature extraction on an original image to output a first feature map; the first feature map is sliced in the height dimension to obtain a plurality of second feature maps; convolution and addition operations are performed on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; the third feature map is sliced in the width dimension to obtain a plurality of fourth feature maps; convolution and addition operations are performed on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; the fifth feature map is mapped into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and a time sequence classification algorithm is used to solve the optimal word sequence corresponding to the first word semantic space feature map. The convolutional neural network thus learns contextual spatial sequences and fully explores the semantic relationships of the contexts of the image rows and columns, so that sequence feature classification is more accurate.
Drawings
Fig. 1 is a flowchart of a method for recognizing words based on a convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for recognizing words based on a convolutional neural network according to another embodiment of the present invention.
Fig. 3 is a schematic flow chart of converting the first feature map into the fifth feature map according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an apparatus for recognizing words based on a convolutional neural network according to an embodiment of the present invention.
FIG. 5 is a block diagram of a system for recognizing words based on convolutional neural networks, in accordance with one embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some, rather than all, of the embodiments of the present invention.
It should be noted that all other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiment of the invention provides a method and a device for recognizing words based on a convolutional neural network, which are respectively described in detail below.
Fig. 1 is a flowchart of a method for recognizing words based on a convolutional neural network according to an embodiment of the present invention. The method may be executed by a server, a processor, or the like; the following description takes a server as the executing entity. As shown in fig. 1, the method includes the following steps.
S110: and adopting a convolutional neural network model to perform feature extraction on the original image so as to output a first feature map.
Specifically, the server receives an original image input by a user, and the server performs feature extraction on the original image by adopting a convolutional neural network model so as to output a first feature map.
The convolutional neural network model may be a Visual Geometry Group (VGG) model, or may be another model such as an AlexNet model or a GoogLeNet model; the embodiment of the present invention does not specifically limit this.
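As one possible realization of this step, assuming a torchvision VGG16 backbone (only its convolutional layers) and an example input size, the feature extraction could look like the following sketch; the sizes shown are assumptions, not requirements of the embodiment.

```python
import torch
from torchvision import models

# VGG16 is only one possible backbone allowed by the text; the input size below is an assumption.
backbone = models.vgg16().features                 # convolutional part of VGG16, no classifier head
image = torch.randn(1, 3, 128, 512)                # a resized word image (batch, channels, height, width)
first_feature_map = backbone(image)
print(first_feature_map.shape)                     # torch.Size([1, 512, 4, 16]) -> C=512, H=4, W1=16
```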
S120: the first feature map is sliced in a height dimension to obtain a plurality of second feature maps.
Specifically, the server slices the first feature map in a height dimension to slice the first feature map into a plurality of second feature maps.
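A minimal slicing sketch, assuming the first feature map is a PyTorch tensor of shape (batch, C, H, W1); unbinding along the height dimension yields the H single-slice second feature maps.

```python
import torch

first_feature_map = torch.randn(1, 512, 4, 16)                       # (batch, C, H, W1); sizes are examples
second_feature_maps = list(torch.unbind(first_feature_map, dim=2))   # H single-slice maps of shape (1, 512, 16)
print(len(second_feature_maps), second_feature_maps[0].shape)
```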
S130: and respectively carrying out convolution and addition operation on the plurality of second feature images from top to bottom and from bottom to top to obtain a third feature image.
Specifically, the server may perform convolution and addition operation on the multiple second feature maps from top to bottom to obtain multiple new second feature maps, and then perform convolution and addition operation on the multiple new second feature maps from bottom to top to obtain a third feature map; the server may perform convolution and addition operation on the multiple second feature maps from bottom to top to obtain multiple new second feature maps, and then perform convolution and addition operation on the multiple new second feature maps from top to bottom to obtain a third feature map.
The size of the convolution kernel in the convolution operation may be 2×2, 3×3, or another size.
The stride in the convolution operation may be 1, 2, or another value such as 3 or 4; the embodiment of the present invention does not specifically limit the stride used in the convolution operation.
In the convolution and addition operation, the second feature map i is convolved first, and the convolved second feature map i is then added to the second feature map i-1 or the second feature map i+1, thereby yielding a new second feature map i-1 or a new second feature map i+1.
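A sketch of one top-to-bottom pass of this convolution-and-addition operation, assuming a 1-D convolution over each height slice with kernel size 3, stride 1, and padding 1 so that slice widths stay aligned for the addition; the recurrence follows the description of fig. 3 given later.

```python
import torch
import torch.nn as nn

# Kernel size 3 with padding 1 and stride 1 is an assumption that keeps slice widths aligned for the addition.
conv = nn.Conv1d(512, 512, kernel_size=3, padding=1)

def fuse_top_to_bottom(slices):
    """Sequential convolution-and-addition over height slices: new_1 = slice_1, new_i = conv(new_{i-1}) + slice_i."""
    fused = [slices[0]]
    for s in slices[1:]:
        fused.append(conv(fused[-1]) + s)
    return fused

second_feature_maps = [torch.randn(1, 512, 16) for _ in range(4)]   # H = 4 single-slice maps (example)
new_second_feature_maps = fuse_top_to_bottom(second_feature_maps)
```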
S140: the third feature map is sliced in the width dimension to obtain a plurality of fourth feature maps.
Specifically, the server slices the third feature map in the width dimension, thereby slicing the third feature map into a plurality of fourth feature maps.
S150: and carrying out convolution and addition operation on the plurality of fourth feature images from left to right and from right to left to obtain a fifth feature image.
Specifically, the server may perform convolution and addition operation on the multiple fourth feature maps from left to right to obtain multiple new fourth feature maps, and then perform convolution and addition operation on the multiple new fourth feature maps from right to left to obtain a fifth feature map; the server may perform convolution and addition operation on the second feature maps from right to left to obtain a new fourth feature map, and then perform convolution and addition operation on the new fourth feature maps from left to right to obtain a fifth feature map.
S160: the second feature map is mapped into the word similarity probability space in an average pooling and full-connection mode to obtain a first word semantic space feature map.
Specifically, the server maps the second feature map to the word similarity probability space in an average pooling and full-connection mode to classify the second feature map, so that the first word semantic space feature map is obtained.
The word similarity probability space is the feature space used for word semantic classification. For example, if an image is divided into T feature segments along the width direction and the word similarity probability space contains N word classes, then the word similarity probability space is an N-dimensional vector space over those word classes, and the T feature segments are mapped to the values of T N-dimensional vectors in the word similarity probability space. The specific content of the word similarity probability space may be designed according to actual requirements or the content of a database, which is not specifically limited in the embodiment of the present invention.
The words may be Chinese characters, English words, or other types of words; the embodiment of the present invention does not specifically limit the type of the words.
The word semantic space feature map is a feature vector map of the word semantic space.
S170: and solving an optimal word sequence corresponding to the first word semantic space feature map by adopting a time sequence classification algorithm.
Specifically, the server solves the first word semantic space feature map by adopting a time sequence classification algorithm, so that an optimal word sequence corresponding to the first word semantic space feature map is obtained.
The server may solve the first word semantic space feature map by using a connectionist temporal classification (CTC) algorithm, may solve it by using a framewise classification algorithm, or may solve the optimal word sequence corresponding to the first word semantic space feature map in other ways.
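One common way to realize the best-path decoding mentioned here is a greedy collapse of the per-position argmax, as sketched below; treating index 0 as the blank class is an assumption.

```python
import torch

def best_path_decode(semantic_map, blank=0):
    """Greedy best-path decoding of a (T, N) word semantic space feature map (a common CTC-style sketch)."""
    best = semantic_map.argmax(dim=1).tolist()     # most likely class at each of the T positions
    sequence, previous = [], blank
    for label in best:
        if label != blank and label != previous:   # collapse repeated labels, drop blanks
            sequence.append(label)
        previous = label
    return sequence

print(best_path_decode(torch.randn(16, 1000)))     # indices into the N word classes (blank assumed to be 0)
```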
According to the technical scheme provided by the embodiment of the present invention, a convolutional neural network model is used to perform feature extraction on an original image to output a first feature map; the first feature map is sliced in the height dimension to obtain a plurality of second feature maps; convolution and addition operations are performed on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; the third feature map is sliced in the width dimension to obtain a plurality of fourth feature maps; convolution and addition operations are performed on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; the fifth feature map is mapped into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and a time sequence classification algorithm is used to solve the optimal word sequence corresponding to the first word semantic space feature map. The convolutional neural network thus learns contextual spatial sequences, fully explores the semantic relationships of the contexts of the image rows and columns at a richer spatial level, and classifies sequence features more accurately.
Fig. 2 is a flowchart of a method for recognizing words based on a convolutional neural network according to another embodiment of the present invention. Fig. 2 is a modification of the embodiment of fig. 1, specifically, step S111 in the embodiment of fig. 2 corresponds to step S110 in the embodiment of fig. 1, step S121 corresponds to step S120 in the embodiment of fig. 1, steps S131 to S132 correspond to step S130 in the embodiment of fig. 1, step S141 corresponds to step S140 in the embodiment of fig. 1, steps S151 to S152 correspond to step S150 in the embodiment of fig. 1, steps S161 to S162 correspond to step S160 in the embodiment of fig. 1, and step S171 corresponds to step S170 in the embodiment of fig. 1. As shown in fig. 2, the method includes the following steps.
S111: extracting features of the original image by adopting a VGG model to output a first feature map, wherein the size of the first feature map is C.H.W 1
C*H*W 1 The number of channels, the height and the width respectively corresponding to the first profile.
S121: the first feature map is sliced into H monolithic second feature maps, i.e., second feature map 1, second feature map 2, and second feature map 3, …, in the height dimension.
Specifically, the server slices the first feature map in the height dimension, thereby slicing the first feature map into second feature maps of H pieces in total of the second feature maps 1, 2, ….
S131: the second feature map 1 is taken as input, and convolution and addition operation are performed on the second feature maps of the H singlets from top to bottom to obtain a new second feature map 1, a new second feature map 2 and a new second feature map ….
S132: the new second feature map H is taken as an input, and the new second feature map H of the new second feature map 1, the new second feature map 2 and the new second feature map 3 and … is convolved and added from bottom to top to obtain a third feature map.
The execution sequence of S131 and S132 is not specifically limited in the embodiment of the present invention. Specifically, the server may perform convolution and addition operations on the H single-slice second feature maps from top to bottom to obtain H new single-slice second feature maps, and then perform convolution and addition operations on the H new single-slice second feature maps from bottom to top to obtain a third feature map; in this case the third feature map is formed by the serialized fusion of the H new single-slice second feature maps through the bottom-to-top convolution and addition operations. Alternatively, the server may perform convolution and addition operations on the H single-slice second feature maps from bottom to top to obtain H new single-slice second feature maps, and then perform convolution and addition operations on the H new single-slice second feature maps from top to bottom to obtain the third feature map; in this case the third feature map is formed by the serialized fusion of the H new single-slice second feature maps through the top-to-bottom convolution and addition operations.
When convolution and addition operations are performed on the H single-slice second feature maps from top to bottom, the information of each second feature map is passed downwards, so that the features of each row in the first feature map are related to the features of the rows below it; when convolution and addition operations are performed on the H single-slice second feature maps from bottom to top, the information of each second feature map is passed upwards, so that the features of each row in the first feature map are related to the features of the rows above it.
S141: slicing the third feature map into a fourth feature map 1, a fourth feature map 2, a fourth feature map 3, …, and a fourth feature map W in the width dimension 1 W altogether 1 Fourth feature map of individual slices.
Specifically, the server slices the third feature map in the width dimension to obtain a fourth feature mapCharacterization map 1, fourth characterization map 2, fourth characterization map 3, …, fourth characterization map W 1 W altogether 1 Fourth feature map of individual slices.
The server may perform the segmentation in the height dimension first, then perform the segmentation in the width dimension, or may perform the segmentation in the width dimension first, then perform the segmentation in the height dimension, which is not specifically limited in the embodiment of the present invention.
S151: taking the fourth feature map 1 as input, right to left 1 The fourth feature map of each single sheet is subjected to convolution and addition operation to obtain a new fourth feature map 1, a new fourth feature map 2 and a new fourth feature map 3 … 1
S152: the new fourth feature map H is taken as an input, and the new fourth feature map H of the new fourth feature map 1, the new fourth feature map 2, and the new fourth feature map 3 … is convolved and added from right to left to obtain a fifth feature map.
The execution sequence of S151 and S152 is not specifically limited in the embodiment of the present invention. Specifically, the server may first perform convolution and addition operations on the W1 single-slice fourth feature maps from left to right to obtain W1 new single-slice fourth feature maps, and then perform convolution and addition operations on the W1 new single-slice fourth feature maps from right to left to obtain a fifth feature map; in this case the fifth feature map is formed by the serialized fusion of the W1 new single-slice fourth feature maps through the right-to-left convolution and addition operations. Alternatively, the server may first perform convolution and addition operations on the W1 single-slice fourth feature maps from right to left to obtain W1 new single-slice fourth feature maps, and then perform convolution and addition operations on the W1 new single-slice fourth feature maps from left to right to obtain the fifth feature map; in this case the fifth feature map is formed by the serialized fusion of the W1 new single-slice fourth feature maps through the left-to-right convolution and addition operations.
When convolution and addition operations are performed on the W1 single-slice fourth feature maps from left to right, the information of each fourth feature map is passed to the right, so that the features of each column in the first feature map are related to the features of the columns to its right; when convolution and addition operations are performed on the fourth feature maps from right to left, the information of each fourth feature map is passed to the left, so that the features of each column in the first feature map are related to the features of the columns to its left.
The process of converting the first feature map into the fifth feature map in steps S121-S152 may be as shown in fig. 3, where the black solid dots represent an addition operation. For example, when convolution and addition operations are performed from top to bottom, second feature map 1 is taken as input and the new second feature map 1 equals second feature map 1; the new second feature map 2 is formed by convolving second feature map 1 and adding the convolved result to second feature map 2; the new second feature map 3 is formed by convolving the new second feature map 2 and adding the convolved result to second feature map 3; and so on.
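Putting the height-direction and width-direction passes together, a sketch of the whole S121-S152 conversion could look like the following; sharing one convolution per axis between the two sweep directions, and the kernel size, are assumptions rather than choices fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ContextSpatialFusion(nn.Module):
    """Sketch of the height-then-width sequential fusion of fig. 3; all names here are assumptions."""

    def __init__(self, channels):
        super().__init__()
        self.row_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.col_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    @staticmethod
    def _sweep(slices, conv):
        out = [slices[0]]                       # the first new slice equals the first input slice
        for s in slices[1:]:
            out.append(conv(out[-1]) + s)       # new_i = conv(new_{i-1}) + slice_i
        return out

    def forward(self, x):                       # x: (B, C, H, W1), the first feature map
        rows = list(torch.unbind(x, dim=2))                     # second feature maps (height slices)
        rows = self._sweep(rows, self.row_conv)                 # top to bottom
        rows = self._sweep(rows[::-1], self.row_conv)[::-1]     # bottom to top -> third feature map
        x = torch.stack(rows, dim=2)
        cols = list(torch.unbind(x, dim=3))                     # fourth feature maps (width slices)
        cols = self._sweep(cols, self.col_conv)                 # left to right
        cols = self._sweep(cols[::-1], self.col_conv)[::-1]     # right to left -> fifth feature map
        return torch.stack(cols, dim=3)

fifth_feature_map = ContextSpatialFusion(512)(torch.randn(1, 512, 4, 16))   # same shape as the input
```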
S161: carrying out average pooling on the fifth characteristic diagram in the height dimension to obtain an average pooled fifth characteristic diagram, wherein the size of the average pooled fifth characteristic diagram is C. 1*W 1
S162: mapping the fifth feature map after the average pooling to a word similarity probability space in a full-connection mode to obtain a first word semantic space feature map, wherein the size of the first word semantic space feature map is W 2 * N, wherein W 2 To map the fifth feature map after the averaging pooling to the width of the feature map output after the word similarity probability space, N is the number of categories of the word.
W2 may be equal to W1 or may differ from W1, with W2 ≤ W1; the specific value of W2 depends on the series of transformations applied to the first feature map, which is not specifically limited in the embodiment of the present invention.
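A sketch of steps S161-S162 under the simplifying assumption that W2 = W1; the sizes below are examples only.

```python
import torch
import torch.nn as nn

C, H, W1, N = 512, 4, 16, 1000                         # example sizes; N is the number of word classes
fifth_feature_map = torch.randn(1, C, H, W1)

pooled = fifth_feature_map.mean(dim=2)                 # average pooling over height -> (1, C, W1), i.e. C x 1 x W1
fc = nn.Linear(C, N)                                   # full connection into the word similarity probability space
first_semantic_map = fc(pooled.permute(0, 2, 1))       # (1, W2, N); this simple sketch keeps W2 = W1
print(first_semantic_map.shape)
```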
S180: and carrying out Softmax calculation on the first word semantic space feature map to obtain a second word semantic space feature map.
Specifically, the server performs Softmax computation on the first word semantic space feature map, thereby converting the first word semantic space feature map into a second word semantic space feature map.
In the embodiment of the present invention, Softmax calculation is performed on the first word semantic space feature map to map it into the (0, 1) probability space and obtain the second word semantic space feature map, which makes it easier to solve the optimal word sequence corresponding to the second word semantic space feature map.
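A one-line illustration of this Softmax step, assuming the first word semantic space feature map is a (batch, W2, N) tensor.

```python
import torch

first_semantic_map = torch.randn(1, 16, 1000)                       # (batch, W2, N), example sizes
second_semantic_map = torch.softmax(first_semantic_map, dim=-1)     # every row now lies in (0, 1) and sums to 1
print(second_semantic_map.sum(dim=-1))
```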
Step S180 may be performed between steps S162 and S171 (or between steps S160 and S170), or it may be omitted; the embodiment of the present invention does not specifically limit this.
S171: and solving an optimal word sequence corresponding to the second word semantic space feature map by adopting a time sequence classification algorithm.
According to the technical scheme provided by the embodiment of the present invention, a VGG model is used to perform feature extraction on the original image to output a first feature map, which deepens the network hierarchy while avoiding an excessive number of parameters during feature extraction. The first feature map is sliced in the height dimension into H single-slice second feature maps; second feature map 1 is taken as input, convolution and addition operations are performed on the H single-slice second feature maps from top to bottom to obtain new second feature maps 1 through H, the new second feature map H is then taken as input, and convolution and addition operations are performed on the new second feature maps 1 through H from bottom to top to obtain a third feature map. The third feature map is sliced in the width dimension into W1 single-slice fourth feature maps; fourth feature map 1 is taken as input, convolution and addition operations are performed on the W1 single-slice fourth feature maps from left to right to obtain new fourth feature maps 1 through W1, the new fourth feature map W1 is then taken as input, and convolution and addition operations are performed on the new fourth feature maps from right to left to obtain a fifth feature map, so that the resulting feature map fuses contextual semantic information and sequence features are classified more accurately. The fifth feature map is average-pooled in the height dimension to obtain an average-pooled fifth feature map, which reduces its dimension while retaining its important features. The average-pooled fifth feature map is mapped into the word similarity probability space in a fully connected manner to obtain a first word semantic space feature map of size W2×N, and Softmax calculation is performed on the first word semantic space feature map to obtain a second word semantic space feature map, thereby converting the fifth feature map into the second word semantic space feature map. Finally, a time sequence classification algorithm is used to solve the optimal word sequence corresponding to the second word semantic space feature map, thereby obtaining the optimal word sequence. The embodiment of the present invention thus enables the convolutional neural network to learn contextual spatial sequences in depth and to fully explore the semantic relationships of the contexts of the image rows and columns, so that sequence feature classification is more accurate.
In an embodiment of the present invention, the VGG model includes a VGG11 model, a VGG13 model, a VGG16 model or a VGG19 model.
The VGG model includes, but is not limited to, a VGG11 model, a VGG13 model, a VGG16 model, or a VGG19 model, and the type of the VGG model is not particularly limited in the embodiment of the invention.
In one embodiment of the present invention, the time sequence classification algorithm includes a connectionist temporal classification algorithm or a framewise classification algorithm applied to the word semantic space feature map.
The server may use connectionist temporal classification to solve the optimal word sequence corresponding to the first word semantic space feature map, may use framewise classification to do so, or may use other classification methods to solve the optimal word sequence corresponding to the word semantic space feature map.
In the embodiment of the present invention, connectionist temporal classification or framewise classification is used to solve the optimal word sequence corresponding to the word semantic space feature map, so that the first word semantic space feature map can be effectively aligned with the word classification sequence and the optimal word sequence corresponding to the first word semantic space feature map can be solved.
In an embodiment of the present invention, when the time sequence classification algorithm is a connectionist temporal classification algorithm, step S171 includes: performing guided training on the second word semantic space feature map by using the connectionist temporal classification algorithm; and decoding and solving the optimal word sequence corresponding to the second word semantic space feature map by using the best path in the connectionist temporal classification algorithm.
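An illustrative sketch of such guided training with PyTorch's CTC loss; the shapes, the blank index, and the label values are assumptions. At inference time, the best-path decoding sketched earlier can be applied to the log-probabilities.

```python
import torch
import torch.nn as nn

# PyTorch's CTC loss expects (T, batch, N) log-probabilities; blank index 0 is an assumption.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

logits = torch.randn(16, 1, 1000, requires_grad=True)     # W2 = 16 time steps, batch 1, N = 1000 classes
log_probs = logits.log_softmax(dim=-1)
targets = torch.tensor([[5, 42, 7]])                       # label indices of the ground-truth word sequence
input_lengths = torch.tensor([16])
target_lengths = torch.tensor([3])

loss = ctc(log_probs, targets, input_lengths, target_lengths)   # guidance signal for training
loss.backward()
```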
Fig. 4 is a schematic structural diagram of an apparatus 400 for recognizing words based on a convolutional neural network according to an embodiment of the present invention. The apparatus 400 includes an extraction module 410 for performing feature extraction on an original image using a convolutional neural network model to output a first feature map; a first slicing module 420, configured to slice the first feature map in a height dimension to obtain a plurality of second feature maps; the first convolution and addition operation module 430 is configured to perform convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top, respectively, so as to obtain a third feature map; a second slicing module 440, configured to slice the third feature map in a width dimension to obtain a plurality of fourth feature maps; a second convolution and addition operation module 450, configured to perform convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left, respectively, so as to obtain a fifth feature map; a first mapping module 460, configured to map the fifth feature map to a word similarity probability space in an average pooling and full-join manner to obtain a first word semantic space feature map; and the solving module 470 is configured to solve the optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm.
In an embodiment of the present invention, the first feature map has a size of C×H×W1, where C is the number of channels, H is the height, and W1 is the width. The first slicing module 420 is further configured to slice the first feature map in the height dimension into H single-slice second feature maps, i.e., second feature map 1, second feature map 2, ..., second feature map H; the first convolution and addition operation module 430 is further configured to perform convolution and addition operations on the H single-slice second feature maps from top to bottom to obtain new second feature maps 1 through H, and to perform convolution and addition operations on the new second feature maps 1 through H from bottom to top to obtain a third feature map.
In an embodiment of the present invention, the apparatus 400 further includes a Softmax calculation module 480, configured to perform Softmax calculation on the first word semantic space feature map to obtain a second word semantic space feature map, and the solving module 470 is further configured to solve the optimal word sequence corresponding to the second word semantic space feature map by using a time sequence classification algorithm.
In an embodiment of the present invention, the first feature map has a size of C×H×W1, where C is the number of channels, H is the height, and W1 is the width. The first mapping module 460 includes: an average pooling module 461, configured to perform average pooling on the fifth feature map in the height dimension to obtain an average-pooled fifth feature map of size C×1×W1; and a second mapping module 462, configured to map the average-pooled fifth feature map into the word similarity probability space in a fully connected manner to obtain a first word semantic space feature map of size W2×N, where W2 is the width of the feature map output after the average-pooled fifth feature map is mapped into the word similarity probability space, and N is the number of word classes.
In one embodiment of the present invention, the time sequence classification algorithm includes connectionist temporal classification or framewise classification.
In an embodiment of the present invention, when the time sequence classification algorithm is a connectionist temporal classification algorithm, the solving module 470 is further configured to perform guided training on the first word semantic space feature map by using the connectionist temporal classification algorithm, and to decode and solve the optimal word sequence corresponding to the first word semantic space feature map by using the best path in the connectionist temporal classification algorithm.
In one embodiment of the invention, the convolutional neural network model comprises an AlexNet model or a VGG model.
In an embodiment of the present invention, the VGG model includes a VGG11 model, a VGG13 model, a VGG16 model or a VGG19 model.
According to the technical scheme provided by the embodiment of the present invention, the extraction module 410 is configured to perform feature extraction on an original image by using a convolutional neural network model to output a first feature map; the first slicing module 420 is configured to slice the first feature map in the height dimension to obtain a plurality of second feature maps; the first convolution and addition operation module 430 is configured to perform convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; the second slicing module 440 is configured to slice the third feature map in the width dimension to obtain a plurality of fourth feature maps; the second convolution and addition operation module 450 is configured to perform convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; the first mapping module 460 is configured to map the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and the solving module 470 is configured to solve the optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm, so that the convolutional neural network learns contextual spatial sequences in depth and fully explores the semantic relationships of the contexts of the image rows and columns, making sequence feature classification more accurate.
FIG. 5 is a block diagram of a system 500 for recognizing words based on convolutional neural networks, in accordance with an embodiment of the present invention.
Referring to fig. 5, system 500 includes a processing component 510 that further includes one or more processors and memory resources represented by memory 520 for storing instructions, such as applications, executable by processing component 510. The application program stored in memory 520 may include one or more modules each corresponding to a set of instructions. Further, the processing component 510 is configured to execute instructions to perform the above-described method of recognizing words based on convolutional neural networks.
The system 500 may also include a power component configured to perform power management of the system 500, a wired or wireless network interface configured to connect the system 500 to a network, and an input/output (I/O) interface. The system 500 may operate based on an operating system stored in the memory 520, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Also provided is a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of the system 500, the system 500 is caused to perform the method for recognizing words based on a convolutional neural network, the method including: performing feature extraction on an original image by using a convolutional neural network model to output a first feature map; slicing the first feature map in a height dimension to obtain a plurality of second feature maps; performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map; slicing the third feature map in a width dimension to obtain a plurality of fourth feature maps; performing convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map; mapping the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map; and solving an optimal word sequence corresponding to the first word semantic space feature map by using a time sequence classification algorithm.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that for convenience and brevity of description, reference may be made to corresponding processes in the foregoing method embodiments for specific working procedures of the apparatus, device and unit described above, and no further description will be made here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus, system, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing words based on convolutional neural network, comprising:
performing feature extraction on the original image by adopting a convolutional neural network model to output a first feature map;
slicing the first feature map in a height dimension to obtain a plurality of second feature maps;
performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map;
slicing the third feature map in a width dimension to obtain a plurality of fourth feature maps;
performing convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map;
mapping the fifth feature map into a word similarity probability space in an average pooling and full connection mode to obtain a first word semantic space feature map;
and solving an optimal word sequence corresponding to the first word semantic space feature map by adopting a time sequence classification algorithm.
2. The method of claim 1, wherein the first feature map has a size of C×H×W1, C is the number of channels, H is the height, and W1 is the width, and the slicing the first feature map in a height dimension to obtain a plurality of second feature maps comprises:
slicing the first feature map in the height dimension into H single-slice second feature maps, i.e., second feature map 1, second feature map 2, ..., second feature map H,
wherein the performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map comprises:
taking second feature map 1 as input, and performing convolution and addition operations on the H single-slice second feature maps from top to bottom to obtain new second feature maps 1 through H;
and taking the new second feature map H as input, and performing convolution and addition operations on the new second feature maps 1 through H from bottom to top to obtain the third feature map.
3. The method of claim 1, wherein after the fifth feature map is mapped into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map, the method further comprises:
performing Softmax calculation on the first word semantic space feature map to obtain a second word semantic space feature map,
the method for solving the optimal word sequence corresponding to the first word semantic space feature map by adopting a time sequence classification algorithm comprises the following steps:
and solving an optimal word sequence corresponding to the second word semantic space feature map by adopting a time sequence classification algorithm.
4. The method of claim 1, wherein the first feature map has a size of C×H×W1, C is the number of channels, H is the height, and W1 is the width, and the mapping the fifth feature map into a word similarity probability space in an average pooling and full connection manner to obtain a first word semantic space feature map comprises:
performing average pooling on the fifth feature map in the height dimension to obtain an average-pooled fifth feature map, the average-pooled fifth feature map having a size of C×1×W1;
and mapping the average-pooled fifth feature map into the word similarity probability space in a fully connected manner to obtain the first word semantic space feature map, the first word semantic space feature map having a size of W2×N, wherein W2 is the width of the feature map output after the average-pooled fifth feature map is mapped into the word similarity probability space, and N is the number of word classes.
5. The method of any of claims 1-4, wherein the time sequence classification algorithm comprises a connectionist temporal classification algorithm or a framewise classification algorithm.
6. The method of claim 5, wherein, when the time sequence classification algorithm is a connectionist temporal classification algorithm, the solving the optimal word sequence corresponding to the first word semantic space feature map by using the time sequence classification algorithm comprises:
performing guided training on the first word semantic space feature map by using the connectionist temporal classification algorithm;
and decoding and solving the optimal word sequence corresponding to the first word semantic space feature map by using the best path in the connectionist temporal classification algorithm.
7. The method of any one of claims 1-4, wherein the convolutional neural network model comprises an AlexNet model or a VGG model.
8. The method of claim 7, wherein the VGG model comprises a VGG11 model, a VGG13 model, a VGG16 model, or a VGG19 model.
9. An apparatus for recognizing words based on convolutional neural network, comprising:
the extraction module is used for carrying out feature extraction on the original image by adopting the convolutional neural network model so as to output a first feature map;
a first slicing module, configured to slice the first feature map in a height dimension to obtain a plurality of second feature maps;
the first convolution and addition operation module is used for performing convolution and addition operations on the plurality of second feature maps from top to bottom and from bottom to top respectively to obtain a third feature map;
a second slicing module, configured to slice the third feature map in a width dimension to obtain a plurality of fourth feature maps;
the second convolution and addition operation module is used for performing convolution and addition operations on the plurality of fourth feature maps from left to right and from right to left respectively to obtain a fifth feature map;
the mapping module is used for mapping the fifth feature map to a word similarity probability space in an average pooling and full-connection mode to obtain a first word semantic space feature map;
and the solving module is used for solving the optimal word sequence corresponding to the first word semantic space feature map by adopting a time sequence classification algorithm.
10. A computer readable storage medium having stored thereon computer executable instructions which when executed by a processor implement the method of identifying words based on a convolutional neural network as claimed in any one of claims 1 to 8.
CN201910677804.8A 2019-07-25 2019-07-25 Method and device for recognizing words based on convolutional neural network Active CN110378342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910677804.8A CN110378342B (en) 2019-07-25 2019-07-25 Method and device for recognizing words based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910677804.8A CN110378342B (en) 2019-07-25 2019-07-25 Method and device for recognizing words based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN110378342A CN110378342A (en) 2019-10-25
CN110378342B true CN110378342B (en) 2023-04-28

Family

ID=68256128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910677804.8A Active CN110378342B (en) 2019-07-25 2019-07-25 Method and device for recognizing words based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN110378342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128564B (en) * 2021-03-23 2022-03-22 武汉泰沃滋信息技术有限公司 Typical target detection method and system based on deep learning under complex background

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871843A (en) * 2017-12-01 2019-06-11 北京搜狗科技发展有限公司 Character identifying method and device, the device for character recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477654B2 (en) * 2014-04-01 2016-10-25 Microsoft Corporation Convolutional latent semantic models and their applications
US11645835B2 (en) * 2017-08-30 2023-05-09 Board Of Regents, The University Of Texas System Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
RU2691214C1 (en) * 2017-12-13 2019-06-11 Общество с ограниченной ответственностью "Аби Продакшн" Text recognition using artificial intelligence

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871843A (en) * 2017-12-01 2019-06-11 北京搜狗科技发展有限公司 Character identifying method and device, the device for character recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Xiaohui; Wan Rui; Zhang Haibin; Zeng Yifu; Liu Qiao. Knowledge graph representation learning algorithm based on symbol semantic mapping. Journal of Computer Research and Development, 2018, (08), full text. *

Also Published As

Publication number Publication date
CN110378342A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
US11455525B2 (en) Method and apparatus of open set recognition and a computer readable storage medium
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
WO2019100724A1 (en) Method and device for training multi-label classification model
TWI766855B (en) A character recognition method and device
EP3248143B1 (en) Reducing computational resources utilized for training an image-based classifier
US10360703B2 (en) Automatic data extraction from a digital image
CN109189767B (en) Data processing method and device, electronic equipment and storage medium
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN112199462A (en) Cross-modal data processing method and device, storage medium and electronic device
Saito et al. Robust active learning for the diagnosis of parasites
CN108734212B (en) Method for determining classification result and related device
CN108334805B (en) Method and device for detecting document reading sequence
CN108509407B (en) Text semantic similarity calculation method and device and user terminal
CN113032580B (en) Associated file recommendation method and system and electronic equipment
US11093800B2 (en) Method and device for identifying object and computer readable storage medium
JP6107531B2 (en) Feature extraction program and information processing apparatus
CN113536771B (en) Element information extraction method, device, equipment and medium based on text recognition
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
US20240136023A1 (en) Data processing method and apparatus, device, and storage medium
CN108520263B (en) Panoramic image identification method and system and computer storage medium
CN110378342B (en) Method and device for recognizing words based on convolutional neural network
CN113850811B (en) Three-dimensional point cloud instance segmentation method based on multi-scale clustering and mask scoring
CN112749576A (en) Image recognition method and device, computing equipment and computer storage medium
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN116958626A (en) Image classification model training, image classification method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant