WO2022022695A1 - Image recognition method and apparatus - Google Patents

Image recognition method and apparatus

Info

Publication number
WO2022022695A1
WO2022022695A1 PCT/CN2021/109680 CN2021109680W WO2022022695A1 WO 2022022695 A1 WO2022022695 A1 WO 2022022695A1 CN 2021109680 W CN2021109680 W CN 2021109680W WO 2022022695 A1 WO2022022695 A1 WO 2022022695A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature map
map
image
size
Prior art date
Application number
PCT/CN2021/109680
Other languages
English (en)
French (fr)
Inventor
车慧敏
李志刚
杨雨
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022022695A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of neural networks, and in particular, to an image recognition method and apparatus.
  • With the development of technology, some preschool education robots capable of reading picture books aloud (referred to as picture book robots) have appeared on the market.
  • The picture book robot needs to accurately identify the picture book before reading it. Specifically, the robot first collects an image of a certain page of the picture book through its camera, then performs local feature detection on the image, matches the detection result against the picture book image templates pre-stored in a database, and takes the template image with the highest matching degree as the image to be read. Subsequently, the picture book robot reads that image aloud.
  • The above method for identifying picture books places strict requirements on the placement of the picture book.
  • For example, the picture book is required to be spread out on the same horizontal plane as the picture book robot, and the distance and angle between the picture book and the picture book robot are also required to meet certain requirements.
  • The picture book robot is also required to be placed standing upright, and so on.
  • the embodiments of the present application provide an image recognition method and apparatus, which help to improve the image recognition accuracy.
  • In a first aspect, an image recognition method is provided, including: first, acquiring an image to be recognized; then, using a first neural network to perform feature extraction on the image to be recognized to obtain a first feature map; next, using a second neural network to perform feature extraction on the first feature map to obtain a second feature map, and dot-multiplying the second feature map with the first feature map to obtain a third feature map, where the third feature map represents a feature map obtained by transforming the features of the image to be recognized to the main direction; then, obtaining a first score map of the image to be recognized based on the third feature map; and finally, recognizing the image to be recognized based on the third feature map and the first score map.
  • In this way, a second neural network is used to perform feature extraction on the first feature map to obtain a second feature map, and the second feature map is dot-multiplied with the first feature map to obtain a third feature map; this helps to construct a network with rotation invariance, and performing image recognition based on such a network helps to improve the accuracy of image recognition.
  • using the first neural network to perform feature extraction on the image to be recognized to obtain the first feature map includes: using the first neural network to perform at least one layer of convolution operations on the image to be recognized to obtain the first feature map.
  • using the second neural network to perform feature extraction on the first feature map to obtain the second feature map includes: using the second neural network to perform at least one layer of convolution operations on the first feature map to obtain the second feature map. In this possible design, feature extraction is performed on the first feature map through at least one layer of convolution operations to obtain the second feature map, and the operation is simple.
  • using the second neural network to perform feature extraction on the first feature map to obtain the second feature map includes: using the second neural network to perform at least one layer of convolution operations on the first feature map; At least one layer of pooling operation and/or full connection operation is performed on the first feature map after the convolution operation to obtain the second feature map.
  • feature extraction is performed on the first feature map through at least one layer of convolution operation and at least one layer of pooling operation and/or full connection operation, which helps to achieve more complex feature extraction, thereby helping to make The result of feature extraction is more accurate, which in turn helps to improve the accuracy of image recognition.
  • the size of the third feature map is M1*N1*P1 and the size of the first score map is M1*N1, where P1 is the size of the feature-direction dimension, M1*N1 is the size of the dimensions perpendicular to the feature direction, and M1, N1 and P1 are all positive integers.
  • alternatively, the size of the third feature map is M2*N2*P2, the size of the first score map is M1*N1, P2 is the size of the feature-direction dimension, and M1, N1, P1, M2, N2 and P2 are all positive integers.
  • Identifying the image to be recognized based on the third feature map and the first score map includes: performing feature extraction on the third feature map to obtain a fourth feature map; wherein, the size of the fourth feature map is M1*N1*P1; P1 is the size of the feature direction dimension, and P1 is a positive integer; based on the fourth feature map and the first score map, the image to be recognized is recognized; wherein, the size of the first score map is M1*N1.
  • using the first score map and the feature map obtained after feature extraction from the third feature map to identify the image to be recognized helps to change the size of the feature map.
  • obtaining the first score map of the image to be recognized based on the third feature map includes: using a 1-channel convolution kernel to perform a convolution operation on the third feature map to obtain X fifth feature maps, where the size of the feature direction of a fifth feature map is smaller than the size of the feature direction of the third feature map and X is an integer greater than 2; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain the first score map.
  • In this process of obtaining the score map, only the feature-direction dimension of the third feature map is compressed, so the implementation is simple.
  • obtaining the first score map of the image to be recognized based on the third feature map may also include: performing feature extraction on the third feature map to obtain a seventh feature map, where the dimension of the third feature map perpendicular to the feature direction is greater than the dimension of the seventh feature map perpendicular to the feature direction; using a 1-channel convolution kernel to perform a convolution operation on the seventh feature map to obtain X fifth feature maps, where X is an integer greater than 2; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain the first score map.
  • In this case, both the size of the feature direction and the size perpendicular to the feature direction of the third feature map are compressed, which helps to reduce the complexity of the image processing process, thereby improving the processing efficiency of the image recognition process.
  • the size of the image to be recognized is larger than the size of the first score map. Since the size of the first score map (assumed to be a*b) used in the image recognition process represents the number of features in the feature map used in the process, in this possible design, if the features of the image to be recognized are dense features, the feature map corresponding to the first score map contains sparse features; using sparse features for image recognition helps reduce the complexity of the image processing process, thereby improving the processing efficiency of the image recognition process.
  • the present application provides an image recognition device.
  • the image recognition apparatus is used to execute any one of the methods provided in the first aspect above.
  • the present application may divide the image recognition device into functional modules according to any of the methods provided in the first aspect.
  • each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the present application may divide the image recognition device into an acquisition unit, a feature extraction unit, a recognition unit, and the like according to functions.
  • the image recognition device includes: a memory and one or more processors, the memory and the processor being coupled.
  • the memory is used for storing computer instructions
  • the processor is used for invoking the computer instructions to perform any one of the methods provided by the first aspect and any possible design manners thereof.
  • the present application provides a computer-readable storage medium, such as a non-transitory computer-readable storage medium.
  • a computer program (or instructions) is stored thereon, and when the computer program (or instructions) runs on the image recognition device, the image recognition device is made to execute any one of the methods provided by the possible implementations of the first aspect above.
  • the present application provides a computer program product that, when run on a computer, enables any one of the methods provided by any one of the possible implementations of the first aspect to be executed.
  • the present application provides a chip system, including a processor, where the processor is configured to call, from a memory, and run a computer program stored in the memory, so as to execute any one of the methods provided in the implementations of the first aspect.
  • FIG. 1 is a schematic diagram of a hardware structure of a computer device applicable to an embodiment of the present application
  • FIG. 2a is a schematic diagram of a deep learning network model provided by an embodiment of the application.
  • FIG. 2b is a schematic diagram of another deep learning network model provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of a logical structure of a first neural network provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a logical structure of a second neural network provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of each dimension of a feature map according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a method for acquiring training data provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a reference image applicable to an embodiment of the present application and a sample image obtained after performing homography transformation on the reference image;
  • FIG. 8 is a schematic diagram of a relationship between reference data and training data provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a connection relationship between a front-end network, an adversarial network, and a twin network provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a method for training a front-end network provided by an embodiment of the present application
  • FIG. 11 is a schematic diagram of a logical structure of an adversarial network provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a logical structure of an extraction network provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a logical structure of a representation network according to an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of an image recognition apparatus provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a chip system provided by an embodiment of the present application.
  • FIG. 18 is a conceptual partial view of a computer program product provided by an embodiment of the present application.
  • Feature: that is, image features, which can include color features, texture features, shape features, local feature points, and so on.
  • Global features refer to the overall attributes of an image. Common global features include color features, texture features, and shape features. The global feature is to use all the features of an image to represent the image, such features have a lot of redundant information. Local features refer to local properties of an image. Local features use the local feature points of an image to represent the image. Each local feature point only contains the information of the image block in which it is located, and does not perceive the global information of the image.
  • Feature points (i.e., local feature points): in image processing, multiple images of the same object or scene are collected from different angles; if the same parts of the object or scene can be identified consistently across these images, those parts are said to be scale invariant.
  • A pixel point or a pixel block (i.e., a block composed of multiple pixel points) with scale invariance is a feature point. In one example, if a pixel in the image is an extreme point (such as a maximum or minimum point) in its neighborhood, the pixel is determined to be a feature point.
  • Image patch: a local square area in the image, such as a 4*4 pixel or 8*8 pixel image area, where a*a pixel represents a square area whose width and height are each a pixels, and a is an integer greater than or equal to 1.
  • Homography transformation: the mapping between two images can be expressed as Y = H*X, where H is a 3*3 matrix (also called a homography matrix), X is the position coordinate of a pixel point in the source image, and Y is the position coordinate of the corresponding pixel point on the mapped target image.
  • A picture book can be regarded as a plane, and its corresponding subset of geometric transformations is the homography transformation (described by the homography matrix). If one image is transformed by a homography to obtain another image, the two images are considered to have a homography transformation relationship.
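  • As a concrete illustration of the relationship Y = H*X described above (this code is not part of the patent; the helper name and matrix values are hypothetical), a minimal Python sketch that maps pixel coordinates through a 3*3 homography in homogeneous coordinates:

```python
import numpy as np

def apply_homography(H, points):
    """Map N source pixel coordinates (x, y) to target coordinates via Y = H * X.

    H: 3x3 homography matrix; points: array-like of shape (N, 2).
    """
    pts = np.asarray(points, dtype=np.float64)
    ones = np.ones((pts.shape[0], 1))
    homo = np.hstack([pts, ones])            # (N, 3) homogeneous coordinates
    mapped = homo @ H.T                      # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]    # divide by the scale component

# Example homography (arbitrary values, for illustration only).
H = np.array([[1.0, 0.1, 5.0],
              [0.0, 1.2, 3.0],
              [0.0, 0.001, 1.0]])
print(apply_homography(H, [[10, 20], [100, 50]]))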
  • Histogram of oriented gradients: a histogram is a statistical chart represented by a series of vertical bars or line segments of varying heights, in which the horizontal axis generally represents the data type and the vertical axis represents the distribution.
  • the gradient direction histogram is a statistical value used to calculate the direction information of the local image gradient.
  • Main direction: in an image/image block, a gradient direction histogram is built by calculating the gradient directions between adjacent pixels (that is, the unit vectors of the vector differences between adjacent pixels); the gradient direction at which the peak of the histogram is located is the main direction of the image/image block.
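  • To make the notion of a main direction concrete, the following is a minimal sketch, not taken from the patent, that builds a gradient orientation histogram for an image block and takes the peak bin as the main direction; the bin count and the gradient computation are assumptions:

```python
import numpy as np

def main_direction(patch, num_bins=36):
    """Estimate the dominant gradient orientation of a grayscale image block."""
    gy, gx = np.gradient(patch.astype(np.float64))   # per-pixel gradients
    angles = np.arctan2(gy, gx)                      # orientation in [-pi, pi]
    magnitudes = np.hypot(gx, gy)                    # gradient magnitude
    hist, edges = np.histogram(angles, bins=num_bins,
                               range=(-np.pi, np.pi), weights=magnitudes)
    peak = np.argmax(hist)                           # bin with the largest mass
    return 0.5 * (edges[peak] + edges[peak + 1])     # center of the peak bin

patch = np.random.rand(16, 16)       # stand-in for a real image block
print(main_direction(patch))
```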
  • CNN Convolutional neural network
  • Maximum pooling (max pooling): the most direct purpose of a pooling layer is to reduce the amount of data to be processed by the next layer. For each filter window, max pooling keeps only the largest of the extracted feature values as the retained value and discards all other feature values; keeping the largest value means that only the strongest of these features is retained and the weaker ones are discarded.
  • Rotational invariance In physics, if the properties of a physical system are independent of its orientation in space, the system is rotationally invariant. In image processing, if the features extracted by the feature extractor hardly change when the image is rotated at any angle in the plane, the feature extractor is said to have rotation invariance.
  • the feature extractor may be a picture book robot, or a functional module in a picture book robot, such as a neural network.
  • Loss function: the loss function is used to estimate the inconsistency between the predicted value f(x) of the model and the true value Y. It is a non-negative real-valued function, usually denoted L(Y, f(x)); the smaller the loss function, the better the robustness of the model. The goal of an optimization problem is to minimize the loss function.
  • An objective function is usually the loss function itself or its negative value. When the objective function is the negative of the loss function, its value is to be maximized.
  • Sparse features and dense features: in local feature detection, suppose the position index of each pixel in the image is recorded and each index may correspond to a feature; sparse features then refer to an index set in which most of the indexes are empty, that is, most indexes have no corresponding features.
  • the dense feature means that most of the indexes are not empty, that is, most of the indexes have their corresponding feature descriptions.
  • the local feature detection algorithm includes two parts: “extraction” and “representation”.
  • extraction is to determine whether each pixel (or image block) in the image is a feature point.
  • Representation means that all detected feature points are represented as feature values of the same dimension according to their neighborhoods. By calculating the distance between the feature values of two feature points, it can be judged whether the two feature points are similar; the similarity of two images can then be judged according to the number or ratio of similar feature points in the two images. Therefore, the evaluation criterion of a local feature detection algorithm is the matching accuracy of successfully matched feature points for two images with the same or similar regions.
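  • The matching criterion described above (distances between feature values, similarity judged by the ratio of matched feature points) might be sketched as follows; the Euclidean distance and the threshold value are assumptions, not the patent's prescription:

```python
import numpy as np

def match_ratio(desc_a, desc_b, threshold=0.8):
    """Fraction of descriptors in desc_a whose nearest neighbour in desc_b
    is closer than `threshold` (Euclidean distance)."""
    # Pairwise distances between the two descriptor sets.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nearest = dists.min(axis=1)                 # best match for each point in A
    return float((nearest < threshold).mean())  # ratio of successfully matched points

a = np.random.rand(50, 128)   # 50 feature points with 128-dim descriptors
b = np.random.rand(60, 128)
print(match_ratio(a, b))
```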
  • a high homography transformation scene refers to a scene in which the feature representation before and after transformation is very different (that is, the difference between the determined feature points before and after transformation is very large), such as a picture book recognition scene.
  • words such as “exemplary” or “for example” are used to represent examples, illustrations or illustrations. Any embodiments or designs described in the embodiments of the present application as “exemplary” or “such as” should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present the related concepts in a specific manner.
  • first and second are only used for description purposes, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
  • a feature defined as “first”, “second” may expressly or implicitly include one or more of that feature.
  • plural means two or more.
  • the meaning of the term “at least one” refers to one or more, and the meaning of the term “plurality” in this application refers to two or more.
  • a plurality of second messages refers to two or more second messages.
  • system and “network” are often used interchangeably herein.
  • the size of the sequence number of each process does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • determining B according to A does not mean that B is only determined according to A, and B may also be determined according to A and/or other information.
  • the term “if” may be interpreted to mean “when” or “upon” or “in response to determining” or “in response to detecting.”
  • the phrases “if it is determined" or “if a [statement or event] is detected” can be interpreted to mean “when determining" or “in response to determining... ” or “on detection of [recited condition or event]” or “in response to detection of [recited condition or event]”.
  • references throughout the specification to "one embodiment," "an embodiment," and "one possible implementation" mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of the present application.
  • appearances of "in one embodiment," "in an embodiment," or "one possible implementation" in various places throughout this specification are not necessarily referring to the same embodiment.
  • the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
  • The first: local detection algorithms based on hand-crafted features, that is, the extraction and representation of local feature points are rule-based. For example, when judging extreme points, the pixel value of each pixel must be compared one by one with the pixel values of its neighboring pixels; when judging the main direction, a gradient direction histogram must be constructed one by one; and during feature representation, complex steps such as normalization and orientation correction are required. Each of these steps requires fixed parameters to be set experimentally.
  • The second: local detection algorithms based on deep learning, that is, the input of the neural network is an image, and the outputs are a score map in which each pixel (or pixel block) in the image is given the probability (a value between 0 and 1) that it can be marked as a local feature point, and a feature map giving the feature value corresponding to each pixel (or pixel block).
  • This method is a non-end-to-end method.
  • the feature extraction in this method still relies on manual feature extraction, so the above problems also exist.
  • In addition, the neural network is usually a convolutional neural network, which has rotation invariance only to a certain extent and does not rotate and normalize the feature points as the first method does; therefore, in high homography transformation scenes, the feature representations before and after transformation differ greatly, resulting in very low matching accuracy.
  • the embodiments of the present application provide a neural network model training method and an image recognition method, which are applied to high homography transformation scenarios (eg, picture book recognition scenarios).
  • a neural network with rotation invariance is trained based on multiple images, more precisely, a neural network with a higher degree of rotation invariance is trained than the convolutional neural network in the prior art.
  • the plurality of images include images having a homography transformation relationship.
  • In the image recognition stage, the image is recognized based on the rotation-invariant neural network. In this way, compared with the prior art, the difference between the feature representations before and after transformation is made smaller, thereby improving the matching accuracy.
  • the neural network model training method and the image recognition method provided by the embodiments of the present application may be applied to the same or different computer devices respectively.
  • the neural network model training method can be executed by a computer device such as a server or a terminal.
  • the image recognition method can be executed by a terminal (such as a picture book robot, etc.). This embodiment of the present application does not limit this.
  • FIG. 1 it is a schematic diagram of the hardware structure of a computer device 10 applicable to the embodiments of the present application.
  • a computer device 10 includes a processor 101 , a memory 102 , an input-output device 103 , and a bus 104 .
  • the processor 101 , the memory 102 and the input/output device 103 may be connected through a bus 104 .
  • the processor 101 is the control center of the computer device 10, and may be a general-purpose central processing unit (central processing unit, CPU), or other general-purpose processors, or the like.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • processor 101 may include one or more CPUs, such as CPU 0 and CPU 1 shown in FIG. 1 .
  • the memory 102 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 102 may exist independently of the processor 101.
  • the memory 102 may be connected to the processor 101 through a bus 104 for storing data, instructions or program codes.
  • the processor 101 calls and executes the instructions or program codes stored in the memory 102, it can implement the neural network model training method and/or the image recognition method provided by the embodiments of the present application.
  • the memory 102 may also be integrated with the processor 101 .
  • the input and output device 103 is used to input parameter information such as sample images and images to be recognized, so that the processor 101 executes the instructions in the memory 102 according to the input parameter information to execute the neural network model training method provided by the embodiment of the present application, and / or image recognition method.
  • the input and output device 103 may be an operation panel or a touch screen, or any other device capable of inputting parameter information, which is not limited in this embodiment of the present application.
  • the bus 104 can be an industry standard architecture (industry standard architecture, ISA) bus, a peripheral component interconnect (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 1, but it does not mean that there is only one bus or one type of bus.
  • FIG. 1 does not constitute a limitation on the computer device 10.
  • the computer device 10 may include more or less components than those shown, or Combining certain components, or different component arrangements.
  • the model used in the embodiments of the present application is a deep learning network model (or a neural network model, hereinafter referred to as a network model).
  • 2a and 2b are schematic diagrams of two deep learning network models provided in this embodiment of the present application.
  • the network model shown in FIG. 2 a includes: a front-end network 41 and a presentation network 42 .
  • the network model shown in FIG. 2 b includes a front-end network 41 , a representation network 42 and an extraction network 43 .
  • the input of the front-end network 41 is an image, and the output is the third feature map of the image.
  • the third feature map represents a feature map obtained by transforming the features (such as texture features, etc.) of the image input to the front-end network 41 into the main direction.
  • the input to the front-end network 41 is a sample image.
  • the input of the front-end network 41 is the image to be recognized.
  • the front-end network 41 may include a first neural network 411 and a second neural network 412 .
  • the first neural network 411 is used to perform feature extraction on the input image (ie, the input image), for example, at least one layer of convolution operation is performed on the input image to obtain a first feature map.
  • the first feature map may be a three-dimensional tensor, and an element in the tensor corresponds to a region in the input image, which may also be called a receptive field of a convolutional neural network.
  • FIG. 3 it is a schematic diagram of a logical structure of a first neural network 411 provided in an embodiment of the present application.
  • the size of the input image of the first neural network 411 is H*W*3, and the size of the output first feature map is H/4*W/4*64.
  • the first neural network 411 includes 4 convolutional layers (labeled conv1-1, conv1-2, conv2-1, conv2-2, respectively).
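  • A minimal PyTorch sketch with the input/output shapes of FIG. 3 (four convolutional layers, H*W*3 in, H/4*W/4*64 out) follows; the kernel sizes, intermediate channel widths, activations, and the use of stride-2 convolutions for downsampling are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Four convolutional layers reducing H*W*3 to H/4*W/4*64 (layer details assumed)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),              # conv1-1
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # conv1-2, /2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),             # conv2-1
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # conv2-2, /2
        )

    def forward(self, x):
        return self.layers(x)

x = torch.randn(1, 3, 128, 128)       # H = W = 128
print(FirstNetwork()(x).shape)        # torch.Size([1, 64, 32, 32]) = H/4 * W/4 * 64
```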
  • the second neural network 412 is used to correct the first feature map to obtain a third feature map.
  • the second neural network 412 is configured to perform feature extraction on the first feature map to obtain a second feature map, and perform point multiplication between the second feature map and the first feature map to obtain a third feature map.
  • the second neural network 412 is specifically configured to perform at least one layer of convolution operations on the first feature map to obtain the second feature map.
  • the second neural network 412 is specifically configured to perform at least one layer of convolution operation on the first feature map, and then perform at least one layer of pooling operation and/or on the first feature map after performing the convolution operation. Or the full connection operation to obtain the second feature map.
  • FIG. 4 is a schematic diagram of a logical structure of a second neural network 412 provided in an embodiment of the present application.
  • FIG. 4 is drawn based on FIG. 3 .
  • the size of the first feature map input by the second neural network 412 is H/4*W/4*64
  • the size of the output third feature map is H/4*W/4*64.
  • the second neural network 412 shown in FIG. 4 includes 2 convolutional layers, 1 fully connected layer and one dot product layer. This implementation manner can achieve more complex feature extraction, thereby helping to make the feature extraction result more accurate, and further helping to improve the accuracy of image recognition.
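  • One possible reading of the FIG. 4 structure is sketched below: two convolutional layers followed by a fully connected layer produce a second feature map that is dot-multiplied (element-wise) with the first feature map. Applying the fully connected layer per spatial position and the layer widths are assumptions, not details given by the patent:

```python
import torch
import torch.nn as nn

class SecondNetwork(nn.Module):
    """Produces a correction map that is dot-multiplied with the first feature map
    (channel widths and the fully connected reshaping are assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(channels, channels)   # fully connected over the channel dim

    def forward(self, first_feature_map):
        second = self.conv(first_feature_map)
        # Apply the fully connected layer at every spatial position.
        second = self.fc(second.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        # Dot product (element-wise multiplication) with the first feature map.
        return first_feature_map * second

x = torch.randn(1, 64, 32, 32)                 # first feature map (H/4 * W/4 * 64)
print(SecondNetwork()(x).shape)                # same size: torch.Size([1, 64, 32, 32])
```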
  • the first score map is the score map of the image input to the front-end network.
  • the size of the third feature map is M1*N1*P1, and the size of the first score map is M1*N1, where P1 is the size of the feature-direction dimension, M1*N1 is the size of the dimensions perpendicular to the feature direction, and M1, N1 and P1 are all positive integers.
  • FIG. 5 it is a schematic diagram of each dimension of a feature map according to an embodiment of the present application.
  • a feature map with a size of H/4*W/4*64 is used as an example for description.
  • the size of the feature-direction dimension of the feature map is 64, and the size of the dimensions perpendicular to the feature direction is H/4*W/4.
  • the description of each dimension of other feature maps is similar to this, and details are not repeated here.
  • the third feature map output by the second neural network 412 is used as the feature map used in the image recognition process.
  • the first score map and the third feature map are used to identify the image to be recognized.
  • alternatively, the size of the third feature map is M2*N2*P2, the size of the first score map is M1*N1, P2 is the size of the feature-direction dimension, and M1, N1, M2, N2 and P2 are all positive integers.
  • the extraction network 43 is used to perform feature extraction on the third feature map to obtain a fourth feature map.
  • the size of the fourth feature map is M1*N1*P1, where P1 is the size of the feature-direction dimension and P1 is a positive integer. That is to say, the feature extraction here further reduces the dimension of the feature map perpendicular to the feature direction, which helps to reduce the computational complexity of the image recognition process when the fourth feature map is subsequently used for image recognition, thereby improving recognition efficiency.
  • the first score map and the fourth feature map obtained based on the to-be-recognized image are used to recognize the to-be-recognized image.
  • the specific examples below are all described by taking the network model shown in FIG. 2b as an example; this is stated here once and will not be repeated below.
  • the training stage includes the stage of acquiring training data and the stage of model training, which are described below:
  • the execution body of the method may be a computer device, and the method may include the following steps:
  • S101 Obtain a reference image set, where the reference image set includes multiple reference images; then, obtain a score map of each reference image in the multiple reference images.
  • the reference image set can be an existing data set, for example, the HPatches data set, and specifically can be a three-dimensional reconstruction data set or the like.
  • enhancement is performed based on the existing dataset to obtain sample images suitable for high homography transformation.
  • the sample images in the case of high homography transformation include: images with a homography transformation relationship.
  • the score map of an image can be represented by a matrix.
  • the value of the element in the i-th row and the j-th column in the matrix represents the probability that the pixel point (or pixel block) in the i-th row and the j-th column in the image is a feature point.
  • i and j are both integers greater than or equal to 0.
  • the score map of a reference image in the reference image set may be the score map of the corresponding image in an existing dataset, such as the HPatches dataset; in this way, the score map of the image in the prior art is used directly instead of being obtained by calculation, which helps to reduce the computational complexity.
  • S102 Use multiple reference images (eg, each reference image) as sample images respectively, and use score maps of the multiple reference images as score maps of the corresponding sample images, respectively.
  • homography transformations are respectively performed on multiple reference images (e.g., each reference image) to obtain multiple sample images.
  • Performing homography transformation on the reference image specifically includes: multiplying the reference image by a transformation matrix to obtain a sample image.
  • the homography transformation matrix can be predefined or randomly generated.
  • the transformation matrix has a one-to-one correspondence with the sample images obtained based on the reference image.
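  • A possible sketch of the homography-based sample generation of S102 is given below, assuming OpenCV is available; the corner-perturbation scheme used to generate the random homography matrix is an assumption, not the patent's prescription:

```python
import numpy as np
import cv2

def random_homography_sample(reference, max_shift=0.15, seed=None):
    """Warp a reference image with a randomly generated homography matrix H."""
    rng = np.random.default_rng(seed)
    h, w = reference.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Perturb the four corners to define the transformation.
    jitter = rng.uniform(-max_shift, max_shift, src.shape) * [w, h]
    dst = (src + jitter).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)           # 3x3 transformation matrix
    sample = cv2.warpPerspective(reference, H, (w, h))  # the sample image
    return sample, H

reference = np.zeros((240, 320, 3), dtype=np.uint8)     # stand-in for a reference image
sample, H = random_homography_sample(reference, seed=0)
```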
  • FIG. 7 it is a schematic diagram of a reference image applicable to the embodiment of the present application and a sample image obtained by performing homography transformation on the reference image.
  • H in FIG. 7 represents the transformation matrix used in the homography transformation.
  • For example, mark the reference image as D, mark the transformation matrix used when performing the homography transformation on the reference image as H, mark the pixel point d ij in the reference image (that is, the pixel in the i-th row and j-th column of the reference image, i and j both being integers), and mark its score as s ij .
  • Multiplying the pixel point d ij in the reference image by the homography transformation matrix H yields the corresponding pixel point in the sample image; the score of that pixel point is the value to be estimated.
  • an embodiment of the present application provides a method for estimating a score in an image obtained after transformation, which may specifically include the following steps:
  • Step A: expand s ij into the matrix [s ij , 1, 1].
  • Step B: the scores of multiple image blocks in the reference image before deformation and the homography transformation matrix H are used as the input, the scores of the image blocks obtained after the deformation are used as the output, and the transformation matrix T of the score map is obtained by fitting through the least squares method.
  • Because an image block in the reference image and the corresponding image block in the sample image physically represent the same object, there is a matching correspondence between the two image blocks.
  • Step C: based on the score map transformation matrix T, obtain the estimated scores. Specifically, the pixel point P' on the sample image is obtained by transforming the pixel point P on the reference image, and the pixel point Q' on the sample image is obtained by transforming the pixel point Q on the reference image, where n is the number of coincident points.
  • If the pixel point P' on the sample image is obtained by transforming the pixel point P on the reference image, and there is a pixel point Q in the reference image that lies in the neighborhood of the pixel point P and serves as an estimation point for the fitting, then the estimated score satisfies the corresponding fitting formula, where n is the number of neighbors, that is, the number of pixel points in the neighborhood.
  • the neighborhood of the pixel point P can be predefined.
  • the embodiments of the present application do not limit the size and position of the neighborhood of the pixel point P.
  • In this way, each score in the feature point score map is jointly constrained by the points in its neighborhood, which increases the receptive field in the constraint and, at the same time, compensates for sample distortion introduced during data enhancement, reducing the chance involved in feature point selection.
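  • Since the fitting formulas above are only partially legible, the following is a rough sketch of the least-squares idea of Steps A and B under assumed shapes (the expansion [s ij , 1, 1] follows the text; everything else is illustrative):

```python
import numpy as np

def fit_score_transform(scores_before, scores_after):
    """Fit a linear transform T such that [s, 1, 1] @ T approximates the transformed
    score, in the least-squares sense (a loose reading of Steps A and B above)."""
    s = np.asarray(scores_before, dtype=np.float64).reshape(-1, 1)
    A = np.hstack([s, np.ones_like(s), np.ones_like(s)])   # expand s_ij to [s_ij, 1, 1]
    b = np.asarray(scores_after, dtype=np.float64).reshape(-1)
    T, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
    return T

scores_before = np.random.rand(100)                 # scores of reference image blocks
scores_after = 0.9 * scores_before + 0.05           # stand-in for transformed scores
print(fit_score_transform(scores_before, scores_after))
```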
  • the training data includes: sample images in the sample image set and a score map for each sample image.
  • the sample image set includes a reference image and an image obtained by performing homography transformation on the reference image.
  • FIG. 8 it is a schematic diagram of a relationship between reference data and training data according to an embodiment of the present application.
  • the reference data includes a reference image set and a score map of each reference image in the reference image set.
  • FIG. 8 illustrates that the reference image set includes reference image 1 and reference image 2 .
  • the training data includes a sample image set and a score map of each sample image in the sample image set.
  • Figure 8 illustrates that the sample image set includes: sample image 10 (ie, reference image 1), sample image 11 (ie, reference image 1 multiplied by image obtained after transformation matrix 11), sample image 12 (ie, the image obtained by multiplying reference image 1 by transformation matrix 12), sample image 20 (ie reference image 2), and sample image 21 (ie reference image 2 multiplied by transformation matrix 21) and so on.
  • the value range of the score is [0, 1].
  • Construct a triple tri (D i , D j , D k ), where D i , D j , D k are image blocks, (D i , D j ) are matched pairs of similar image blocks, (D i , D k ) are matched pairs of dissimilar image patches.
  • dissimilar image matching blocks are randomly selected from the same image or different images at the same scale.
  • the size of the image can be adjusted to three sizes: (H*2)*(W*2), H*W, and (H/2)*(W/2).
  • the score map corresponding to the (H*2)*(W*2) size image is obtained by interpolation, and the score map corresponding to the (H/2)*(W/2) size image is obtained by downsampling (max pooling) to get.
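  • To illustrate how the score maps for the three sizes might be derived (the interpolation mode is an assumption), a small sketch:

```python
import torch
import torch.nn.functional as F

score = torch.rand(1, 1, 64, 64)        # score map for the H*W image (stand-in values)

# (H*2)*(W*2): score map obtained by interpolation.
score_up = F.interpolate(score, scale_factor=2, mode="bilinear", align_corners=False)

# (H/2)*(W/2): score map obtained by downsampling (max pooling).
score_down = F.max_pool2d(score, kernel_size=2)

print(score_up.shape, score_down.shape)  # (1, 1, 128, 128) and (1, 1, 32, 32)
```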
  • the training data is based on the annotation information of the natural scene, and the score map is estimated, which makes the estimated samples more inclined to the real scene, thereby helping to improve the accuracy of image recognition.
  • the computer equipment can first train the front-end network 41, and then train the representation network 42 respectively.
  • the computer equipment can first train the front-end network 41, and then train the representation network 42 and the extraction network 43 respectively.
  • the training of the representation network 42 and the training of the extraction network 43 may be performed in parallel, and the training sequence between the two may be in no particular order.
  • the process of training each network can be regarded as the process of obtaining the actual value of the parameters of the network (such as the value of each element in the convolution kernel, etc.).
  • the actual value here refers to the value of the parameters used by the network when applied to the image recognition stage.
  • the operation layers included in the first neural network 411 and the second neural network 412 in the preceding network 41 respectively, the size of the input of the operation layer, the size of the parameters of the operation layer, the size of the output of the operation layer, and the association between the operation layers relation.
  • the operation layer may include one or more of a convolution layer, a pooling layer, a fully connected layer, or a point product layer, and the like.
  • the parameters of the operation layer include the parameters used when performing the operation of the layer.
  • the parameters of the convolution layer include the number of layers of the convolution layer and the size of the convolution kernel used by each convolution layer.
  • the relationship between the operation layers can also be referred to as the connection relationship between the operation layers, for example, the output of which operation layer is used as the input of which operation layer.
  • the input of the first operation layer in the preceding network 41 is the input of the preceding network 41
  • the output of the last operation layer of the preceding network 41 is the output of the preceding network 41 .
  • the input to the preceding network 41 is an image.
  • the size of the input to the front-end network 41 is denoted as H*W*3.
  • H represents the height of the input image
  • W represents the width of the input image
  • 3 represents the number of channels.
  • the values of H and W can be predefined.
  • the output of the preceding network 41 is the third feature map.
  • the third feature map refers to a feature map obtained by rotating the feature of the image input to the front-end network 41 to the main direction.
  • the input size, parameter size and output size of different operation layers are adapted.
  • the "fit” here refers to the size that satisfies the operational relationship between matrices/tensors in mathematics. For example, if matrix A and matrix B satisfy the principle of dot product, the number of columns of matrix A is equal to the number of rows of matrix B. Other examples are not listed one by one.
  • the computer device can configure initial values for each parameter in the front-end network 41 (for example, the parameters of each operation layer in the front-end network 41); for example, the convolution kernel used in each convolution layer has an initial value.
  • the embodiment of the present application does not limit the initial value of each parameter, for example, it may be randomly generated.
  • the basic principle of training the front-end network 41 is: based on the images in the sample image set and the initial values of the parameters in the front-end network 41, and under the constraints of the adversarial network 44 and the twin network 45 of the front-end network 41, training is performed so that the third feature map output by the front-end network 41 is a feature map obtained by transforming the features of the input image to the main direction.
  • the parameters of the front-end network 41 used to achieve this purpose are used as the training results.
  • the connection relationship between the front-end network 41, the adversarial network 44 and the twin network 45 can be as shown in FIG. 9 .
  • the results of the training process are used as the values (or actual values) of the parameters of the front-end network in the image recognition process using the front-end network 41 .
  • the execution body of the method may be a computer device. As shown in Figure 10, the method may include the following steps:
  • S201 Input any image in the sample image set as an input image into the first neural network 411, and the first neural network 411 performs feature extraction on the input image to obtain a first feature map of the input image.
  • the first neural network 411 uses the initial values of the parameters of the first neural network to perform feature extraction on the input image to obtain a first feature map of the input image.
  • the first neural network performs a convolution operation with a preset number of layers on the input image to obtain a first feature map of the input image.
  • the first neural network 411 performs a 4-layer convolution operation on the input image to obtain the first feature map of the input image.
  • the first neural network 411 may also perform other operations on the input image to obtain the first feature map, which is not limited in this embodiment of the present application.
  • S202 Input the first feature map of the input image into the second neural network 412 to perform feature extraction on the first feature map to obtain a third feature map.
  • the third feature map can be understood as a feature map obtained by processing the second neural network 412 and converting the features of the input image (such as texture features, etc.) to the main direction.
  • the second neural network 412 sequentially performs a convolution operation and a full connection operation on the first feature map of the input image, and performs a dot product operation on the result of the full connection operation with the first feature map of the input image to obtain a third feature map.
  • When S202 is executed, the second neural network 412 sequentially performs a 2-layer convolution operation and a 1-layer fully connected operation on the first feature map obtained in FIG. 3, and dot-multiplies the result of the fully connected operation with the first feature map to obtain the third feature map. For example, based on FIG. 4, after the second neural network 412 sequentially performs the 2-layer convolution operation and the 1-layer fully connected operation on the first feature map obtained in FIG. 3, h*w matrices of size 2*2 can be obtained; the 2*2*(h*w) kernel is used as the direction matrix of the main direction of the corresponding channel's features, and the features rotated to the main direction are obtained by means of dot multiplication.
  • the second neural network 412 can perform backpropagation during the training process of the preceding network 41 .
  • the adversarial network 44 and the twin network 45 of the front-end network 41 are used to constrain the training of the front-end network 41, so as to obtain the actual values of the parameters of the front-end network 41.
  • the working principle of the adversarial network 44 is described below through step S203, and the working principle of the twin network 45 is described through step S204.
  • the second neural network 412 may be referred to as a Local Spatial Transform Network (LSTN).
  • The design of the LSTN, under generative adversarial learning, enables local areas to be corrected to their main directions, so that the network can converge when trained on samples with high homography transformation.
  • S203: the third feature map is used as the input of the adversarial network 44, and the adversarial network 44 performs a deconvolution operation on the third feature map to obtain a fifth feature map.
  • the size of the fifth feature map is the same as the size of the input image of the preceding network 41, for example, both are H*W*3.
  • the adversarial network 44 then divides the fifth feature map into multiple data blocks.
  • the adversarial network 44 performs a two-layer deconvolution operation on the third feature map to obtain a fifth feature map.
  • the adversarial network 44 divides the fifth feature map into multiple data blocks, which may include: the adversarial network 44 equally divides the fifth feature map into multiple data blocks.
  • This embodiment of the present application does not limit the size of each data block.
  • FIG. 11 it is a schematic diagram of a logical structure of an adversarial network 44 according to an embodiment of the present application.
  • FIG. 11 is drawn based on FIG. 4 .
  • the adversarial network 44 performs two layers of deconvolution operations on the third feature map, obtaining a feature map of size H/2*W/2*32 and a feature map of size H*W*3 (i.e., the fifth feature map), respectively. Then, the elements of the feature map of size H*W*3 are equally divided into 16*16 data blocks.
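  • A sketch of this upsampling path under stated assumptions (the deconvolution kernel sizes and the exact block splitting are not given in the text): two transposed convolutions restore the H*W*3 resolution, and the result is split into non-overlapping 16*16 blocks:

```python
import torch
import torch.nn as nn

class AdversarialUpsampler(nn.Module):
    """Two deconvolution (transposed convolution) layers: H/4*W/4*64 -> H/2*W/2*32 -> H*W*3."""
    def __init__(self):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, third_feature_map):
        x = torch.relu(self.deconv1(third_feature_map))
        return self.deconv2(x)                       # the fifth feature map, H*W*3

third = torch.randn(1, 64, 32, 32)                   # H = W = 128
fifth = AdversarialUpsampler()(third)                # shape (1, 3, 128, 128)
# Split into non-overlapping 16*16 data blocks: (num_blocks, 3, 16, 16).
blocks = fifth.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 3, 8, 8, 16, 16)
blocks = blocks.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, 16, 16)
print(blocks.shape)                                  # torch.Size([64, 3, 16, 16])
```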
  • S204 Input the multiple data blocks generated by the adversarial network 44 into the siamese network 45.
  • the Siamese network 45 is constrained by the loss function to determine whether the third feature map is a feature map rotated to the main direction.
  • the basic idea of the Siamese network 45 is to minimize the feature distance between matched pairs of similar data blocks, while maximizing the feature distance of pairs of dissimilar data blocks.
  • If the judgment result is that the third feature map is a feature map rotated to the main direction, the training process for the front-end network 41 ends. Subsequently, the values of the parameters used when performing S201 and S202 this time may be used as the values of the parameters of the front-end network in the recognition stage.
  • Otherwise, relevant information can be fed back to the front-end network 41 to assist in adjusting the values of the parameters of the front-end network 41.
  • Then S201 is re-executed, and the cycle is repeated until, after S204 has been executed one or more times, the judgment result is that the third feature map is the feature map rotated to the main direction.
  • the embodiments of the present application do not limit the specific implementation of the adversarial network 44 and the twin network 45 to assist in adjusting the front-end network 41 .
  • Whether the third feature map is correct is constrained by constructing the loss function of the triple tri.
  • The loss function of the triplet is shown in Equation 1 below. The idea is to minimize the feature distance of matched pairs of similar image patches while maximizing the feature distance of matched pairs of dissimilar image patches, where M is a bias value used to ensure model convergence.
  • the loss function of the triplet is used in the adversarial network of feature representation and main direction at the same time, which enlarges the distribution of dissimilar feature points, so that the subsequent matching can obtain the nearest neighbor features more accurately.
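  • Equation 1 itself is not reproduced in the text; the standard triplet (margin) loss it appears to describe is sketched below, with the margin playing the role of the bias value M:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Minimize the distance of the similar pair (anchor, positive) while
    maximizing the distance of the dissimilar pair (anchor, negative);
    `margin` plays the role of the bias value M."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

a, p, n = (torch.randn(8, 128) for _ in range(3))   # features of image-block triplets
print(triplet_loss(a, p, n))
```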
  • the operation layer may include: a convolution layer, a grouping weighting layer, or the like.
  • the parameters of the convolutional layer include the number of convolutional layers and the size of the convolutional kernel used by each convolutional layer.
  • the extraction network 43 includes a packet weighting layer 432 .
  • the grouping weighting layer 432 is used to: use a 1-channel convolution kernel to perform a convolution operation on the third feature map to obtain X fifth feature maps; wherein, the size of the feature direction of the fifth feature map is smaller than the feature of the third feature map The size of the direction; X is an integer greater than 2; the elements of the X fifth feature maps are weighted and summed to obtain the sixth feature map; the sixth feature map is extracted to obtain the first score map.
  • the specific description of the grouping weighting layer 432 in this implementation manner can be obtained by reasoning based on the following implementation manner, and will not be repeated here.
  • the extraction network 43 includes a convolutional layer 431 and a grouped weighting layer 432 .
  • the convolution layer 431 is used for: performing a convolution operation on the third feature map to obtain the seventh feature map.
  • This embodiment of the present application does not limit the number of layers of the convolution operation and the size of the convolution kernel.
  • the purpose of performing feature extraction on the third feature map to obtain the seventh feature map is to reduce the dimension size perpendicular to the feature direction.
  • the grouping weighting layer 432 is used to: use a 1-channel convolution kernel to perform a convolution operation on the seventh feature map to obtain X fifth feature maps.
  • the size of the feature direction of the fifth feature map is smaller than the size of the feature direction of the third feature map.
  • X is an integer greater than 2.
  • the elements of the X fifth feature maps are weighted and summed to obtain the sixth feature map. Feature extraction is performed on the sixth feature map to obtain a first score map.
  • the 1-channel convolution kernel can be understood as a convolution kernel whose size perpendicular to the feature direction is 1 and whose size in the feature direction is X.
  • the size of the sixth feature map is the same as that of the fifth feature map.
  • the purpose of performing feature extraction on the sixth feature map is to compress the dimension of the feature direction of the sixth feature map to 1.
  • the grouping weighting layer 432 may perform one or more layers of convolution operations on the sixth feature map to obtain the first score map.
  • the first score map is a two-dimensional matrix, that is, the size of its feature-direction dimension is 1.
  • the design of the grouping weighting layer (or called the grouping weighting network) makes the calculation of the score map use both local and global information to find local features.
  • FIG. 12 is a schematic diagram of the logical structure of the extraction network 43 according to an embodiment of the present application.
  • FIG. 12 is drawn based on FIG. 4 .
  • the convolutional layer 431 in the extraction network 43 is used to perform a convolution operation on the third feature map of size H/4*H/4*64 to obtain a seventh feature map of size H/8*H/8*256.
  • the dimension of the feature direction of the seventh feature map is 256, and the dimension of the dimension perpendicular to the feature direction is H/8*H/8.
  • The grouping weighting layer 432 in the extraction network 43 uses a 1*1*16 convolution kernel to perform a convolution operation on the seventh feature map of size H/8*H/8*256, obtaining 16 feature maps of size H/8*H/8*16. The elements of these 16 feature maps are then weighted and summed to obtain a sixth feature map of size H/8*H/8*16: elements at the same coordinate position in the different H/8*H/8*16 feature maps are weighted and summed to produce the element at that coordinate position in the sixth feature map. Finally, a convolution operation is performed on the sixth feature map to obtain a first score map of size H/8*H/8.
  • The formula for the weighted summation of the elements of the 16 feature maps of size H/8*H/8*16 is shown in Equation 2:
  • Equation 2: s_k = Σ_ij exp(a_ij · p_ij) / Σ_k Σ_ij exp(a_k,ij · p_k,ij)
  • where s_k denotes the per-channel, per-element score; within a single channel, i indexes the i-th group, j indexes the j-th element of the i-th group, and the maximum value of k is the number of elements in a single channel; a_ij is the weight corresponding to the j-th element of the i-th group in the first channel (the weight is learned by backpropagation); a_k,ij is the weight corresponding to the j-th element of the i-th group of the k-th channel; and p is the value of the corresponding element.
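The NumPy sketch below shows one possible reading of the grouped weighted summation in Equation 2, assuming that the X = 16 fifth feature maps are stacked along a leading group axis and that the learned weights a are already available. The exact grouping layout is not fully specified above, so the shapes, the normalisation detail and all names are assumptions for illustration only.

```python
import numpy as np

def grouped_weighted_sum(groups, weights):
    """One reading of Equation 2.

    groups:  (X, H, W, C) array holding the X fifth feature maps (here X = C = 16).
    weights: (X, H, W, C) array of learned weights a_{k,ij}, one per element
             (obtained by backpropagation during training).
    Returns a sixth feature map of shape (H, W, C): the exp-weighted
    contributions of the X group elements at each coordinate, normalised
    over all channels.
    """
    e = np.exp(weights * groups)                # exp(a_{k,ij} * p_{k,ij})
    num = e.sum(axis=0)                         # sum over the X groups -> (H, W, C)
    den = num.sum(axis=-1, keepdims=True)       # normalisation over channels
    return num / den

rng = np.random.default_rng(1)
fifth_maps = rng.normal(size=(16, 28, 28, 16))  # e.g. H/8 = W/8 = 28
a = rng.normal(size=(16, 28, 28, 16)) * 0.1
sixth = grouped_weighted_sum(fifth_maps, a)
print(sixth.shape)                              # (28, 28, 16)
```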
  • In practice, when training the extraction network 43, a loss function is needed for feedback constraint (not shown in FIG. 12).
  • The loss function of local feature extraction (i.e., of the extraction network) is shown in Equation 3:
  • Equation 3: L_score(sx, sy) = log(Σ_{h,w} exp(l(sx_hw, sy_hw)))
  • sy is the label. Its value is not taken directly from the corresponding pixel position in the dataset; instead, the scores on the score map within the corresponding n*n area (n is user-defined, with a suggested value of 9*9) are computed, the score of each pixel is obtained through Equation 2, and the maximum of the scores in the n*n area is then taken as the score of the current point, i.e. the score corresponding to each pixel in the local area (pixels without a corresponding score are supplemented with a score of 0.0). A sketch of this label construction is given below.
  • sx is the score obtained by forward inference, and sy is the benchmark score given by the dataset; sx_hw and sy_hw are, respectively, the computed score and the benchmark score of the pixel in row h, column w of the image.
  • Equation 3 is a general neural-network loss function formulation: the benchmark data in the dataset constrain the data computed by the neural network, and each parameter of the network is updated through backpropagation.
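A minimal sketch of the label construction described above: for each pixel, the label sy takes the maximum score in its n*n neighbourhood (suggested 9*9), and pixels without a corresponding score are supplemented with 0.0. The zero padding at the image border and the function name are assumptions of this sketch.

```python
import numpy as np

def local_max_labels(score_map, n=9):
    """For each pixel, take the maximum score within its n*n neighbourhood;
    positions with no corresponding score contribute 0.0."""
    h, w = score_map.shape
    r = n // 2
    padded = np.zeros((h + 2 * r, w + 2 * r))   # missing scores -> 0.0
    padded[r:r + h, r:r + w] = score_map
    labels = np.empty_like(score_map)
    for i in range(h):
        for j in range(w):
            labels[i, j] = padded[i:i + n, j:j + n].max()
    return labels

sy = local_max_labels(np.random.default_rng(2).random((32, 32)))
print(sy.shape)   # (32, 32)
```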
  • the operation layer may include a convolution layer and the like.
  • the parameters of the convolutional layer include the number of convolutional layers and the size of the convolutional kernel used by each convolutional layer.
  • FIG. 13 is a schematic diagram of the logical structure of the representation network 42 according to an embodiment of the present application.
  • FIG. 13 is drawn based on FIG. 4 .
  • Based on the third feature map of size H/4*H/4*64 obtained in FIG. 4, one convolutional layer in the representation network 42 performs a convolution operation on the third feature map, and its output is then passed to another convolutional layer for a further convolution operation to obtain the fourth feature map.
  • The size of the fourth feature map may be H/8*H/8*128. That is, after processing by the representation network 42, the size perpendicular to the feature dimension is reduced, which helps to reduce the computational complexity of the subsequent image recognition process and thus improve image recognition efficiency. A shape sketch is given below.
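For illustration, the shape flow of the representation network 42 (a third feature map of size H/4*W/4*64 in, a fourth feature map of size H/8*W/8*128 out) can be sketched with a naive convolution. The kernel sizes, strides and random weights below are assumptions chosen only so that the spatial size halves; they are not the patented layer configuration.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Naive valid convolution. x: (H, W, Cin); w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    H, W, _ = x.shape
    Ho, Wo = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.zeros((Ho, Wo, w.shape[3]))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(3)
third = rng.normal(size=(64, 64, 64))          # H/4 = W/4 = 64 for a 256x256 input
w1 = rng.normal(size=(2, 2, 64, 128)) * 0.01   # stride-2 layer: H/4 -> H/8, 64 -> 128 channels
w2 = rng.normal(size=(1, 1, 128, 128)) * 0.01  # 1x1 layer keeps the spatial size
fourth = conv2d(conv2d(third, w1, stride=2), w2)
print(fourth.shape)                            # (32, 32, 128), i.e. H/8 * W/8 * 128
```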
  • In practice, when training the representation network 42, a loss function is needed for feedback constraint (not shown in FIG. 13).
  • The loss function of the local feature representation stage (i.e., of the representation network) is the triplet loss function: similar matching pairs and dissimilar matching pairs are constructed, and Equation 1 is then used to minimize the distance of the similar matching pairs and maximize the distance of the dissimilar matching pairs.
  • In this stage, all channels of a single element of the feature map are extracted as the feature, i.e. a matrix of dimension 1*128.
  • In addition, a loss function for the overall network can be established, as shown in Equation 4. The overall loss function combines the local feature score loss (i.e., of the extraction network) and the feature representation loss (i.e., of the representation network) over the set of all matching image points, using the scores of points A and B, where A and B are taken from two images that have a homography transformation relationship.
  • The loss function of Equation 4 is a global loss computation and differs from a simple weighted addition of loss functions: the weighting is applied over the global scope, which strengthens the constraints and allows a better influence on the overall loss.
  • In the image recognition stage, the forward inference of the network uses the network structure shown in Figure 2a or Figure 2b and does not include the adversarial network, the Siamese network, and so on.
  • FIG. 14 is a schematic flowchart of an image recognition method according to an embodiment of the present application.
  • the method shown in Figure 14 includes the following steps:
  • the image recognition apparatus acquires an image to be recognized. For example, a picture book robot shoots a picture book and obtains an image to be recognized.
  • the image recognition apparatus uses the first neural network to perform feature extraction on the image to be recognized to obtain a first feature map.
  • the image recognition apparatus uses the second neural network to perform feature extraction on the first feature map to obtain a second feature map, and performs dot product on the second feature map and the first feature map to obtain a third feature map.
  • the third feature map represents a feature map obtained by transforming the features of the image to be recognized into the main direction.
  • the first neural network here may be any one of the trained first neural networks 411 provided above, and the second neural network may be any one of the trained second neural networks 412 provided above.
  • the image recognition apparatus uses the second neural network to perform at least one layer of convolution operations on the first feature map to obtain the second feature map.
  • Alternatively, the image recognition apparatus uses the second neural network to perform at least one layer of convolution operations on the first feature map, and then performs at least one layer of pooling operations and/or fully connected operations on the result to obtain the second feature map. A sketch of S302-S303 is given below.
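The following is a minimal sketch of S302-S303 under a simplifying assumption: the dot product of the second feature map with the first feature map is read here as an element-wise multiplication of two feature maps of the same size. The way the second feature map itself is produced (convolution, pooling and/or fully connected operations in the second neural network) is not reproduced; all names and shapes are illustrative.

```python
import numpy as np

def orientation_correct(first_map, second_map):
    """S303 sketch: dot-multiply the second feature map with the first feature
    map to obtain the third feature map, i.e. the features of the image to be
    recognized transformed to the main direction (element-wise reading)."""
    assert first_map.shape == second_map.shape
    return first_map * second_map

rng = np.random.default_rng(4)
first = rng.normal(size=(64, 64, 64))    # first feature map, H/4 * W/4 * 64
second = rng.normal(size=(64, 64, 64))   # output of the second neural network
third = orientation_correct(first, second)
print(third.shape)                       # (64, 64, 64)
```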
  • the image recognition apparatus obtains a first score map of the image to be recognized based on the third feature map.
  • The image recognition apparatus uses a 1-channel convolution kernel to perform a convolution operation on the third feature map to obtain X fifth feature maps, where the size of the feature direction of the fifth feature map is smaller than that of the third feature map and X is an integer greater than 2; the elements of the X fifth feature maps are weighted and summed to obtain the sixth feature map; and feature extraction is performed on the sixth feature map to obtain the first score map.
  • Alternatively, the image recognition apparatus performs feature extraction on the third feature map to obtain a seventh feature map, where the dimension of the third feature map perpendicular to the feature direction is larger than the dimension of the seventh feature map perpendicular to the feature direction and X is an integer greater than 2; it then uses a 1-channel convolution kernel to perform a convolution operation on the seventh feature map to obtain X fifth feature maps, weights and sums the elements of the X fifth feature maps to obtain the sixth feature map, and performs feature extraction on the sixth feature map to obtain the first score map.
  • the size of the image to be recognized is larger than the size of the first score map.
  • the image recognition device recognizes the image to be recognized based on the third feature map and the first score map.
  • the size of the third feature map is M1*N1*P1
  • the size of the first score map is M1*N1
  • P1 is the size of the feature direction dimension
  • M1*N1 is the size perpendicular to the feature direction dimension
  • M1, N1 and P1 are all positive integers.
  • the image recognition device directly recognizes the image to be recognized based on the third feature map and the first score map.
  • the size of the third feature map is M2*N2*P2
  • the size of the first score map is M1*N1
  • P2 is the size of the feature direction dimension
  • M1, N1, P1, M2, N2 and P2 are all positive integers.
  • The image recognition device performs feature extraction on the third feature map to obtain a fourth feature map, where the size of the fourth feature map is M1*N1*P1, P1 is the size of the feature-direction dimension, and P1 is a positive integer; then, based on the fourth feature map and the first score map, it recognizes the image to be recognized, where the size of the first score map is M1*N1.
  • For an example of a specific implementation of S305, refer to the specific example in step S405 below.
  • The image recognition method provided by this embodiment of the present application uses the network described above. Because the network has rotation invariance, the features extracted by the image recognition device hardly change when the image to be recognized is rotated by any angle in the plane; therefore, the requirements on how the image to be recognized is placed and photographed are low. In addition, compared with prior-art solutions that perform image recognition with a network that does not have rotation invariance, this helps to improve the accuracy of image recognition.
  • FIG. 15 is a schematic flowchart of another image recognition method provided by an embodiment of the present application.
  • the method shown in Figure 15 includes the following steps:
  • the image recognition apparatus acquires two images that need to be matched.
  • One of the two images is the image to be recognized, and the other image is the sample image.
  • the image recognition device may be a picture book robot.
  • the image to be recognized is an image captured by a picture book robot, and the sample image is a certain page of a picture book stored in a predefined picture book database.
  • The image recognition device scales the image to be recognized to three scales, e.g. (0.5, 1, 2), inputs each scaled image to the network, and at the same time inputs the sample image scaled to the same sizes to the network.
  • the network may be a network trained in the above-mentioned training phase.
  • 0.5, 1 and 2 represent the zoom factor, respectively.
  • Scaling the image to be recognized to different sizes and performing image recognition at each of those sizes is an optional step; it helps to improve the accuracy of image recognition. A resizing sketch is given below.
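Below is a minimal sketch of building the (0.5, 1, 2) image pyramid mentioned above. Nearest-neighbour interpolation is used only to keep the sketch dependency-free, and the array names are illustrative; any resizing method could be substituted.

```python
import numpy as np

def resize_nearest(img, factor):
    """Nearest-neighbour resize used to build the (0.5, 1, 2) pyramid."""
    h, w = img.shape[:2]
    nh, nw = max(1, int(round(h * factor))), max(1, int(round(w * factor)))
    rows = np.clip((np.arange(nh) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / factor).astype(int), 0, w - 1)
    return img[rows][:, cols]

query = np.random.default_rng(5).random((256, 256, 3))      # stand-in for the image to be recognized
pyramid = {s: resize_nearest(query, s) for s in (0.5, 1, 2)}
print({s: p.shape for s, p in pyramid.items()})
```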
  • the image recognition apparatus uses the network to obtain score maps (S1, S2) and feature maps (F1, F2) at different scales through forward reasoning.
  • The score maps S1 and S2 can be regarded, respectively, as the first score map obtained after inputting the image to be recognized in S401 into the network and the first score map obtained after inputting the sample image in S401 into the network.
  • The feature maps F1 and F2 can be regarded, respectively, as the fourth feature map obtained after inputting the image to be recognized in S401 into the network and the fourth feature map obtained after inputting the sample image in S401 into the network.
  • the network in the image recognition stage does not include adversarial networks, twin networks, or networks that use a loss function for feedback adjustment.
  • The image recognition device uses an image retrieval technique (for details, refer to the prior art) and performs the following steps based on the score maps (S1, S2) at the different scales to determine the number of matching feature point pairs in F1 and F2:
  • For a feature point f1 in F1 whose corresponding score s1 in S1 satisfies s1 > T (where T is a score threshold; a point whose score is below T is not regarded as a feature point), search F2 for the feature f2 most similar to f1, for example by taking the two features with the smallest Euclidean distance as the most similar pair, where f2 corresponds to a score s2 > T in S2.
  • f1 and f2 then form a matched feature point pair.
  • S405: If the number of matching feature point pairs in F1 and F2 is greater than or equal to a preset threshold, the image recognition device takes the sample image used to obtain F2 as the recognition result of the image to be recognized; otherwise, it updates the sample image and performs S401-S405 again. A sketch of this matching and decision step is given below.
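As an illustration only, the NumPy sketch below mirrors the matching and decision logic of S404-S405: descriptors are compared by Euclidean distance, a point is kept only if its score exceeds T, and recognition succeeds when the matched-pair count reaches a preset threshold. The threshold values and all names are assumptions for this sketch.

```python
import numpy as np

def count_matches(F1, S1, F2, S2, T=0.5):
    """Count matched feature point pairs between F1 and F2 (S404 sketch).

    F1: (N1, D) and F2: (N2, D) descriptor arrays; S1, S2: their scores.
    A point f1 with score > T is matched to its Euclidean nearest neighbour
    in F2 among the points whose score is also > T.
    """
    keep2 = S2 > T
    if not keep2.any():
        return 0
    matches = 0
    for f1, s1 in zip(F1, S1):
        if s1 <= T:
            continue
        d = np.linalg.norm(F2 - f1, axis=1)
        j = int(np.argmin(np.where(keep2, d, np.inf)))  # nearest kept neighbour
        matches += 1
    return matches

rng = np.random.default_rng(6)
F1, F2 = rng.normal(size=(100, 128)), rng.normal(size=(120, 128))
S1, S2 = rng.random(100), rng.random(120)
pairs = count_matches(F1, S1, F2, S2, T=0.5)
print("recognised" if pairs >= 30 else "update the sample image and repeat S401-S405")
```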
  • This embodiment provides a specific application example of the image recognition method, and the actual implementation is not limited to this.
  • the image recognition apparatus may be divided into functional modules according to the above method examples.
  • Each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • FIG. 16 shows a schematic structural diagram of an image recognition apparatus 160 provided by an embodiment of the present application.
  • the image recognition device 160 is configured to perform the above-mentioned image recognition method, for example, to perform the image recognition method shown in FIG. 14 .
  • The image recognition apparatus 160 may include a first acquiring unit 1601, a feature extracting unit 1602, a second acquiring unit 1603 and an identifying unit 1604.
  • the first acquiring unit 1601 is used to acquire the image to be recognized.
  • The feature extraction unit 1602 is used to perform feature extraction on the image to be recognized by using the first neural network to obtain a first feature map; and to perform feature extraction on the first feature map by using the second neural network to obtain a second feature map, and dot-multiply the second feature map with the first feature map to obtain a third feature map; wherein the third feature map represents a feature map obtained by transforming the features of the image to be recognized into the main direction.
  • the second obtaining unit 1603 is configured to obtain the first score map of the image to be recognized based on the third feature map.
  • the identifying unit 1604 is configured to identify the to-be-identified image based on the third feature map and the first score map.
  • the first neural network may be the first neural network 411 above, and the second neural network may be the second neural network 412 above.
  • the first acquiring unit 1601 may perform S301
  • the feature extracting unit 1602 may perform S302 and S303
  • the second acquiring unit 1603 may perform S304
  • the identifying unit 1604 may perform S305.
  • the feature extraction unit 1602 is specifically configured to: use the second neural network to perform at least one layer of convolution operation on the first feature map to obtain the second feature map.
  • The feature extraction unit 1602 is specifically configured to: use the second neural network to perform at least one layer of convolution operations on the first feature map, and perform at least one layer of pooling operations and/or fully connected operations on the first feature map after the convolution operations, to obtain the second feature map.
  • the size of the third feature map is M1*N1*P1
  • the size of the first score map is M1*N1
  • P1 is the size of the feature direction dimension
  • M1*N1 is the dimension perpendicular to the feature direction dimension
  • M1, N1 and P1 are all positive integers.
  • the size of the third feature map is M2*N2*P2
  • the size of the first score map is M1*N1
  • P2 is the size of the feature direction dimension
  • M1, N1, P1, M2, N2, and P2 are all positive integers
  • The identification unit 1604 is specifically configured to: perform feature extraction on the third feature map to obtain the fourth feature map, where the size of the fourth feature map is M1*N1*P1, P1 is the size of the feature-direction dimension, and P1 is a positive integer; and recognize the image to be recognized based on the fourth feature map and the first score map, where the size of the first score map is M1*N1.
  • The second obtaining unit 1603 is specifically configured to: use a 1-channel convolution kernel to perform a convolution operation on the third feature map to obtain X fifth feature maps, where the size of the feature direction of the fifth feature map is smaller than the size of the feature direction of the third feature map and X is an integer greater than 2; weight and sum the elements of the X fifth feature maps to obtain the sixth feature map; and perform feature extraction on the sixth feature map to obtain the first score map.
  • The second obtaining unit 1603 is specifically configured to: perform feature extraction on the third feature map to obtain a seventh feature map, where the dimension of the third feature map perpendicular to the feature direction is greater than the dimension of the seventh feature map perpendicular to the feature direction, and X is an integer greater than 2; use a 1-channel convolution kernel to perform a convolution operation on the seventh feature map to obtain X fifth feature maps; weight and sum the elements of the X fifth feature maps to obtain the sixth feature map; and perform feature extraction on the sixth feature map to obtain the first score map.
  • the size of the image to be recognized is larger than the size of the first score map.
  • The functions implemented by the first acquiring unit 1601, the feature extracting unit 1602, the second acquiring unit 1603 and the identifying unit 1604 in the image recognition apparatus 160 may be implemented by the processor 101 in FIG. 1 executing the program code in the memory 102 of FIG. 1.
  • An embodiment of the present application further provides a chip system. The chip system includes at least one processor 111 and at least one interface circuit 112.
  • When the chip system includes one processor and one interface circuit, the processor may be the processor 111 shown in the solid line box in FIG. 17 (or the processor 111 shown in the dotted line box), and the interface circuit may be the interface circuit 112 shown in the solid line box in FIG. 17 (or the interface circuit 112 shown in the dotted line box).
  • When the chip system includes two processors and two interface circuits, the two processors include the processor 111 shown in the solid line box and the processor 111 shown in the dotted line box in FIG. 17, and the two interface circuits include the interface circuit 112 shown in the solid line box and the interface circuit 112 shown in the dashed line box in FIG. 17. This is not limited.
  • the processor 111 and the interface circuit 112 may be interconnected by wires.
  • the interface circuit 112 may be used to receive signals (eg, from a vehicle speed sensor or an edge service unit).
  • the interface circuit 112 may be used to send signals to other devices (eg, the processor 111).
  • the interface circuit 112 may read the instructions stored in the memory and send the instructions to the processor 111 .
  • When the instructions are executed by the processor 111, the image recognition apparatus can be caused to perform the various steps in the above embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in this embodiment of the present application.
  • Another embodiment of the present application further provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium; when the instructions are run on an image recognition apparatus, the image recognition apparatus performs each step performed by the image recognition apparatus in the method flow shown in the foregoing method embodiments.
  • the disclosed methods may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.
  • FIG. 18 schematically shows a conceptual partial view of a computer program product provided by an embodiment of the present application, where the computer program product includes a computer program for executing a computer process on a computing device.
  • the computer program product is provided using the signal bearing medium 120 .
  • the signal bearing medium 120 may include one or more program instructions that, when executed by one or more processors, may provide the functions, or portions thereof, described above with respect to FIG. 14 .
  • reference to one or more features of S401 to S405 in FIG. 14 may be undertaken by one or more instructions associated with the signal bearing medium 120 .
  • the program instructions in Figure 18 also describe example instructions.
  • The signal bearing medium 120 may include a computer readable medium 121, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, a memory, a read-only memory (ROM) or a random access memory (RAM), and the like.
  • the signal bearing medium 120 may include a computer recordable medium 122 such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
  • signal bearing medium 120 may include communication medium 123, such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
  • Signal bearing medium 120 may be conveyed by a wireless form of communication medium 123 (eg, a wireless communication medium that conforms to the IEEE 802.11 standard or other transmission protocol).
  • the one or more program instructions may be, for example, computer-executable instructions or logic-implemented instructions.
  • An image recognition apparatus such as the one described with respect to FIG. 14 may be configured to provide various operations, functions, or actions in response to one or more program instructions conveyed through the computer readable medium 121, the computer recordable medium 122, and/or the communication medium 123.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When a software program is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g. infrared, radio, microwave).
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center integrating one or more available media.
  • The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

本申请公开了图像识别方法和装置,涉及神经网络技术领域,有助于提高图像识别准确率。该方法包括:获取待识别图像;使用第一神经网络对待识别图像进行特征提取,得到第一特征图;使用第二神经网络对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图;其中,第三特征图表示将待识别图像的特征变换到主方向后得到的特征图;基于第三特征图获得待识别图像的第一得分图;基于第三特征图和第一得分图,对待识别图像进行识别。

Description

图像识别方法和装置
本申请要求于2020年07月31日提交国家知识产权局、申请号为202010761239.6、申请名称为“图像识别方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及神经网络技术领域,尤其涉及图像识别方法和装置。
背景技术
随着技术的发展,市面上出现了一些具有读绘本功能的幼教产品机器人(简称绘本机器人)。绘本机器人在阅读绘本之前,需要准确识别绘本。具体的,机器人先通过摄像头采集到绘本的某一页的图像,再对该图像进行局部特征检测,然后,将检测结果与数据库中预存的绘本图像模板进行匹配,得到与该检测结果匹配度最高的图像,并将与该检测结果匹配度最高的图像作为待阅读图像。后续,绘本机器人阅读该待阅读图像。
上述识别绘本的方法,对绘本的摆放位置要求较高。例如,要求绘本平摊展开放在水平面上,该水平面与绘本机器人所在的水平面一致;还要求绘本与绘本机器人之间的距离、角度满足一定要求。另外,还要求绘本机器人站立不倒等。
然而,在实际应用中,年幼的孩子通常很难按照上述要求摆放绘本,这会导致绘本机器人对识别绘本的准确率会大幅度降低,甚至无法识别。
发明内容
本申请实施例提供了图像识别方法和装置,有助于提高图像识别准确率。
为了达到上述目的,本申请提供了以下技术方案:
第一方面,提供了一种图像识别方法,包括:首先,获取待识别图像。然后,使用第一神经网络对待识别图像进行特征提取,得到第一特征图。接着,使用第二神经网络对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图;其中,第三特征图表示将待识别图像的特征变换到主方向后得到的特征图。并且,基于第三特征图获得待识别图像的第一得分图。最后,基于第三特征图和第一得分图,对待识别图像进行识别。该技术方案中,使用第二神经网络对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图,有助于构建具有旋转不变性特征的网络,基于该网络进行图像识别,有助于提高图像识别的精确度。
在一种可能的设计中,使用第一神经网络对待识别图像进行特征提取,得到第一特征图,包括:使用第一神经网络对待识别图像进行至少一层卷积操作,得到第一特征图。
在一种可能的设计中,使用第二神经网络对第一特征图进行特征提取,得到第二特征图,包括:使用第二神经网络对第一特征图执行至少一层卷积操作,得到第二特 征图。该可能的设计,通过至少一层卷积操作对第一特征图进行特征提取,得到第二特征图,操作简单。
在一种可能的设计中,使用第二神经网络对第一特征图进行特征提取,得到第二特征图,包括:使用第二神经网络对第一特征图执行至少一层卷积操作;对执行卷积操作后的第一特征图执行至少一层池化操作和/或全连接操作,得到第二特征图。该可能的设计,通过至少一层卷积操作,以及至少一层池化操作和/或全连接操作对第一特征图进行特征提取,有助于实现更复杂的特征提取,从而有助于使得特征提取的结果更精准,进而有助于提高图像识别的精确度。
在一种可能的设计中,第三特征图的尺寸是M1*N1*P1,第一得分图的尺寸是M1*N1,P1是特征方向维度的尺寸,M1*N1是垂直于特征方向维度的尺寸,M1、N1和P1均是正整数。该可能的设计,直接使用第三特征图和第一得分图对待识别图像进行识别。该方案实现简单。
在一种可能的设计中,第三特征图的尺寸是M2*N2*P2,第一得分图的尺寸是M1*N1,P2是特征方向维度的尺寸,M1、N1、P1、M2、N2和P2均是正整数。基于第三特征图和第一得分图,对待识别图像进行识别,包括:对第三特征图进行特征提取,得到第四特征图;其中,第四特征图的尺寸是M1*N1*P1;P1是特征方向维度的尺寸,P1是正整数;基于第四特征图和第一得分图,对待识别图像进行识别;其中,第一得分图的尺寸是M1*N1。基于该可选的实现方式,使用第一得分图和对第三特征图进行特征提取后得到的特征图对待识别图像进行识别,有助于改变特征图的尺寸,由于通常情况下特征图的尺寸越大,图像识别过程效率越低,而特征图的尺寸越大,该特征图越能精确表示待识别图像;因此,改变特征图的尺寸,有助于平衡图像识别过程的效率和精确度,从而提高图像识别过程的整体性能。
在一种可能的设计中,M1*N1<M2*N2。这样,有助于降低用于图像识别过程的特征图的尺寸,从而降低图像识别过程的处理复杂度,以提高图像识别过程的处理效率。
在一种可能的设计中,基于第三特征图获得待识别图像的第一得分图,包括:使用1通道卷积核,对第三特征图执行卷积操作,得到X个第五特征图;其中,第五特征图的特征方向的尺寸小于第三特征图的特征方向的尺寸;X是大于2的整数;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。该可能的设计,在获得得分图的过程中,仅对第三特征图的特征方向的尺寸进行了压缩,因此实现简单。
在一种可能的设计中,基于第三特征图获得待识别图像的第一得分图,包括:对第三特征图进行特征提取,得到第七特征图;其中,第三特征图的垂直于特征方向的维度尺寸大于第七特征图的垂直于特征方向的维度尺寸;X是大于2的整数;使用1通道卷积核,对第七特征图执行卷积操作,得到X个第五特征图;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。该可能的设计,在获得得分图的过程中,对第三特征图的特征方向的尺寸和垂直于特征方向的尺寸均进行了压缩,因此有助于降低图像处理过程的复杂度,从而提高图像识别过程的处理效率。
在一种可能的设计中,待识别图像的尺寸大于第一得分图的尺寸。由于在图像识 别过程中所使用的第一得分图的尺寸(假设是a*b)表示该过程所使用的特征图中特征的个数,因此,该可能的设计中,如果待识别图像的特征是稠密特征,则第一得分图对应的特征图为稀疏特征,使用稀疏特征进行图像识别,有助于降低图像处理过程的复杂度,从而提高图像识别过程的处理效率。
第二方面,本申请提供了一种图像识别装置。
在一种可能的设计中,该图像识别装置用于执行上述第一方面提供的任一种方法。本申请可以根据上述第一方面提供的任一种方法,对该图像识别装置进行功能模块的划分。例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。示例性的,本申请可以按照功能将该图像识别装置划分为获取单元、特征提取单元和识别单元等。上述划分的各个功能模块执行的可能的技术方案和有益效果的描述均可以参考上述第一方面或其相应的可能的设计提供的技术方案,此处不再赘述。
在另一种可能的设计中,该图像识别装置包括:存储器和一个或多个处理器,该存储器和处理器耦合。该存储器用于存储计算机指令,该处理器用于调用该计算机指令,以执行如第一方面及其任一种可能的设计方式提供的任一种方法。
第三方面,本申请提供了一种计算机可读存储介质,如计算机非瞬态的可读存储介质。其上储存有计算机程序(或指令),当该计算机程序(或指令)在图像识别装置上运行时,使得该图像识别装置执行上述第一方面中任一种可能的实现方式提供的任一种方法。
第四方面,本申请提供了一种计算机程序产品,当其在计算机上运行时,使得第一方面中的任一种可能的实现方式提供的任一种方法被执行。
第五方面,本申请提供了一种芯片系统,包括:处理器,处理器用于从存储器中调用并运行该存储器中存储的计算机程序,执行第一方面中的实现方式提供的任一种方法。
可以理解的是,上述提供的任一种图像识别装置、计算机存储介质、计算机程序产品或芯片系统等均可以应用于上文所提供的对应的方法,因此,其所能达到的有益效果可参考对应的方法中的有益效果,此处不再赘述。
在本申请中,上述图像识别装置的名字对设备或功能模块本身不构成限定,在实际实现中,这些设备或功能模块可以以其他名称出现。只要各个设备或功能模块的功能和本申请类似,属于本申请权利要求及其等同技术的范围之内。
本申请的这些方面或其他方面在以下的描述中会更加简明易懂。
附图说明
图1为可适用于本申请实施例的一种计算机设备的硬件结构示意图;
图2a为本申请实施例提供的一种深度学习网络模型的示意图;
图2b为本申请实施例提供的另一种深度学习网络模型的示意图;
图3为本申请实施例提供的一种第一神经网络的逻辑结构示意图;
图4为本申请实施例提供的一种第二神经网络的逻辑结果示意图;
图5为本申请实施例提供的一种特征图各维度的示意图;
图6为本申请实施例提供的一种获取训练数据的方法的流程示意图;
图7为可适用于本申请实施例的一种参考图像以及对参考图像进行单应性变换后得到样本图像的示意图;
图8为本申请实施例提供的一种参考数据和训练数据之间的关系的示意图;
图9为本申请实施例提供的一种前段网络、对抗网络和孪生网络之间的连接关系示意图;
图10为本申请实施例提供的对前段网络进行训练的方法的流程示意图;
图11为本申请实施例提供的一种对抗网络的逻辑结构示意图;
图12为本申请实施例提供的一种提取网络的逻辑结构示意图;
图13为本申请实施例提供的一种表示网络的逻辑结构示意图;
图14为本申请实施例提供的一种图像识别过程的流程示意图;
图15为本申请实施例提供的另一种图像识别过程的流程示意图;
图16为本申请实施例提供的一种图像识别装置的结构示意图;
图17为本申请实施例提供的一种芯片系统的结构示意图;
图18为本申请实施例提供的一种计算机程序产品的概念性局部视图。
具体实施方式
首先,说明本申请中涉及的部分术语和技术:
特征:即图像特征,可以包括颜色特征、纹理特征等、形状特征以及局部特征点等。
全局特征/局部特征:全局特征是指图像的整体属性,常见的全局特征包括颜色特征、纹理特征和形状特征等。全局特征是使用一个图像的全部特征来代表该图像,这样的特征具有大量的冗余信息。局部特征是指图像的局部属性。局部特征是使用一个图像的局部特征点来代表该图像。每个局部特征点仅包含自身所处图像块的信息,其对图像的全局信息不感知。
特征点(即局部特征点):在图像处理中,同一个物体或场景,从不同的角度采集多个图像,如果该物体或场景的相同部分能够被识别出来的结果是相同的,那么,称为这些部分具有尺度不变性。具有“尺度不变性的像素点或像素块(即多个像素点构成的像素块)”即为特征点。在一个示例中,如果图像中的一个像素点是其邻域内的极值点(如最大或最小值的点),则确定该像素点是一个特征点。
图像块(image patch):图像中的一个局部正方形区域,如4*4像素,8*8像素的图像区域。其中,a*a像素表示宽和高分别是a个像素的正方形区域,a是大于等于1的整数。
单应性变换(homograph):又称为射影变换。它把一个射影平面上的点(三维齐次矢量)映射到另一个射影平面上。满足Y=H*X,其中H为3*3的矩阵(又叫做单应性矩阵),X为源图像中的像素点的位置坐标,Y为映射到的目标图像上对应像素点的位置坐标。在绘本识别中,绘本可看做是一个平面,其对应的几何变换子集为单应性变换,决定变换的单应性矩阵为由旋转、平移、缩放等性质组成的矩阵(如3*3矩阵)。如果一个图像经单应性变换得到另一个图像,则认为这两个图像之间具有单应性变换关系。
梯度方向直方图(histogram of oriented gradient,HOG):直方图又称质量分 布图,是一种统计报告图,由一系列高度不等的纵向条纹或线段表示数据分布情况,一般用横轴表示数据类型,纵轴表示分布情况。梯度方向直方图是用来计算局部图像梯度的方向信息的统计值。
主方向:在一个图像/图像块中,通过计算相邻像素间的梯度方向(即相邻像素的矢量差值的单位向量),建立梯度方向直方图,梯度方向直方图中的峰值所处梯度即为该图像/图像块的主方向。
卷积神经网络(convolutional neural network,CNN):是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,对于大型图像处理有出色表现。
最大池化(max poling):池化层最直接的目的是降低下一层待处理的数据量。最大池化对某个滤波器抽取到若干特征值,只取得其中最大的池化层作为保留值,其他特征值全部抛弃,值最大代表只保留这些特征中最强的,抛弃其他弱的此类特征。
旋转不变性:在物理学里,假若物理系统的性质跟它在空间的取向无关,则该系统具有旋转不变性。在图像处理中,若图像在平面内任意旋转角度下,特征提取器对其提取的特征几乎不发生变化,则称该特征提取器具有旋转不变性。其中,特征提取器可以是绘本机器人,或绘本机器人中的功能模块,例如神经网络。
损失函数(loss function):损失函数是用来估量模型的预测值f(x)与真实值Y的不一致程度,它是一个非负实值函数,通常使用L(Y,f(x))来表示,损失函数越小,模型的鲁棒性就越好。一个最佳化问题的目标是将损失函数最小化。一个目标函数通常为一个损失函数本身或者为其负值。当一个目标函数为损失函数的负值时,目标函数的值寻求最大化。
稀疏特征、稠密特征:在局部特征检测中,若记录下图像中的每个像素的位置索引(index),每个index都应对应一个特征,则稀疏特征是指在index集合中,大多数的index为空,或者说大多数的index无对应的特征。而稠密特征是指,大多数的index不为空,即大多数的index具有其对应的特征描述。
局部特征检测算法:局部特征检测算法包括“提取”和“表示”两个部分。“提取”的目的是判断图像中每个像素点(或图像块)是否是特征点。“表示”即针对所有检测出的特征点,根据其邻域,表示成同一维度下的特征值。通过计算两个特征点的特征值的距离,即可判断两个特征点是否相似,进而可以根据两幅图像中相似特征点的个数或比率,判断两幅图像的相似程度。因此,局部特征检测算法的评判标准为:具有相同/相似区域的两幅图,特征点被成功匹配的匹配准确率。
高单应性变换场景,是指变换前后的特征表示差异非常大的场景(即变换前后所确定的特征点差异非常大),例如绘本识别场景。
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请的实施例中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、 “第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,除非另有说明,“多个”的含义是两个或两个以上。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个第二报文是指两个或两个以上的第二报文。本文中术语“系统”和“网络”经常可互换使用。
应理解,在本文中对各种所述示例的描述中所使用的术语只是为了描述特定示例,而并非旨在进行限制。如在对各种所述示例的描述和所附权利要求书中所使用的那样,单数形式“一个(“a”,“an”)”和“该”旨在也包括复数形式,除非上下文另外明确地指示。
还应理解,本文中所使用的术语“和/或”是指并且涵盖相关联的所列出的项目中的一个或多个项目的任何和全部可能的组合。术语“和/或”,是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本申请中的字符“/”,一般表示前后关联对象是一种“或”的关系。
还应理解,在本申请的各个实施例中,各个过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
还应理解,术语“包括”(也称“includes”、“including”、“comprises”和/或“comprising”)当在本说明书中使用时指定存在所陈述的特征、整数、步骤、操作、元素、和/或部件,但是并不排除存在或添加一个或多个其他特征、整数、步骤、操作、元素、部件、和/或其分组。
还应理解,术语“如果”可被解释为意指“当...时”(“when”或“upon”)或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定...”或“如果检测到[所陈述的条件或事件]”可被解释为意指“在确定...时”或“响应于确定...”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
应理解,说明书通篇中提到的“一个实施例”、“一实施例”、“一种可能的实现方式”意味着与实施例或实现方式有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”、“一种可能的实现方式”未必一定指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。
目前,通常采用如下局部检测算法进行绘本识别:
第一种:基于手工特征方式的局部检测算法,即局部特征点的提取和表示均是基于规则的。如极值点的判断中,需要逐一将每个像素点的像素值其与周围邻域像素点的像素值进行对比。主方向的判断中,需要逐一构建梯度方向直方图等。特征表示时,需要进行归一化、方向矫正等复杂的步骤。而这其中的每个步骤都需要通过实验设定固定的参数。
在判断两幅图像是否局部相似时,若相似区域(即物理上的同一区域在不同拍摄角度下的位置)的几何形变小,则在基于手工的局部特征检测中,其对应的特征变化较小,相对比较容易匹配正确。在绘本识别场景中,针对绘本中的同一页,当绘本所在的位置不同时,绘本机器人扫描到的该页的图像的特征变化较大。而当相似区域发生较大的几何形变时,其极值点的分布变化较大,当局部区域缩小或几何形变较大时,原本是极值点的像素点在手工的规则下可能不再被表示成为极值点,进而无法被确定为特征点,这会导致部分特征点的表示发生偏差,而无法正确匹配。
第二种:基于深度学习的方法的局部检测算法,即神经网络的输入为图像,输出为图像中每个像素点(或像素块)被认为是特征点的得分图(即每个像素点对应其能被标记为局部特征点的可能性,为0-1的概率值),以及每个像素点(或像素块)对应了特征值的特征图。该方法为非端到端的方法。一方面,该方法中特征的提取仍然依赖于手工特征提取,因此,同样会存在上述问题。另一方面,该神经网络通常是卷积神经网络,而卷积神经网络仅在一定程度上具有旋转不变性,且不会像上述方法一一样,对特征点做旋转和归一化,因此在高单应性变换场景中,变换前后的特征表示差异非常大而导致匹配准确率非常低。
基于此,本申请实施例提供了一种神经网络模型训练方法以及图像识别方法,应用于高单应性变换场景(如绘本识别场景)中。具体的:在模型训练阶段,基于多个图像训练具有旋转不变性的神经网络,更准确地说,训练相比于现有技术的卷积神经网络具有旋转不变性程度更高的神经网络。其中,该多个图像中包括具有单应性变换关系的图像。在图像识别阶段,基于该具有旋转不变性的神经网络对图像进行识别。这样,与现有技术相比,有助于使得变换前后的特征表示差异小,从而提高匹配准确率。
本申请实施例提供的神经网络模型训练方法和图像识别方法可以分别应用于相同或不同的计算机设备中。例如,神经网络模型训练方法可以由服务器或终端等计算机设备执行。图像识别方法可以由终端(如绘本机器人等)执行。本申请实施例对此不进行限定。
如图1所示,为可适用于本申请实施例的一种计算机设备10的硬件结构示意图。
参考图1,计算机设备10包括处理器101、存储器102、输入输出器件103以及总线104。其中,处理器101、存储器102以及输入输出器件103之间可以通过总线104连接。
处理器101是计算机设备10的控制中心,可以是一个通用中央处理单元(central processing unit,CPU),也可以是其他通用处理器等。其中,通用处理器可以是微处理器或者是任何常规的处理器等。
作为示例,处理器101可以包括一个或多个CPU,例如图1中所示的CPU 0和CPU 1。
存储器102可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、磁盘存储介 质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
一种可能的实现方式中,存储器101可以独立于处理器101存在。存储器102可以通过总线104与处理器101相连接,用于存储数据、指令或者程序代码。处理器101调用并执行存储器102中存储的指令或程序代码时,能够实现本申请实施例提供的神经网络模型训练方法,和/或图像识别方法。
另一种可能的实现方式中,存储器102也可以和处理器101集成在一起。
输入输出器件103,用于输入样本图像、待识别图像等参数信息,以使处理器101根据输入的参数信息,执行存储器102中的指令以执行本申请实施例提供的神经网络模型训练方法,和/或图像识别方法。通常,输入输出器件103可以是操作盘或触摸屏,或者是其他任意能够输入参数信息的器件,本申请实施例不作限定。
总线104,可以是工业标准体系结构(industry standard architecture,ISA)总线、外部设备互连(peripheral component interconnect,PCI)总线或扩展工业标准体系结构(extended industry standard architecture,EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图1中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
需要指出的是,图1中示出的结构并不构成对该计算机设备10的限定,除图1所示部件之外,该计算机设备10可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
以下,结合附图对本申请实施例提供的技术方案进行说明:
本申请实施例所采用的模型是深度学习网络模型(或神经网络模型,下文中简称网络模型)。如图2a和图2b所示,为本申请实施例提供的两种深度学习网络模型的示意图。
图2a所示的网络模型包括:前段网络41和表示网络42。
图2b所示的网络模型包括:前段网络41、表示网络42和提取网络43。
前段网络41的输入是图像,输出是该图像的第三特征图。其中,第三特征图表示将输入前段41的该图像的特征(如纹理特征等)变换到主方向后得到的特征图。在训练阶段,前段网络41的输入是样本图像。在图像识别阶段,前段网络41的输入是待识别图像。
可选的,前段网络41可以包括第一神经网络411和第二神经网络412。
第一神经网络411用于对输入的图像(即输入图像)进行特征提取,例如,对该输入图像进行至少一层卷积操作,得到第一特征图。第一特征图可以是一个三维张量,该张量中的一个元素对应输入图像中的一个区域,该区域也可以称作是卷积神经网络的感受野(receptive field)。
示例的,如图3所示,为本申请实施例提供的一种第一神经网络411的逻辑结构示意图。其中,第一神经网络411的输入图像的尺寸是H*W*3,输出的第一特征图的尺寸是H/4*W/4*64。第一神经网络411包括4层卷积层(分别标记为conv1-1,conv1-2,conv2-1,conv2-2)。
第二神经网络412用于对第一特征图进行校正,得到第三特征图。可选的,第二 神经网络412用于对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图。
在一种实现方式中,第二神经网络412具体用于对第一特征图执行至少一层卷积操作,得到第二特征图。
在另一种实现方式中,第二神经网络412具体用于对第一特征图执行至少一层卷积操作,然后对执行卷积操作后的第一特征图执行至少一层池化操作和/或全连接操作,得到第二特征图。示例的,如图4所示,为本申请实施例提供的一种第二神经网络412的逻辑结果示意图。图4是基于图3进行绘制的。第二神经网络412输入的第一特征图的尺寸是H/4*W/4*64,输出的第三特征图的尺寸是H/4*W/4*64。图4所示的第二神经网络412包括2层卷积层、1层全连接层和一层点乘层。该实现方式可以实现更复杂的特征提取,从而有助于使得特征提取的结果更精准,进而有助于提高图像识别的精确度。
基于图2a所示的网络模型:
表示网络42,用于基于第三特征图获得第一得分图。第一得分图是输入前段网络的图像的得分图。其中,第三特征图的尺寸是M 1*N 1*P 1,第一得分图的尺寸是M 1*N 1,P 1是特征方向维度的尺寸,M 1*N 1是垂直于特征方向维度的尺寸,M 1、N 1和P 1均是正整数。
如图5所示,为本申请实施例提供的一种特征图各维度的示意图。图5中是以尺寸为H/4*H/4*64的特征图为例进行说明的。在本申请实施例中,该特征图的特征方向的维度尺寸是64,垂直于特征方向的维度是H/4*H/4。其他特征图的各维度的说明与此类似,此处不再一一赘述。
基于图2a所示的网络模型,第二神经网络42输出的第三特征图作为图像识别的过程中所使用的特征图。应用于图像识别阶段时,第一得分图和第三特征图用于识别待识别图像。
结合图2a所示的网络模型和图4所示的第二神经网络412输出的第三特征图的尺寸可知,M 1*N 1*P 1等价于H/4*H/4*64,具体的,M 1=H/4,N 1=H/4,P 1=64。该情况下,第一得分图的尺寸是H/4*H/4。
基于图2b所示的网络模型:
表示网络42,用于基于第三特征图获得第一得分图。第三特征图的尺寸是M 2*N 2*P 2,第一得分图的尺寸是M 1*N 1,P 2是特征方向维度的尺寸,M 1、N 1、P 1、M 2、N 2和P 2均是正整数。可选的,M 1*N 1<M 2*N 2
提取网络43,用于对第三特征图进行特征提取,得到第四特征图。其中,第四特征图的尺寸是M 1*N 1*P 1;P 1是特征方向维度的尺寸,P 1是正整数。也就是说,这里的特征提取是为了进一步缩小特征图的垂直于特征方向维度的尺寸,这样,有助于使得后续使用第四特征图进行图像识别时,降低图像识别过程的计算复杂度,从而提高识别效率。
基于图2b所示的网络模型,应用于图像识别阶段时,基于待识别图像获得的第一得分图和第四特征图用于对待识别图像进行识别。下文中的具体示例均是以2b所示的网络模型为例进行说明的,此处统一说明,下文不再赘述。
本申请实施例提供的技术方案包括训练阶段和图像识别阶段,以下分别进行说明:
训练阶段
训练阶段包括获取训练数据阶段和模型训练阶段,以下分别进行说明:
a)、获取训练数据阶段
如图6所示,为本申请实施例提供的一种获取训练数据的方法的流程示意图,该方法的执行主体可以是计算机设备,该方法可以包括以下步骤:
S101:获取参考图像集,参考图像集包括多个参考图像;然后,获取该多个参考图像中的每个参考图像的得分图。
本申请实施例对参考图像集不进行限定。例如,参考图像集可以是现有的数据集,例如,HPatches数据集,具体可以是三维重建数据集等。
由于局部特征检测方法多用于三维建模,即时定位与地图构建(simultaneous localization and mapping,SLAM)等领域,这些领域中很少出现如绘本识别场景的高单应性变换情况,因此,这些领域所使用的训练数据集中通常不包含有高单应性变换情况下的样本图像。由于数据集的构建难度较大,成本较高,因此,在本申请的一些实施例中,基于现有的数据集进行增强,从而获得适用于高单应性变换情况下的样本图像。其中,高单应性变换情况下的样本图像包括:具有单应性变换关系的图像。增强过程可以参考S102~S103。
图像的得分图可以通过矩阵来表征。例如,该矩阵中第i行第j列的元素的取值表示该图像中的第i行第j列的像素点(或像素块)是特征点的概率。其中,i和j均是大于等于0的整数。在一个示例中,如果参考图像集是现有的数据集如HPatches数据集,则参考图像集中的参考图像的得分图可以是现有的数据集如HPatches数据集中相应图像的得分图,这样,可以直接使用现有技术中的图像的得分图,而不需要再通过计算获得,有助于降低计算复杂度。
在本申请的另一些实施例中,可以采用其他方式获得适用于高单应性变换情况下的样本图像,而非基于现有的训练数据集进行增强。相应的,参考图像集中的每个参考图像的得分图还可以通过其他方式获得,本申请实施例对此不进行限定。
S102:将多个参考图像(如每个参考图像)分别作为样本图像,并将该多个参考图像的得分图分别作为相应样本图像的得分图。并且,对多个参考图像(如每个参考图像)分别进行单应性变化,得到多个样本图像。
对参考图像进行单应性变化,具体包括:将参考图像乘以变换矩阵,得到样本图像。其中,单应性变换矩阵可以是预定义的,或者随机生成的。在S102中,对于任意一个参考图像,将该参考图像乘以一个或不同的多个变换矩阵,得到一个或多个样本图像。对于一个参考图像来说,变换矩阵与基于该参考图像得到的样本图像一一对应。
如图7所示,为可适用于本申请实施例的一种参考图像以及对参考图像进行单应性变换后得到样本图像的示意图。其中,图7中的H表示单应性变换时所采用的变换矩阵。
S103:对于基于单应性变换得到的每个样本图像来说,基于该样本图像对应的参考图像(即获得该样本图像时所采用的参考图像)的得分,以及该样本图像对应的变换矩阵(即获得该样本图像时所采用的变换矩阵),得到该样本图像的得分图。
以下,以对一个参考图像进行单应性变换得到一个样本图像为例,说明样本图像的得分图的获取方式:
首先,将该参考图像标记为D,对该参考图像进行单应性变换时所使用的变换矩阵标记为H,该参考图像中的像素点d ij(即该参考图像中的第i行第j列的像素点,i和j均是整数)的得分标记为s ij。将该参考图像中的像素点d ij乘以单应性变换系数H i,得到的像素点标记为
Figure PCTCN2021109680-appb-000001
Figure PCTCN2021109680-appb-000002
的得分标记为
Figure PCTCN2021109680-appb-000003
Figure PCTCN2021109680-appb-000004
由d ij映射得到,因此,
Figure PCTCN2021109680-appb-000005
的得分(即
Figure PCTCN2021109680-appb-000006
)受单应性变换矩阵H的影响。当图像局部产生形变而产生图像损失较大时(该损失与单应性矩阵的空间旋转参数和缩放参数具有相关关系,该相关关系可在b)的表达式中计算),就导致变换前的图像中的部分特征点映射到变换后的图像中时,被认为是特征点的可能性降低,而若映射前后,具有映射关系的像素点的得分维持不变,则会导致样本严重失真,从而导致后续的网络训练难以收敛。为此,本申请实施例提供了一种对变换后得到的图像中的得分进行估计的方法,具体可以包括以下步骤:
步骤A)、将s ij扩充为矩阵[s ij,1,1]。为了将数据进行做标准化处理,对矩阵[s ij,1,1]进行归一化操作,得到S=[a,b,c]。
步骤B)、根据参考图像和参考图像的得分图,计算参考图像变换到样本图像时所采用的得分图变换矩阵T=[λ 123]。
具体的:根据参考图像中的图像块与样本图像中的图像块之间的匹配对应关系,以参考图像中的多个图像块在形变前的得分和单应性变换矩阵H作为输入,以该多个图像形变后得到的图像块的得分作为输出,通过最小二乘法拟合,得到得分图的变换矩阵T。其中,如果参考图像中的图像块和样本图像中的图像块在物理上表示同一对象,则这两个图像块之间具有匹配对应关系。
步骤C)、基于得分图变换矩阵T,获取
Figure PCTCN2021109680-appb-000007
具体的:
若在变换过程中,样本图像上的像素点P'是参考图像上的像素点P经过变换得到的,样本图像上的像素点Q'是参考图像上的像素点Q经过变换得到的,并且,P与Q重合,则
Figure PCTCN2021109680-appb-000008
满足如下公式:
Figure PCTCN2021109680-appb-000009
其中,n是重合点的个数。
若在变换过程中,样本图像上的像素点P'是参考图像上的像素点P经过变换得到的,参考图像中存在像素点Q,其中,像素点Q是像素点P的邻域内的像素点,且Q为拟合的估计点,则
Figure PCTCN2021109680-appb-000010
满足如下公式:
Figure PCTCN2021109680-appb-000011
其中,n是邻近个数,即邻域内像素点的个数。
像素点P的邻域可以是预定义的。本申请实施例对像素点P的邻域大小和位置不进行限定。
需要说明的是,特征点得分图的每个分数由其邻域内的点共同约束,这就在约束中增加了感受野,同时弥补了数据增强过程中的样本失真问题,降低了特征点选取的偶然性。
至此,获取到了训练数据。训练数据包括:样本图像集中的样本图像和每个样本图像的得分图。其中,样本图像集包括参考图像和对参考图像进行单应性变换后得到 的图像。
如图8所示,为本申请实施例提供的一种参考数据和训练数据之间的关系的示意图。其中,参考数据包括参考图像集和参考图像集中的每个参考图像的得分图,图8中示意出了参考图像集包括参考图像1和参考图像2。训练数据包括样本图像集和样本图像集中的每个样本图像的得分图,图8中示意出了样本图像集包括:样本图像10(即参考图像1)、样本图像11(即参考图像1乘以变换矩阵11后得到的图像)、样本图像12(即参考图像1乘以变换矩阵12后得到的图像)、样本图像20(即参考图像2)和样本图像21(即参考图像2乘以变换矩阵21后得到的图像)等。图8中的双向箭头表示图像与其得分图之间的对应关系。
对于训练数据来说,将成对的两张输入的大小为H*W的样本图像,每张图像对应的图像块,图像块对应的特征矩阵(1*W维);每张图像块对应的得分的取值范围是([0,1])。构建三元组tri=(D i,D j,D k),其中,D i,D j,D k均为图像块,(D i,D j)是相似的图像块的匹配对,(D i,D k)是不相似的图像块的匹配对。其中,不相似的图像匹配块随机从同一张图,或者同一尺度下的不同图像中选取。
为了让图像在多尺度下具有鲁棒性,将多个尺度的图像作为训练的输入,本申请实施例可以根据训练数据,将图像的大小调成为(H*2)*(W*2)、H*W、(H/2)*(W/2)三个大小。其对应的得分图中,(H*2)*(W*2)大小图像对应的得分图通过插值得到,(H/2)*(W/2)大小图像对应的得分图通过下采样(max pooling)得到。
需要说明的是,训练数据是基于自然场景的标注信息,进行得分图估计,这就使得估计样本更加倾向于真实场景,从而有助于提高图像识别的精确度。
b)、模型训练阶段
基于图2a所示的网络模型,计算机设备可以先对前段网络41进行训练,再分别对表示网络42进行训练。
基于图2b所示的网络模型,计算机设备可以先对前段网络41进行训练,再分别对表示网络42和提取网络43进行训练。其中,对表示网络42和对提取网络43的训练可以是并列执行的,二者之间的训练顺序可以不分先后。
训练每个网络(包括前段网络41、表示网络42和提取网络43等)的过程可以认为是获得该网络的参数(如卷积核中每个元素的取值等)的实际值的过程。其中,这里的实际值是指应用于图像识别阶段时,该网络所使用的参数的值。
训练前段网络41
在训练前段网络41之前,可以预先配置如下信息:
前段网络41中的第一神经网络411和第二神经网络412分别包含的运算层,运算层的输入的尺寸、运算层的参数的尺寸、运算层的输出的尺寸,以及运算层之间的关联关系。其中,运算层可以包括:卷积层、池化层、全连接层或点乘层中的一项或多项等。运算层的参数包括执行该层运算时所使用的参数,例如,卷积层的参数包括卷积层的层数,以及每个卷积层所使用的卷积核的尺寸。运算层之间的关联关系,也可以称作是运算层之间的连接关系,例如,哪个运算层的输出作为哪个运算层的输入等。
可以理解的是,前段网络41中第一个运算层的输入是前段网络41的输入,前段网络41的最后一个运算层的输出是前段网络41的输出。
前段网络41的输入是图像。在一个示例中,前段网络41的输入的尺寸标记为 H*W*3。其中,H表示输入图像的高的尺寸,W表示输入图像的宽的尺寸,3表示通道数。H和W的取值可以是预定义的。
前段网络41的输出是第三特征图。第三特征图是指将输入前段网络41的图像的特征旋转到主方向后得到的特征图。
至此,完成了对前段网络的预配置过程。
对前段网络41进行预配置后,不同运算层的输入尺寸、参数尺寸和输出尺寸适配。这里的“适配”是指满足数学中矩阵/张量之间的运算关系的尺寸,例如,矩阵A和矩阵B满足点乘的原则是,矩阵A的列数等于矩阵B的行数。其他示例不再一一列举。
预配置过程结束之后,计算机设备可以为前段网络41中的各参数(例如,前段网络41中各运算层的参数)配置初始值,例如,每层卷积层所使用的卷积核均具有初始值。本申请实施例对各参数的初始值均不进行限定,例如可以是随机生成的。
执行训练前段网络41基本原理是:基于样本图像集中的图像,以及前段网络41中的各参数的初始值,在前段网络41的对抗网络44和孪生网络45的约束之下,进行训练,以实现“前段网络41输出的第三特征图是将其输入的图像的特征变换到主方向上后得到的特征图”。并将实现此目的时所使用的前段网路41的参数作为训练结果。其中,前段网络41、对抗网络44和孪生网络45之间的连接关系可以如图9所示。
训练过程的结果用于在使用前段网络41进行图像识别过程中作为前段网络的参数的值(或实际值)。
以下说明本申请实施例提供的对前段网络41进行训练的方法。该方法的执行主体可以是计算机设备。如图10所示,该方法可以包括以下步骤:
S201:将样本图像集中的任意一个图像作为输入图像输入到第一神经网络411中,第一神经网络411对该输入图像进行特征提取,得到该输入图像的第一特征图。
例如,第一神经网络411使用第一神经网络的参数的初始值,对该输入图像进行特征提取,得到该输入图像的第一特征图。
可选的,第一神经网络对该输入图像进行预设层数的卷积操作,得到该输入图像的第一特征图。示例的,基于图3,执行S201时,第一神经网络411对输入图像执行4层卷积操作,得到该输入图像的第一特征图。
需要说明的是,此仅为示例,实际实现时,第一神经网络411还可以对输入图像进行其他操作,从而得到第一特征图,本申请实施例对此不进行限定。
S202:将该输入图像的第一特征图输入到第二神经网络412中,以对该第一特征图进行特征提取,得到第三特征图。第三特征图可以理解为经第二神经网络412处理,将输入图像的特征(如纹理特征等)转换到主方向上后得到的特征图。
例如,第二神经网络412对该输入图像的第一特征图依次进行卷积操作和全连接操作,并将全连接操作的结果与该输入图像的第一特征图进行点乘操作,得到第三特征图。
示例的,基于图4,执行S202时,第二神经网络412依次对图3得到的第一特征图进行2层卷积操作和1层全连接操作,并将全连接操作的结果与该第一特征图进行点乘操作,得到第三特征图。例如,基于图4,第二神经网络412依次对图3得到的第一特征图进行2层卷积操作和1层全连接操作之后,可以得到h*w个2*2的矩阵, 将2*2*(h*w)的核作为对应通道的特征主方向的方向矩阵,通过点乘的方式,得到循环到主方向上的特征。
由于点乘是可微的,因此第二神经网络412能够在前段网络41的训练过程中进行反向传播。具体的,通过前段网络41的对抗网络44和孪生网络45进行约束,以对前段网络41进行训练,得到前段网络41的参数的实际值。以下通过步骤S203说明对抗网络44的工作原理,通过步骤S204说明孪生网络45的工作原理。
作为示例,第二神经网络412可以被称作是局部空间变换网络(LSTN)。LSTN的设计,在生成对抗网络的学习下,使得局部区域能够矫正到其主方向上,使得网络在训练高单应性变换样本时能够收敛。
S203:将第三特征图作为对抗网络44的输入,对抗网络44对第三特征图进行反卷积操作,得到第五特征图。其中,第五特征图的尺寸与前段网络41的输入图像的尺寸相同,如均是H*W*3。然后,对抗网络44将第五特征图分为多个数据块。
可选的,对抗网络44对第三特征图进行两层反卷积操作,得到第五特征图。
可选的,对抗网络44将第五特征图分为多个数据块,可以包括:对抗网络44将第五特征图均分为多个数据块。本申请实施例对每个数据块的大小不进行限定。
如图11所示,为本申请实施例提供的一种对抗网络44的逻辑结构示意图。图11是基于图4进行绘制的。具体的,基于图4得到的第三特征图(尺寸为H/4*H/4*64),对抗网络44对第三特征图进行两层反卷积操作,分别得到尺寸为H/2*H/2*32的特征图和尺寸为H*W*3的特征图(即第五特征图)。然后,将该尺寸为H*W*3的特征图中的每一层尺寸为H*W的矩阵中的元素均分为16*16的数据块。
S204:将对抗网络44生成的该多个数据块输入孪生网络45。由孪生网络45使用损失函数进行约束,来判断第三特征图是否是旋转到主方向上后的特征图。其中,孪生网络45的基本思想是最小化相似数据块匹配对之间的特征距离,同时最大化不相似数据块对的特征距离。
若是,即判断结果为第三特征图是旋转到主方向上后的特征图,则对前段网络41的训练过程结束。后续,可以将本次执行S201和S202时所使用的参数的值作为识别阶段时前段网络的参数的值。
若否,即判断结果为第三特征图不是旋转到主方向上后的特征图,则前段网络41可以向前段网络41反馈相关信息,以辅助调整前段网络41的参数的值,前段网络41的参数调整之后,重新执行S201,以此循环,直到某一次或多次执行S204时,判断结果为第三特征图是旋转到主方向上后的特征图为止。
本申请实施例对对抗网络44和孪生网络45辅助调整前段网络41的具体实现方式不进行限定。例如,可以参考现有技术中其他应用场景中,对前段网络41的参数的值的训练过程中的反馈调节过程,此处不再详述。
可选的,通过构建三元组tri的损失函数约束,来判断第三特征图是否正确。三元组的损失函数如以下公式1所示,其思想为:最小化相似图像块匹配对的特征距离,同时最大化不相似图像块匹配对的特征距离,其中M为确保模型收敛的偏置值。
公式1:L tri(D i,D j,D k)=Σ i,j,k∈Pmax(0,dist(D i,D j)-dist(D i,D k)+M)。
需要说明的是,三元组的损失函数,同时用在了特征表示和主方向的对抗网络上, 拉大了不相似特征点的分布,使得后续的匹配能更准确的得到最近邻特征。
训练提取网络43
在训练提取网络43之前,可以预先配置如下信息:
提取网络43包括的运算层,运算层的输入的尺寸、运算层的参数的尺寸、运算层的输出的尺寸,以及运算层之间的关联关系(即哪个运算层的输出作为哪个运算层的输入等)。其中,运算层可以包括:卷积层、或分组加权层等。卷积层的参数包括卷积层的层数,以及每个卷积层所使用的卷积核的尺寸。
在一种实现方式中,提取网络43包括分组加权层432。
分组加权层432用于:使用1通道卷积核,对第三特征图执行卷积操作,得到X个第五特征图;其中,第五特征图的特征方向的尺寸小于第三特征图的特征方向的尺寸;X是大于2的整数;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。关于该实现方式中分组加权层432的具体说明,可以基于下述实现方式推理得到,此处不再赘述。
在另一种实现方式中,提取网络43包括卷积层431和分组加权层432。
卷积层431用于:对第三特征图进行卷积操作,得到第七特征图。本申请实施例对卷积操作的层数和卷积核的尺寸等均不进行限定。可选的,“对第三特征图进行特征提取,得到第七特征图”的目的是为了缩小了垂直于特征方向的维度尺寸。
分组加权层432用于:使用1通道卷积核,对第七特征图执行卷积操作,得到X个第五特征图。第五特征图的特征方向的尺寸小于第三特征图的特征方向的尺寸。X是大于2的整数。将X个第五特征图的元素加权求和,得到第六特征图。对第六特征图进行特征提取,得到第一得分图。
1通道卷积核,可以理解为是垂直于特征方向的维度尺寸是1,特征方向的维度尺寸是X的卷积核。第六特征图的尺寸与第五特征图的尺寸相同。对第六特征图进行特征提取的目的在于压缩第六特征图的特征方向的维度尺寸为1。具体的,分组加权层432可以对第六特征图进行一层或多层卷积操作,从而得到第一得分图。第一得分图是一个二维矩阵,也就是说,其特征方向的维度尺寸是1。
需要说明的是,分组加权层(或称为分组加权网络)的设计,使得得分图的计算同时利用局部和全局信息来查找局部特征。
如图12所示,为本申请实施例提供的一种提取网络43的逻辑结构示意图。图12是基于图4进行绘制的。具体的:基于图4得到的尺寸为H/4*H/4*64的第三特征图,提取网络43中的卷积层431用于对尺寸为H/4*H/4*64的第三特征图进行卷积操作,得到尺寸为H/8*H/8*256的第七特征图。该第七特征图的特征方向的维度是256,垂直于特征方向的维度尺寸是H/8*H/8。提取网络43中的分组加权层432用于使用1*1*16的卷积核,对尺寸为H/8*H/8*256的第七特征图进行卷积操作,分别得到16个尺寸为H/8*H/8*16的特征图。然后,将这16个尺寸为H/8*H/8*16的特征图的元素进行加权求和,得到尺寸为H/8*H/8*16的第六特征图。其中,将不同的H/8*H/8*16的特征图中坐标位置相同的元素进行加权求和,得到第六特征图中的该坐标位置的元素。接着,对第六特征图进行卷积操作,得到尺寸为H/8*H/8的第一得分图。
16个尺寸为H/8*H/8*16的特征图的元素进行加权求和的公式如公式2所示:
公式2:s k=Σ ijexp(a ij*p ij)/Σ kΣ ijexp(a k,ij*p k,ij)
其中,s k表示逐个通道逐个元素的分数表示。其中,在一个通道中,i表示第i组,j表示第i组中的第j个元素,k的最大值为单个通道中的元素个数;a ij表示第一个通道中第i组第j个元素对应的权重(该权重由反向传播学习而来);a k,ij表示第k个通道的第i组第j个元素对应的权重;p即为对应元素的值。
需要说明的是,实际实现时,在对提取网络43进行训练的过程中,需要损失函数进行反馈约束(图12中未示出),例如,局部特征提取(即提取网络)的损失函数如公式3所示:
公式3:L score(sx,sy)=log(Σ h,wexp(l(sx hw,sy hw)))
其中,
Figure PCTCN2021109680-appb-000012
sy是标签,其值并非直接从数据集中对应像素位置中获取,而是计算对应n*n区域(n为自定义的,建议值为9*9)中得分图上的分数,通过公式2取得每个像素点的分数,继而取得n*n区域中分数中最大值作为当前点的得分。
Figure PCTCN2021109680-appb-000013
为局部区域中的每个像素点对应的分数的(无对应分数的像素点补充分数为0.0)。sx为前向推到得到的分数,sy为数据集给出的基准分数。sx hw为计算得到的图像第h行第w列分数,sy hw为数据集给出的基准中图像第h行第w列的分数。公式3的表示为通用的神经网络损失函数方式,使用数据集中的基准数据来约束神经网络计算出来的数据,通过反向传播,来更新神经网络中的各个参数。
训练表示网络42
在训练表示网络42之前,可以预先配置如下信息:
表示网络42包括的运算层,运算层的输入的尺寸、运算层的参数的尺寸、运算层的输出的尺寸,以及运算层之间的关联关系(即哪个运算层的输出作为哪个运算层的输入等)。其中,运算层可以包括卷积层等。卷积层的参数包括卷积层的层数,以及每个卷积层所使用的卷积核的尺寸。
如图13所示,为本申请实施例提供的一种表示网络42的逻辑结构示意图。图13是基于图4进行绘制的。具体的:基于图4得到的尺寸为H/4*H/4*64的第三特征图,表示网络42中的一层卷积层对第三特征图进行卷积操作,然后将输出结果输出给另一层卷积层进行卷积运算,得到第四特征图。其中,第四特征图的尺寸可以是H/8*H/8*128。也就是说,经过表示网络42进行处理后,垂直于特征维度方向的尺寸减小了。这样,有助于降低后续图像识别过程中的计算复杂度,从而提高图像识别效率。
需要说明的是,实际实现时,在对表示网络42进行训练的过程中,需要损失函数进行反馈约束(图13中未示出),例如,局部特征表示阶段(即表示网络)的损失函数使用三元组损失函数,即构建相似匹配对和不相似匹配对,进而使用公式1,最小化相似匹配对的距离,最大化不相似匹配对的距离。在该阶段中,提取特征图的单个元素的所有通道作为特征,即1*128维度的矩阵。
另外需要说明的是,实际实现时,可以建立整体网络的损失函数如公式4所示:
公式4:
Figure PCTCN2021109680-appb-000014
其中,P表示所有匹配图像点的集合,p,q分别为P中的点,二者可能为相似点或不相似点。整体的损失函数是局部特征得分(即提取网络)和特征表示(即表示网络)的损失函数的总和。
Figure PCTCN2021109680-appb-000015
Figure PCTCN2021109680-appb-000016
分别表示A,B两个点的得分,而A、B分别从两张具有单应性变换关系的图像上取出。公式4的损失函数为全局的损失计算,它不同于简单的损失函数加权相加,其目的是通过相似和不相似的匹配对共同作用,将相似对于相似对的分数交叉相乘,并计算其在全局范围的比重,从而加强了约束,使得能够更好的对整体的损失产生影响。
图像识别阶段
在图像识别阶段,网络的前向推理包含如图2a或如图2b所示的网络结构,不包含对抗网络和孪生网络等。
如图14所示,为本申请实施例提供的一种图像识别方法的流程示意图。图14所示的方法包括以下步骤:
S301:图像识别装置获取待识别图像。例如,绘本机器人拍摄绘本,得到待识别图像。
S302:图像识别装置使用第一神经网络对待识别图像进行特征提取,得到第一特征图。
S303:图像识别装置使用第二神经网络对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图。其中,第三特征图表示将待识别图像的特征变换到主方向后得到的特征图。
这里的第一神经网络可以是上文中提供的任一种训练好的第一神经网络411,第二神经网络可以是上文提供的任一种训练好的的第二神经网络412。
在一种示例中,图像识别装置使用第二神经网络对第一特征图执行至少一层卷积操作,得到第二特征图。其具体实现过程可以参考上文计算机设备执行的相关步骤。
在另一种示例中,图像识别装置使用第二神经网络对第一特征图执行至少一层卷积操作;对执行卷积操作后的第一特征图执行至少一层池化操作和/或全连接操作,得到第二特征图。其具体实现过程可以参考上文计算机设备执行的相关步骤。
S304:图像识别装置基于第三特征图获得待识别图像的第一得分图。
在一种示例中,图像识别装置使用1通道卷积核,对第三特征图执行卷积操作,得到X个第五特征图;其中,第五特征图的特征方向的尺寸小于第三特征图的特征方向的尺寸;X是大于2的整数;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。其具体实现过程可以参考上文计算机设备执行的相关步骤。
在一种示例中,图像识别装置对第三特征图进行特征提取,得到第七特征图;其中,第三特征图的垂直于特征方向的维度尺寸大于第七特征图的垂直于特征方向的维度尺寸;X是大于2的整数;使用1通道卷积核,对第七特征图执行卷积操作,得到X个第五特征图;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。其具体实现过程可以参考上文计算机设备执行的相关步骤。
可选的,待识别图像的尺寸大于第一得分图的尺寸。
S305:图像识别装置基于第三特征图和第一得分图,对待识别图像进行识别。
在一种示例中,第三特征图的尺寸是M1*N1*P1,第一得分图的尺寸是M1*N1,P1是特征方向维度的尺寸,M1*N1是垂直于特征方向维度的尺寸,M1、N1和P1均是正整数。该情况下,图像识别装置直接基于第三特征图和第一得分图,对待识别图像进行识别。
在另一种示例中,第三特征图的尺寸是M2*N2*P2,第一得分图的尺寸是M1*N1,P2是特征方向维度的尺寸,M1、N1、P1、M2、N2和P2均是正整数。该情况下,图像识别装置对第三特征图进行特征提取,得到第四特征图;其中,第四特征图的尺寸是M1*N1*P1;P1是特征方向维度的尺寸,P1是正整数;然后,基于第四特征图和第一得分图,对待识别图像进行识别;其中,第一得分图的尺寸是M1*N1。可选的,M1*N1<M2*N2。
关于S305的的具体实现方式的示例可以参考以下步骤S405中的具体示例。
本申请实施例提供的图像识别方法,利用了上文中描述的网络,由于该网络具有旋转不变性,具有旋转不变性的图像在图像处理中,该图像在平面内任意旋转角度下,图像识别装置对其提取的特征几乎不发生变化,因此,对待识别图像的摆放要求和拍摄要求均不高。另外,与现有技术中使用不具有旋转不变性的网络进行图像识别的技术方案相比,有助于提高图像识别的精确度。
以下通过一个具体示例,说明本申请实施例提供的图像识别过程。
如图15所示,为本申请实施例提供的另一种图像识别方法的流程示意图。图15所示的方法包括以下步骤:
S401:图像识别装置获取需要进行匹配的两张图像。这两张图像中的其中一张图像是待识别图像,另一张图像是样本图像。示例的,图像识别装置可以是绘本机器人。
例如,应用于绘本识别过程中时,待识别图像是绘本机器人拍摄得到的图像,样本图像是预定义的绘本数据库中存储的绘本的某一页。
S402:图像识别装置将待识别图像缩放到三个尺度如(0.5,1,2)下,将缩放后得到的图像分别输入到网络,同时,将缩放到同样尺寸的样本图像输入到网络。其中,该网络可以是上述训练阶段训练好的网络。0.5、1和2分别表示缩放倍数。
需要说明的是,将待识别图像缩放到不同尺寸,并基于不同尺寸进行图像识别,是可选的步骤。这样,有助于提高图像识别的精确度。
S403:图像识别装置使用该网络经过前向推理,得到不同尺度下的得分图(S1,S2)和特征图(F1,F2)。
例如,结合图2b,得分图S1和S2可以认为是分别将S401中的待识别图像输入网络后,得到的第一得分图,以及将S401中的样本图像输入网络后得到的第一得分图。特征图F1和F2可以认为分别是将S401中的待识别图像输入网络后,得到的第四特征图,以及将S401中的样本图像输入网络后得到的第四特征图。
关于图像识别阶段该网络的工作原理,可以参考上文中对该网络进行训练的过程,此处不再赘述。需要说明的是,与训练过程中该网络的工作原理相比,图像识别阶段的网络不包含对抗网络、孪生网络、也可以不包含使用损失函数进行反馈调节的网络。
S404:图像识别装置使用图像检索技术(具体可以参考现有技术),并基于不同 尺度下的得分图(S1,S2)执行以下步骤,以确定F1和F2中相匹配的特征点对的个数:对于F1中的特征点f1来说,其对应于S1中的得分s1>T,其中,T为得分阈值,低于该阈值不认为是特征点。在F2中搜索与f1最相似的特征f2,例如,将欧式距离最近的两个特征作为最相似的特征,其中,f2对应于S2中的得分s2>T。f1和f2为一个相匹配的特征点对。
S405:如果F1和F2中相匹配的特征点对的个数大于等于预设阈值,则图像识别装置将用于获得F2时所使用的样本图像,作为待识别图像的识别结果。否则,更新样本图像,重新执行S401-S405。
本实施例提供了一种图像识别方法的具体应用示例,实际实现时不限于此。
上述主要从方法的角度对本申请实施例提供的方案进行了介绍。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对图像识别装置进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
如图16所示,图16示出了本申请实施例提供的图像识别装置160的结构示意图。该图像识别装置160用于执行上述的图像识别方法,例如,执行图14所示的图像识别方法。示例的,图像是被装置160可以包括第一获取单元1601、特征提取单元1602、第二获取单元1603和识别单元1604。
第一获取单元1601,用于获取待识别图像。特征提取单元1602,用于使用第一神经网络对待识别图像进行特征提取,得到第一特征图;以及,使用第二神经网络对第一特征图进行特征提取,得到第二特征图,并将第二特征图与第一特征图进行点乘,得到第三特征图;其中,第三特征图表示将待识别图像的特征变换到主方向后得到的特征图。第二获取单元1603,用于基于第三特征图获得待识别图像的第一得分图。识别单元1604,用于基于第三特征图和第一得分图,对待识别图像进行识别。
作为示例,第一神经网络可以是上文中的第一神经网络411,第二神经网络可以是上文中的第二神经网络412。结合图14,第一获取单元1601可以执行S301,特征提取单元1602可以执行S302和S303,第二获取单元1603可以执行S304,识别单元1604可以执行S305。
可选的,特征提取单元1602具体用于:使用第二神经网络对第一特征图执行至少一层卷积操作,得到第二特征图。
可选的,特征提取单元1602具体用于:使用第二神经网络对第一特征图执行至少一层卷积操作;对执行卷积操作后的第一特征图执行至少一层池化操作和/或全连接操 作,得到第二特征图。
可选的,第三特征图的尺寸是M1*N1*P1,第一得分图的尺寸是M1*N1,P1是特征方向维度的尺寸,M1*N1是垂直于特征方向维度的尺寸,M1、N1和P1均是正整数。
可选的,第三特征图的尺寸是M2*N2*P2,第一得分图的尺寸是M1*N1,P2是特征方向维度的尺寸,M1、N1、P1、M2、N2和P2均是正整数;识别单元1604具体用于:对第三特征图进行特征提取,得到第四特征图;其中,第四特征图的尺寸是M1*N1*P1;P1是特征方向维度的尺寸,P1是正整数;基于第四特征图和第一得分图,对待识别图像进行识别;其中,第一得分图的尺寸是M1*N1。
可选的,M1*N1<M2*N2。
可选的,第二获取单元1603具体用于:使用1通道卷积核,对第三特征图执行卷积操作,得到X个第五特征图;其中,第五特征图的特征方向的尺寸小于第三特征图的特征方向的尺寸;X是大于2的整数;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。
可选的,第二获取单元1603具体用于:对第三特征图进行特征提取,得到第七特征图;其中,第三特征图的垂直于特征方向的维度尺寸大于第七特征图的垂直于特征方向的维度尺寸;X是大于2的整数;使用1通道卷积核,对第七特征图执行卷积操作,得到X个第五特征图;将X个第五特征图的元素加权求和,得到第六特征图;对第六特征图进行特征提取,得到第一得分图。
可选的,待识别图像的尺寸大于第一得分图的尺寸。
关于上述可选方式的具体描述可以参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种图像识别装置160的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
作为示例,结合图1,图像识别装置160中的第一获取单元1601、特征提取单元1602、第二获取单元1603和识别单元1604实现的功能可以通过图1中的处理器101执行图1中的存储器102中的程序代码实现。
本申请实施例还提供一种芯片系统,如图17所示,该芯片系统包括至少一个处理器111和至少一个接口电路112。作为示例,当该芯片系统110包括一个处理器和一个接口电路时,则该一个处理器可以是图11中实线框所示的处理器111(或者是虚线框所示的处理器111),该一个接口电路可以是图11中实线框所示的接口电路112(或者是虚线框所示的接口电路112)。当该芯片系统110包括两个处理器和两个接口电路时,则该两个处理器包括图11中实线框所示的处理器111和虚线框所示的处理器111,该两个接口电路包括图11中实线框所示的接口电路112和虚线框所示的接口电路112。对此不作限定。
处理器111和接口电路112可通过线路互联。例如,接口电路112可用于接收信号(例如从车速传感器或边缘服务单元接收信号)。又例如,接口电路112可用于向其它装置(例如处理器111)发送信号。示例性的,接口电路112可读取存储器中存储的指令,并将该指令发送给处理器111。当所述指令被处理器111执行时,可使得图像识别装置执行上述实施例中的各个步骤。当然,该芯片系统还可以包含其他分立器件,本申请实施例对此不作具体限定。
本申请另一实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当指令在图像识别装置上运行时,该图像识别装置执行上述方法实施例所示的方法流程中该图像识别装置执行的各个步骤。
在一些实施例中,所公开的方法可以实施为以机器可读格式被编码在计算机可读存储介质上的或者被编码在其它非瞬时性介质或者制品上的计算机程序指令。
图18示意性地示出本申请实施例提供的计算机程序产品的概念性局部视图,所述计算机程序产品包括用于在计算设备上执行计算机进程的计算机程序。
在一个实施例中,计算机程序产品是使用信号承载介质120来提供的。所述信号承载介质120可以包括一个或多个程序指令,其当被一个或多个处理器运行时可以提供以上针对图14描述的功能或者部分功能。因此,例如,参考图14中S401~S405的一个或多个特征可以由与信号承载介质120相关联的一个或多个指令来承担。此外,图18中的程序指令也描述示例指令。
在一些示例中,信号承载介质120可以包含计算机可读介质121,诸如但不限于,硬盘驱动器、紧密盘(CD)、数字视频光盘(DVD)、数字磁带、存储器、只读存储记忆体(read-only memory,ROM)或随机存储记忆体(random access memory,RAM)等等。
在一些实施方式中,信号承载介质120可以包含计算机可记录介质122,诸如但不限于,存储器、读/写(R/W)CD、R/W DVD、等等。
在一些实施方式中,信号承载介质120可以包含通信介质123,诸如但不限于,数字和/或模拟通信介质(例如,光纤电缆、波导、有线通信链路、无线通信链路、等等)。
信号承载介质120可以由无线形式的通信介质123(例如,遵守IEEE 802.11标准或者其它传输协议的无线通信介质)来传达。一个或多个程序指令可以是,例如,计算机可执行指令或者逻辑实施指令。
在一些示例中,诸如针对图14描述的图像识别装置可以被配置为,响应于通过计算机可读介质121、计算机可记录介质122、和/或通信介质123中的一个或多个程序指令,提供各种操作、功能、或者动作。
应该理解,这里描述的布置仅仅是用于示例的目的。因而,本领域技术人员将理解,其它布置和其它元素(例如,机器、接口、功能、顺序、和功能组等等)能够被取而代之地使用,并且一些元素可以根据所期望的结果而一并省略。另外,所描述的元素中的许多是可以被实现为离散的或者分布式的组件的、或者以任何适当的组合和位置来结合其它组件实施的功能实体。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时,可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机执行指令时,全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或者数据中心通过有线(例如 同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可以用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带),光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种图像识别方法,其特征在于,包括:
    获取待识别图像;
    使用第一神经网络对所述待识别图像进行特征提取,得到第一特征图;
    使用第二神经网络对所述第一特征图进行特征提取,得到第二特征图,并将所述第二特征图与所述第一特征图进行点乘,得到第三特征图;其中,所述第三特征图表示将所述待识别图像的特征变换到主方向后得到的特征图;
    基于所述第三特征图获得所述待识别图像的第一得分图;
    基于所述第三特征图和所述第一得分图,对所述待识别图像进行识别。
  2. 根据权利要求1所述的方法,其特征在于,所述使用第二神经网络对所述第一特征图进行特征提取,得到第二特征图,包括:
    使用所述第二神经网络对所述第一特征图执行至少一层卷积操作,得到所述第二特征图。
  3. 根据权利要求1所述的方法,其特征在于,所述使用第二神经网络对所述第一特征图进行特征提取,得到第二特征图,包括:
    使用所述第二神经网络对所述第一特征图执行至少一层卷积操作;
    对执行卷积操作后的所述第一特征图执行至少一层池化操作和/或全连接操作,得到所述第二特征图。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,
    所述第三特征图的尺寸是M1*N1*P1,所述第一得分图的尺寸是M1*N1,P1是特征方向维度的尺寸,所述M1*N1是垂直于特征方向维度的尺寸,M1、N1和P1均是正整数。
  5. 根据权利要求1至3任一项所述的方法,其特征在于,所述第三特征图的尺寸是M2*N2*P2,所述第一得分图的尺寸是M1*N1,P2是特征方向维度的尺寸,M1、N1、P1、M2、N2和P2均是正整数;
    所述基于所述第三特征图和所述第一得分图,对所述待识别图像进行识别,包括:
    对所述第三特征图进行特征提取,得到第四特征图;其中,所述第四特征图的尺寸是M1*N1*P1;P1是特征方向维度的尺寸,P1是正整数;
    基于所述第四特征图和所述第一得分图,对所述待识别图像进行识别;其中,所述第一得分图的尺寸是M1*N1。
  6. 根据权利要求5所述的方法,其特征在于,M1*N1<M2*N2。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述基于所述第三特征图获得所述待识别图像的第一得分图,包括:
    使用1通道卷积核,对所述第三特征图执行卷积操作,得到X个第五特征图;其中,所述第五特征图的特征方向的尺寸小于所述第三特征图的特征方向的尺寸;X是大于2的整数;
    将所述X个第五特征图的元素加权求和,得到第六特征图;
    对所述第六特征图进行特征提取,得到所述第一得分图。
  8. 根据权利要求1至6任一项所述的方法,其特征在于,所述基于所述第三特征 图获得所述待识别图像的第一得分图,包括:
    对所述第三特征图进行特征提取,得到第七特征图;其中,所述第三特征图的垂直于特征方向的维度尺寸大于所述第七特征图的垂直于特征方向的维度尺寸;X是大于2的整数;
    使用1通道卷积核,对所述第七特征图执行卷积操作,得到X个第五特征图;
    将所述X个第五特征图的元素加权求和,得到第六特征图;
    对所述第六特征图进行特征提取,得到所述第一得分图。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述待识别图像的尺寸大于所述第一得分图的尺寸。
  10. An image recognition apparatus, characterized by comprising:
    a first acquiring unit, configured to acquire an image to be recognized;
    a feature extraction unit, configured to: perform feature extraction on the image to be recognized by using a first neural network, to obtain a first feature map; and perform feature extraction on the first feature map by using a second neural network, to obtain a second feature map, and perform a dot product of the second feature map and the first feature map, to obtain a third feature map, wherein the third feature map represents a feature map obtained by transforming features of the image to be recognized to a principal direction;
    a second acquiring unit, configured to obtain a first score map of the image to be recognized based on the third feature map; and
    a recognition unit, configured to recognize the image to be recognized based on the third feature map and the first score map.
  11. The apparatus according to claim 10, characterized in that
    the feature extraction unit is specifically configured to perform at least one layer of convolution operations on the first feature map by using the second neural network, to obtain the second feature map.
  12. The apparatus according to claim 10, characterized in that the feature extraction unit is specifically configured to:
    perform at least one layer of convolution operations on the first feature map by using the second neural network; and
    perform at least one layer of pooling operations and/or fully connected operations on the first feature map on which the convolution operations have been performed, to obtain the second feature map.
  13. The apparatus according to any one of claims 10 to 12, characterized in that
    a size of the third feature map is M1*N1*P1, a size of the first score map is M1*N1, P1 is a size in a feature direction dimension, M1*N1 is a size perpendicular to the feature direction dimension, and M1, N1, and P1 are all positive integers.
  14. The apparatus according to any one of claims 10 to 12, characterized in that a size of the third feature map is M2*N2*P2, a size of the first score map is M1*N1, P2 is a size in a feature direction dimension, and M1, N1, P1, M2, N2, and P2 are all positive integers; and the recognition unit is specifically configured to:
    perform feature extraction on the third feature map, to obtain a fourth feature map, wherein a size of the fourth feature map is M1*N1*P1, P1 is a size in the feature direction dimension, and P1 is a positive integer; and
    recognize the image to be recognized based on the fourth feature map and the first score map, wherein the size of the first score map is M1*N1.
  15. The apparatus according to claim 14, characterized in that M1*N1<M2*N2.
  16. The apparatus according to any one of claims 10 to 15, characterized in that the second acquiring unit is specifically configured to:
    perform convolution operations on the third feature map by using 1-channel convolution kernels, to obtain X fifth feature maps, wherein a size of the fifth feature maps in a feature direction is smaller than a size of the third feature map in the feature direction, and X is an integer greater than 2;
    perform a weighted summation of elements of the X fifth feature maps, to obtain a sixth feature map; and
    perform feature extraction on the sixth feature map, to obtain the first score map.
  17. The apparatus according to any one of claims 10 to 15, characterized in that the second acquiring unit is specifically configured to:
    perform feature extraction on the third feature map, to obtain a seventh feature map, wherein a dimension size of the third feature map perpendicular to a feature direction is larger than a dimension size of the seventh feature map perpendicular to the feature direction, and X is an integer greater than 2;
    perform convolution operations on the seventh feature map by using 1-channel convolution kernels, to obtain X fifth feature maps;
    perform a weighted summation of elements of the X fifth feature maps, to obtain a sixth feature map; and
    perform feature extraction on the sixth feature map, to obtain the first score map.
  18. The apparatus according to any one of claims 10 to 17, characterized in that a size of the image to be recognized is larger than a size of the first score map.
  19. An image recognition apparatus, characterized by comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to perform the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 9.
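For illustration only, the following minimal sketch gives one possible reading of the score-map branch recited in claims 7, 8, 16, and 17: X single-output-channel convolutions produce X fifth feature maps, their elements are combined by a weighted summation into a sixth feature map, and further feature extraction yields the first score map. The value of X, the kernel sizes, the learnable weights, and the final activation are assumptions made for the sketch, not features disclosed by the embodiments.

import torch
import torch.nn as nn

class ScoreMapHead(nn.Module):
    """Hypothetical sketch of the score-map branch of claims 7/8; X, kernels, and weights are assumed."""

    def __init__(self, in_channels: int = 128, num_maps: int = 8):  # num_maps plays the role of X (> 2)
        super().__init__()
        # X "1-channel" convolution kernels, read here as convolutions with a single output
        # channel: each yields one fifth feature map whose feature-direction size (1) is
        # smaller than that of the third feature map (in_channels).
        self.kernels = nn.ModuleList(
            [nn.Conv2d(in_channels, 1, kernel_size=3, padding=1) for _ in range(num_maps)]
        )
        # Learnable weights for the element-wise weighted summation into the sixth feature map.
        self.weights = nn.Parameter(torch.full((num_maps,), 1.0 / num_maps))
        # Further feature extraction that yields the first score map.
        self.refine = nn.Sequential(nn.Conv2d(1, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, third_feature_map: torch.Tensor) -> torch.Tensor:
        fifth_maps = [k(third_feature_map) for k in self.kernels]          # X fifth feature maps
        stacked = torch.stack(fifth_maps, dim=0)                           # shape (X, B, 1, H, W)
        sixth = (self.weights.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)   # weighted sum -> sixth feature map
        return self.refine(sixth)                                          # first score map

# Example use of the sketch:
# score_map = ScoreMapHead()(torch.rand(1, 128, 56, 56))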
PCT/CN2021/109680 2020-07-31 2021-07-30 图像识别方法和装置 WO2022022695A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010761239.6A CN112084849A (zh) 2020-07-31 2020-07-31 图像识别方法和装置
CN202010761239.6 2020-07-31

Publications (1)

Publication Number Publication Date
WO2022022695A1 (zh)

Family

ID=73735207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109680 WO2022022695A1 (zh) 2020-07-31 2021-07-30 图像识别方法和装置

Country Status (2)

Country Link
CN (1) CN112084849A (zh)
WO (1) WO2022022695A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084849A (zh) * 2020-07-31 2020-12-15 华为技术有限公司 图像识别方法和装置
CN113065585B (zh) * 2021-03-23 2021-12-28 北京亮亮视野科技有限公司 图像合成模型的训练方法、装置与电子设备
CN114185431B (zh) * 2021-11-24 2024-04-02 安徽新华传媒股份有限公司 一种基于mr技术的智能媒体交互方法
CN114155492A (zh) * 2021-12-09 2022-03-08 华电宁夏灵武发电有限公司 高空作业安全带挂绳高挂低用识别方法、装置和电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678311A (zh) * 2016-01-12 2016-06-15 北京环境特性研究所 一种用于模板识别的空间目标isar图像处理方法
CN108229497A (zh) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 图像处理方法、装置、存储介质、计算机程序和电子设备
US20190065817A1 (en) * 2017-08-29 2019-02-28 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN109978077A (zh) * 2019-04-08 2019-07-05 南京旷云科技有限公司 视觉识别方法、装置和系统及存储介质
CN112084849A (zh) * 2020-07-31 2020-12-15 华为技术有限公司 图像识别方法和装置

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842220A (zh) * 2022-03-24 2022-08-02 西北工业大学 基于多源影像匹配的无人机视觉定位方法
CN114842220B (zh) * 2022-03-24 2024-02-27 西北工业大学 基于多源影像匹配的无人机视觉定位方法
CN117197487A (zh) * 2023-09-05 2023-12-08 东莞常安医院有限公司 一种免疫胶体金诊断试纸条自动识别系统
CN117197487B (zh) * 2023-09-05 2024-04-12 东莞常安医院有限公司 一种免疫胶体金诊断试纸条自动识别系统
CN117765779A (zh) * 2024-02-20 2024-03-26 厦门三读教育科技有限公司 基于孪生神经网络的儿童绘本智能化导读方法及系统
CN117765779B (zh) * 2024-02-20 2024-04-30 厦门三读教育科技有限公司 基于孪生神经网络的儿童绘本智能化导读方法及系统

Also Published As

Publication number Publication date
CN112084849A (zh) 2020-12-15

Similar Documents

Publication Publication Date Title
WO2022022695A1 (zh) 图像识别方法和装置
WO2019100724A1 (zh) 训练多标签分类模型的方法和装置
US9984280B2 (en) Object recognition system using left and right images and method
JP2018506788A (ja) 物体の再同定の方法
CN109446889B (zh) 基于孪生匹配网络的物体追踪方法及装置
CN110909618B (zh) 一种宠物身份的识别方法及装置
AU2020104423A4 (en) Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network
CN110059728B (zh) 基于注意力模型的rgb-d图像视觉显著性检测方法
WO2020098257A1 (zh) 一种图像分类方法、装置及计算机可读存储介质
WO2023124278A1 (zh) 图像处理模型的训练方法、图像分类方法及装置
CN111783748A (zh) 人脸识别方法、装置、电子设备及存储介质
CN110852311A (zh) 一种三维人手关键点定位方法及装置
CN108875505B (zh) 基于神经网络的行人再识别方法和装置
CN111291768A (zh) 图像特征匹配方法及装置、设备、存储介质
JP6107531B2 (ja) 特徴抽出プログラム及び情報処理装置
CN114419349B (zh) 一种图像匹配方法和装置
CN112084952B (zh) 一种基于自监督训练的视频点位跟踪方法
CN114140623A (zh) 一种图像特征点提取方法及系统
JP7336653B2 (ja) ディープラーニングを利用した屋内位置測位方法
WO2021179822A1 (zh) 人体特征点的检测方法、装置、电子设备以及存储介质
KR101715782B1 (ko) 물체 인식 시스템 및 그 물체 인식 방법
WO2021237973A1 (zh) 图像定位模型获取方法及装置、终端和存储介质
CN114550022A (zh) 模型训练方法及装置、电子设备和可读存储介质
CN110738225B (zh) 图像识别方法及装置
CN111275183A (zh) 视觉任务的处理方法、装置和电子系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21849545

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21849545

Country of ref document: EP

Kind code of ref document: A1