CN113011309A - Image recognition method, apparatus, device, medium, and program product - Google Patents

Image recognition method, apparatus, device, medium, and program product

Info

Publication number
CN113011309A
Authority
CN
China
Prior art keywords
image
target
sample
detection model
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110277135.2A
Other languages
Chinese (zh)
Inventor
王学占
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110277135.2A priority Critical patent/CN113011309A/en
Publication of CN113011309A publication Critical patent/CN113011309A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The application discloses an image recognition method, apparatus, device, medium, and program product, and relates to artificial intelligence fields such as natural language processing, computer vision, and deep learning. One embodiment of the method comprises: acquiring a target image; inputting the target image into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; inputting the image features into a fully connected network layer of the target detection model to obtain an image category corresponding to the target image; and inputting the image features into a convolutional network layer of the target detection model to obtain position information of an object on the target image.

Description

Image recognition method, apparatus, device, medium, and program product
Technical Field
The embodiments of the application relate to the field of computers, in particular to artificial intelligence fields such as natural language processing, computer vision, and deep learning, and specifically to an image recognition method, apparatus, device, medium, and program product.
Background
In recent years, the rapid development of artificial intelligence (AI) technology has greatly increased the intelligence of everyday life. In image recognition, AI generally addresses two types of problems: classification and position regression, that is, classifying the target image and regressing the position coordinates of the object in the target image.
Currently, both the classification of the target image and the position regression of the target image are achieved by a single convolutional layer.
Disclosure of Invention
The embodiments of the application provide an image recognition method, apparatus, device, medium, and program product.
In a first aspect, an embodiment of the present application provides an image recognition method, including: acquiring a target image; inputting the target image into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; inputting the image features into a fully connected network layer of the target detection model to obtain an image category corresponding to the target image; and inputting the image features into a convolutional network layer of the target detection model to obtain position information of an object on the target image.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including: an image acquisition module configured to acquire a target image; a first obtaining module configured to input the target image into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; a second obtaining module configured to input the image features into a fully connected network layer of the target detection model to obtain an image category corresponding to the target image; and a third obtaining module configured to input the image features into a convolutional network layer of the target detection model to obtain position information of an object on the target image.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
In a fifth aspect, embodiments of the present application propose a computer program product comprising a computer program that, when executed by a processor, implements the method as described in the first aspect.
According to the image recognition method, apparatus, device, medium, and program product provided by the embodiments of the application, a target image is first acquired; the target image is input into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; the image features are then input into a fully connected network layer of the target detection model to obtain the image category corresponding to the target image; and finally, the image features are input into a convolutional network layer of the target detection model to obtain the position information of the object on the target image. The image category of the target image is obtained by the fully connected network layer of the target detection model, and the position information of the object on the target image is obtained by the convolutional network layer, so that classification and position regression are split into two independent branches, which can significantly improve the accuracy of image recognition.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an image recognition method according to the present application;
FIG. 3 is a schematic diagram of the YOLO model according to the present application;
FIG. 4 is a flow diagram of another embodiment of an image recognition method according to the present application;
FIG. 5 is a diagram of an application scenario of an image recognition method according to the present application;
FIG. 6 is a flow diagram of one embodiment of a training target detection model of the present application;
FIG. 7 is a schematic block diagram of one embodiment of an image recognition device according to the present application;
fig. 8 is a block diagram of an electronic device for implementing the image recognition method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the image recognition method or image recognition apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit video frames or the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as news-like applications, web browser applications, search-like applications, image processing-like applications, and so forth.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-described electronic devices and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not particularly limited herein.
The server 105 may provide various services. For example, the server 105 may analyze and process videos displayed on the terminal devices 101, 102, 103 and generate a processing result (e.g., a video into which a bullet-screen comment is inserted at an appropriate time).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module, which is not particularly limited herein.
It should be noted that the image recognition method provided in the embodiment of the present application may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105.
It should be further noted that the trained target detection model may be stored locally on the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may not include the network 104 and the server 105.
It should be noted that the server 105 may also store the target image locally and acquire it locally, or acquire the target image from the terminal devices 101, 102, and 103. In this case, the exemplary system architecture 100 may not include the terminal devices 101, 102, 103 and the network 104.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an image recognition method according to the present application is shown. The image recognition method comprises the following steps:
step 201, acquiring a target image.
In the present embodiment, the executing body of the image recognition method (e.g., the server 105 shown in fig. 1) may obtain the target image locally or from a terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1); alternatively, the executing body of the image recognition method (e.g., the terminal devices 101, 102, 103 shown in fig. 1) may acquire the target image locally or via a photographing device such as a camera. The target image may comprise characters, animation, pictures, and the like; the target image may be one or several frames of a video, or a single picture.
Step 202, inputting the target image into a backbone network layer of a pre-trained target detection model to obtain the image characteristics of the target image.
In this embodiment, the executing body may input the target image into the backbone network layer of the pre-trained target detection model to obtain the image features of the target image. The target detection model may include a Deep Learning Network (DLN), such as a Convolutional Neural Network (CNN). Here, the target detection model may generally include a backbone network layer, a fully connected network layer, and a convolutional network layer. The backbone network layer can be used to extract image features from the target image; it may be, for example, a VGG (Visual Geometry Group) network, a ResNet (Residual Neural Network), or a Feature Pyramid Network (FPN).
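As an illustrative, non-limiting sketch of step 202 (assuming a Python/PyTorch environment with torchvision 0.13 or later), a backbone such as ResNet-18 could be used to extract image features from the target image; the backbone choice, the 224x224 input size, and the variable names are assumptions for illustration, not requirements of the application:

```python
import torch
import torchvision.models as models

# Backbone network layer: ResNet-18 with its classification head removed
# (drop the final average-pooling and fully connected layers).
backbone = torch.nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])
backbone.eval()

target_image = torch.randn(1, 3, 224, 224)  # placeholder for the acquired target image
with torch.no_grad():
    image_features = backbone(target_image)
print(image_features.shape)  # torch.Size([1, 512, 7, 7])
```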
Here, the image features of the target image may include an image category feature and a position information feature. The image category feature may include all features used to determine the image category, and may be used to determine the image category of the target image. The position information feature may include all features used to determine the position information of the object on the target image, and may be used to determine that position information.
It should be noted that image features comprising an image category feature and a position information feature are merely an example; in this application, the image features are not limited to the image category feature and the position information feature, and details are not repeated herein.
In a specific example, the backbone network layer may be an FPN layer, which may be used to perform feature extraction on the target image to obtain, among the image features of the target image, the category feature of the object and the position information feature of the object.
Step 203, inputting the image characteristics into the full-connection network layer of the target detection model to obtain the image category corresponding to the target image.
In this embodiment, the executing body may input the image features of the target image into the fully connected network layer of the target detection model to obtain the image category corresponding to the target image. The fully connected network layer may be used to determine the image category of the target image.
Here, the image category may be the category to which the object in the target image belongs; for example, if the category to which the object belongs is person, the category of the target image is person; if the category to which the object belongs is girl, the category of the target image is girl; if the category to which the object belongs is background (e.g., mountain, water, cloud, fog, etc.), the category of the target image is background.
In a specific example, the executing body may input the image category feature among the image features of the target image into the fully connected network layer of the target detection model to obtain the image category corresponding to the target image.
In one specific example, the image category may be expressed as a confidence level, and the specific category of the target image is determined when the confidence level satisfies a preset confidence threshold. The preset confidence threshold may be set according to the required category recognition accuracy or set manually.
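A minimal sketch of this confidence check (assuming PyTorch; the 0.5 threshold is an assumed example value, not one specified by the application) might be:

```python
import torch

logits = torch.tensor([2.3, 0.1, -1.2])      # per-class scores from the fully connected layer
confidences = torch.softmax(logits, dim=0)   # convert scores to confidence levels
conf, category = confidences.max(dim=0)
if conf.item() >= 0.5:                        # preset confidence threshold (assumed value)
    print(f"image category {category.item()} with confidence {conf.item():.2f}")
else:
    print("no category satisfies the preset confidence threshold")
```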
Step 204, inputting the image characteristics into the convolution network layer of the target detection model to obtain the position information of the object on the target image.
In this embodiment, the executing body may input the image features into the convolutional network layer of the target detection model to obtain the position information of the object on the target image. The convolutional network layer may be used to determine the position information of the object in the target image.
Here, the position information may be an arbitrary position on the target image; for example, in the middle. The objects may include creatures, text, emoticons, icons, pictures (e.g., background, etc.), and the like.
In a specific example, the executing body may input the position information feature among the image features of the target image into the convolutional network layer of the target detection model to obtain the position information of the object on the target image.
In one specific example, the position information of the object on the target image may be a geometric center of an anchor frame where the object is located; for example, the coordinates of the midpoint of the anchor frame (regular anchor frame) on the target image are position information of the object on the target image.
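For illustration only, the geometric center of an anchor frame could be computed as below; the corner-coordinate convention and the example values are assumptions:

```python
def anchor_center(x1, y1, x2, y2):
    """Return the midpoint of an axis-aligned anchor frame given its corner coordinates."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

print(anchor_center(40, 60, 120, 180))  # (80.0, 120.0) -> position information of the object
```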
It should be noted that the execution sequence of step 203 and step 204 may be: step 204 is executed first, and then step 203 is executed; or, step 203 is executed first, and then step 204 is executed; alternatively, step 203 and step 204 are performed simultaneously.
According to the image recognition method provided by the embodiment of the application, a target image is first acquired; the target image is input into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; the image features are then input into a fully connected network layer of the target detection model to obtain the image category corresponding to the target image; and finally, the image features are input into a convolutional network layer of the target detection model to obtain the position information of the object on the target image. The image category of the target image is obtained by the fully connected network layer of the target detection model, and the position information of the object on the target image is obtained by the convolutional network layer, so that classification and position regression are split into two independent branches, which can significantly improve the accuracy of image recognition.
In the image recognition method provided in the embodiment of the present application, the fully connected network layer in step 203 may include at least one fully connected layer, wherein the backbone network layer is connected to one end of the first fully connected layer of the at least one fully connected layer, the at least one fully connected layer are connected in sequence, and one end of the last fully connected layer of the at least one fully connected layer is connected to the convolutional network layer.
It should be noted that the number of fully connected layers may be determined according to the image recognition accuracy and/or the device sensitivity.
In this implementation, the classification of the target image may be achieved through at least one fully connected layer.
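A minimal sketch of such a fully connected classification branch (assuming PyTorch; the feature size, hidden width, two-layer depth, and class count are illustrative assumptions) could be:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Fully connected network layer: maps image features to per-class scores."""
    def __init__(self, in_features=512 * 7 * 7, hidden=1024, num_classes=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, hidden),  # first fully connected layer (fed by the backbone)
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),  # last fully connected layer
        )

    def forward(self, image_features):
        return self.fc(image_features)

head = ClassificationHead()
scores = head(torch.randn(1, 512, 7, 7))
print(scores.shape)  # torch.Size([1, 20])
```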
In some optional implementations of this embodiment, the convolutional network layer in step 204 may include at least one convolutional layer, wherein the backbone network layer is connected to one end of the first convolutional layer of the at least one convolutional layer, the at least one convolutional layer are connected in sequence, and one end of the last convolutional layer of the at least one convolutional layer is connected to the fully connected network layer.
Here, the target detection model may include: a backbone network layer, at least one fully connected layer, and at least one convolutional layer; or a backbone network layer, a fully connected network layer, and at least one convolutional layer; or a backbone network layer, at least one fully connected layer, and a convolutional network layer.
In one specific example, the convolutional network layer may be an FCN (Fully Convolutional Network) layer, which may include convolutional layers and pooling layers.
It should be noted that the convolutional layers may be linear convolutional layers, and the number of convolutional layers may be determined according to the image recognition accuracy and/or the device sensitivity.
In this implementation, positional regression of the object in the target image may be achieved through at least one convolutional layer.
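A minimal sketch of such a convolutional position-regression branch (assuming PyTorch; predicting four box parameters per spatial cell is an illustrative convention, not taken from the application) could be:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Convolutional network layer: maps image features to position information."""
    def __init__(self, in_channels=512, num_box_params=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),  # first convolutional layer
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_box_params, kernel_size=1),          # last convolutional layer
        )

    def forward(self, image_features):
        return self.conv(image_features)

head = RegressionHead()
boxes = head(torch.randn(1, 512, 7, 7))
print(boxes.shape)  # torch.Size([1, 4, 7, 7])
```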
In some optional implementations of this embodiment, the target detection model may be a YOLO (You Only Look Once) model.
In the present implementation, the target detection model may be a YOLO v1 model, a YOLO v2 model, a YOLO v3 model, a YOLO v4 model, or the like.
In one specific example, in fig. 3, the YOLO model may include an input layer 31, a backbone network layer 32, a fully connected layer 33, and a convolutional layer 34; the input layer 31 may be used to input the target image; the backbone network layer 32 may be used to extract features of the target image to obtain the image features of the target image; the fully connected layer 33 may be used to obtain the image category corresponding to the target image; and the convolutional layer 34 may be used to obtain the position information of the object on the target image. The number of fully connected layers 33 may be two.
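Reusing the backbone and the two heads sketched above, the wiring of fig. 3 could be approximated as follows; this is an assumed arrangement for illustration and does not reproduce the actual YOLO architecture:

```python
import torch.nn as nn

class SplitHeadDetector(nn.Module):
    """Backbone (32) feeds a fully connected branch (33) and a convolutional branch (34)."""
    def __init__(self, backbone, cls_head, reg_head):
        super().__init__()
        self.backbone = backbone   # backbone network layer 32
        self.cls_head = cls_head   # fully connected layer(s) 33 -> image category
        self.reg_head = reg_head   # convolutional layer(s) 34 -> position information

    def forward(self, target_image):   # input layer 31
        image_features = self.backbone(target_image)
        image_category = self.cls_head(image_features)
        position_info = self.reg_head(image_features)
        return image_category, position_info
```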
In this implementation, the input parameters of the fully connected layer in the YOLO model are generally fixed. In the process of recognizing an image, because the size of the image is not fixed, one of the following processing steps is required to make the size information of the image consistent with the input parameters of the fully connected layer, so as to ensure the accuracy with which the YOLO model recognizes the image:
the size information of the image may be preprocessed in advance so that the size of the preprocessed image is consistent with the input parameters of the fully connected layer; or the input parameters of the fully connected layer may be adjusted so that the adjusted parameters are consistent with the size information of the image.
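A sketch of the first option, preprocessing the image to a fixed size before it reaches the fully connected layer (assuming torchvision transforms; the 224x224 size and the file name are assumptions), could be:

```python
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Resize((224, 224)),  # force a fixed spatial size consistent with the fully connected layer
    T.ToTensor(),
])

image = Image.open("target.jpg")         # hypothetical target image file
tensor = preprocess(image).unsqueeze(0)  # shape (1, 3, 224, 224)
```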
In this implementation, the recognition of the image category of the target image and the position information of the object on the target image can be achieved by a YOLO model.
With further reference to fig. 4, fig. 4 illustrates a flow 400 of one embodiment of an image recognition method according to the present application. The image recognition method comprises the following steps:
step 401, a target image is acquired.
In the present embodiment, the executing body of the image recognition method (e.g., the server 105 shown in fig. 1) may obtain the target image locally or from a terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1); alternatively, the executing body of the image recognition method (e.g., the terminal devices 101, 102, 103 shown in fig. 1) may acquire the target image locally or via a photographing device such as a camera. The target image may comprise characters, animation, pictures, and the like; the target image may be one or several frames of a video, or a single picture.
Step 402, inputting the target image into a backbone network layer of a pre-trained target detection model to obtain the image characteristics of the target image.
And 403, inputting the image characteristics into a full connection layer of the target detection model to obtain the image category corresponding to the target image.
In this embodiment, the executing body may input the image features into the fully connected layer of the target detection model to obtain the image category corresponding to the target image. The fully connected layer may be used to determine the image category of the target image.
Here, the image category may be the category to which the object in the target image belongs; for example, if the category to which the object belongs is person, the category of the target image is person; if the category to which the object belongs is girl, the category of the target image is girl; if the category to which the object belongs is background (e.g., mountain, water, cloud, fog, etc.), the category of the target image is background.
Step 404, inputting the image features into the convolutional layer of the target detection model to obtain the position information of the object on the target image.
In this embodiment, the executing body may input the image features into the convolutional layer of the target detection model to obtain the position information of the object on the target image. The convolutional layer may be used to determine the position information of the object in the target image.
Here, the position information may be an arbitrary position on the target image; for example, in the middle. The objects may include creatures, text, emoticons, icons, pictures (e.g., background, etc.), and the like.
In the present embodiment, the specific operations of steps 401-402 have been described in detail in steps 201-202 of the embodiment shown in fig. 2, and are not described here again.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the image recognition method in this embodiment highlights the steps of classifying the target image and determining the position information of the object on the target image. In the scheme described in this embodiment, the target image is input into the backbone network layer of the pre-trained target detection model to obtain the image features of the target image; the image features are then input into the fully connected layer of the target detection model to obtain the image category corresponding to the target image; and finally, the image features are input into the convolutional layer of the target detection model to obtain the position information of the object on the target image. The category of the target image is obtained by the fully connected layer of the target detection model, and the position information of the object on the target image is obtained by the convolutional layer, so that classification and position regression are split into two independent branches, which can significantly improve the accuracy of image recognition.
For ease of understanding, the following provides an application scenario in which the image recognition method of the embodiment of the present application may be implemented. Take as an example that a server (e.g., server 105 shown in fig. 1) acquires a target image from a terminal device (e.g., terminal devices 101, 102, 103 shown in fig. 1). In fig. 5, a server 502 receives a target image 503 transmitted by a terminal device 501; then, the target image 503 can be input into the backbone network layer 504 of the pre-trained target detection model to obtain the image feature 505 of the target image; then, the image features 505 of the target image may be input into the fully-connected network layer 506 of the target detection model to obtain the image category 507 of the target image; then, the image features 505 of the target image are input into the convolutional network layer 508 of the target detection model to obtain the position information 509 of the object on the target image, wherein the backbone network layer 504 is connected with the fully-connected network layer 506 and the convolutional network layer 508 respectively.
With further reference to FIG. 6, FIG. 6 is a flow 600 of one embodiment of training a target detection model in an image recognition method according to the present application. As shown in fig. 6, in this embodiment, the training step of training the target detection model includes:
step 601, a training sample set is obtained, wherein training samples in the training sample set include sample images and corresponding sample information labels.
In the present embodiment, the executing body of the training step may be the same as or different from the executing body of the image recognition method. If they are the same, the executing body of the training step may store the model structure information and the parameter values of the trained target detection model locally after training. If they are different, the executing body of the training step may send the model structure information and the parameter values of the trained target detection model to the executing body of the image recognition method after training.
In this embodiment, the executing subject of the training step may acquire the training sample set in various ways. For example, the training sample set stored therein may be obtained from a database server through a wired connection or a wireless connection. As another example, the training sample set may be collected by a terminal device (e.g., terminal devices 101, 102, 103 shown in fig. 1). The training samples in the training sample set may include sample images and corresponding sample information labels. The sample image may be a single picture or one or several frames of images in a video. The sample information tag can be used for labeling the sample image, for example, the sample information tag can include a sample position information tag and/or a sample image category tag, wherein the sample position information tag can be used for labeling the position information of the object in the sample image, and the sample image category tag can be used for labeling the image category of the sample image.
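A minimal sketch of one possible in-memory representation of such a training sample, with the sample image and its two kinds of sample information labels (plain Python; the field names and coordinate convention are assumptions for illustration), could be:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrainingSample:
    image_path: str                                            # the sample image
    category_label: int                                        # sample image category label
    position_labels: List[Tuple[float, float, float, float]]   # sample position information labels (boxes)

sample = TrainingSample("sample_0001.jpg", category_label=3,
                        position_labels=[(40.0, 60.0, 120.0, 180.0)])
```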
Step 602, taking the sample image as the input of the target detection model, taking the sample information label as the output of the target detection model, training the initial model, and obtaining the target detection model.
In this embodiment, after obtaining the sample images and the sample information labels, the executing body may train the initial model with them to obtain the target detection model. During training, the executing body may use the sample image as the input of the target detection model and the corresponding sample information label as the expected output, so as to obtain the target detection model. The initial model may be a probability model, a classification model, or another classifier in the prior art or in future technology; for example, the initial model may include any one of the following: an Extreme Gradient Boosting tree model (XGBoost), a logistic regression model (LR), a deep neural network model (DNN), or a Gradient Boosting Decision Tree model (GBDT).
According to the method provided by this embodiment of the application, training is performed based on the sample images and the sample information labels to obtain the target detection model, so that accurate recognition of the image category of the sample image and/or the position information of the object on the sample image is achieved.
In some optional implementations of this embodiment, training the initial model by taking the sample image as the input of the target detection model and the sample information label as the expected output to obtain the target detection model includes performing the following training steps for the training samples in the training sample set: inputting the sample image of a training sample into the backbone network layer of the initial model to obtain the image features of the sample image; inputting the image features of the sample image into the fully connected network layer of the initial model to obtain the sample image category; inputting the image features of the sample image into the convolutional network layer of the initial model to obtain the sample position information; generating image information of the sample image based on the sample image category and the sample position information; determining a total loss function value based on the image information of the sample image and the sample information label; in response to the total loss function value satisfying a target value, taking the initial model as the target detection model; and in response to the total loss function value not satisfying the target value, continuing to perform the training steps. The iteration is repeated until the target detection model is trained.
In this implementation, the executing body of the training step may input the sample images of the training samples in the training sample set into the backbone network layer of the initial model. By performing feature extraction on the sample image of a training sample, the image features of the sample image can be obtained. Here, the initial model typically includes a backbone network layer, a fully connected network layer, and a convolutional network layer. The backbone network layer of the initial model may be used to extract features from the sample image. The fully connected network layer of the initial model may be used to determine the image category of the sample image. The convolutional network layer of the initial model may be used to determine the position information of the object on the sample image.
Here, the initial model may be any of various existing neural network models created based on machine learning techniques, with various existing network structures (e.g., VGGNet (Visual Geometry Group Network), ResNet (Residual Neural Network), etc.).
In this implementation, the executing body of the training step may input the image features of the sample image into the fully connected network layer of the initial model to obtain the sample image category, and may input the image features of the sample image into the convolutional network layer of the initial model to obtain the sample position information.
In this implementation, the loss function is generally used to measure the degree of inconsistency between the predicted value of the model and the true value (e.g., the sample information label). In general, the smaller the loss function value, the better the robustness of the model. The loss function may be set according to actual requirements; for example, it may include a cross-entropy loss function.
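One possible way to combine the two branches into a total loss function value (assuming PyTorch; cross-entropy for the image category plus an L1 term for the position information, with an assumed 1:1 weighting) is sketched below:

```python
import torch.nn.functional as F

def total_loss(pred_category, category_label, pred_position, position_label):
    """Total loss = classification loss + position-regression loss (assumed weighting)."""
    cls_loss = F.cross_entropy(pred_category, category_label)  # image category branch
    reg_loss = F.l1_loss(pred_position, position_label)        # position information branch
    return cls_loss + reg_loss
```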
In this implementation, the executing body of the training step may compare the total loss function value with a preset target value and determine, according to the comparison result, whether training of the initial model is complete. If the total loss function value satisfies the preset target value, the executing body may take the initial model as the target detection model. The target value may generally be used to indicate an acceptable degree of inconsistency between the predicted value and the true value; that is, when the total loss function value reaches the target value, the predicted value may be considered close to or approximating the true value. The target value may be set according to actual requirements.
In this implementation, the executing body of the training step may continue to perform the above training steps when the total loss function value does not satisfy the target value.
In this implementation, the image features of the sample image are obtained using the backbone network layer of the initial model; the sample image category and the sample position information are then obtained based on the fully connected network layer and the convolutional network layer of the initial model; next, the image information of the sample image is generated based on the sample image category and the sample position information; a total loss function value is then determined based on the image information of the sample image and the sample information label; and finally, training of the initial model is carried out based on the total loss function value and the target value to obtain the target detection model, thereby enabling accurate recognition of the image category and the position information included in the image information of the target image.
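A condensed sketch of this training step (assuming PyTorch, the SplitHeadDetector and total_loss sketches above, and a data loader yielding sample images with their category and position labels; the optimizer, learning rate, and target value are assumptions) could be:

```python
import torch

def train(model, data_loader, target_value=0.05, max_epochs=50, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for sample_image, category_label, position_label in data_loader:
            pred_category, pred_position = model(sample_image)
            loss = total_loss(pred_category, category_label, pred_position, position_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= max(len(data_loader), 1)
        if epoch_loss <= target_value:  # total loss function value satisfies the target value
            break                       # take the trained model as the target detection model
    return model
```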
The method provided by the above embodiment of the present application determines whether the initial model is trained completely according to the comparison result between the total loss function value and the target value, and when the total loss function value reaches the target value, the predicted value may be considered to be close to or approximate to the true value, and at this time, the initial model may be determined as the target detection model. The robustness of the model generated in this way is high.
In some optional implementations of this embodiment, the sample information tag may include: a sample image category label and/or a sample location information label.
In this implementation, the sample image category label may be used to label the image category of the sample image, and the sample position information label may be used to label the position information of the object on the sample image.
In the implementation manner, accurate identification of the image can be realized through the sample image category label and/or the sample position information label.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an image recognition apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the image recognition apparatus 700 of the present embodiment may include: an image acquisition module 701, a first obtaining module 702, a second obtaining module 703, and a third obtaining module 704. The image acquisition module 701 is configured to acquire a target image; the first obtaining module 702 is configured to input the target image into a backbone network layer of a pre-trained target detection model to obtain image features of the target image; the second obtaining module 703 is configured to input the image features into a fully connected network layer of the target detection model to obtain an image category corresponding to the target image; and the third obtaining module 704 is configured to input the image features into a convolutional network layer of the target detection model to obtain position information of an object on the target image.
In the image recognition apparatus 700 of the present embodiment, the specific processing of the image acquisition module 701, the first obtaining module 702, the second obtaining module 703, and the third obtaining module 704, and the technical effects thereof, may refer to the related descriptions of steps 201-204 in the embodiment corresponding to fig. 2, and are not described here again. The first obtaining module, the second obtaining module, and the third obtaining module may be the same module or different modules.
In some alternative implementations of this embodiment, the fully connected network includes at least one fully connected layer.
In some alternative implementations of this embodiment, the convolutional network includes at least one convolutional network layer.
In some optional implementations of this embodiment, the target detection model is: the YOLO model.
In some optional implementations of this embodiment, the image recognition apparatus further includes: a sample acquisition module (not shown in the figure) configured to acquire a training sample set, wherein training samples in the training sample set include sample images and corresponding sample information labels; and a model training module (not shown in the figure) configured to train the initial model by taking the sample image as an input of the target detection model and taking the sample information label as an output of the target detection model, so as to obtain the target detection model.
In some optional implementations of this embodiment, the sample information tag includes: a sample image category label and/or a sample location information label.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the methods and processes described above, such as the image recognition method. For example, in some embodiments, the image recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. An image recognition method, comprising:
acquiring a target image;
inputting the target image into a backbone network layer of a pre-trained target detection model to obtain the image characteristics of the target image;
inputting the image characteristics into a full-connection network layer of the target detection model to obtain an image category corresponding to the target image;
and inputting the image characteristics into a convolution network layer of the target detection model to obtain the position information of the object in the target image.
2. The method of claim 1, wherein the fully connected network comprises at least one fully connected layer.
3. The method of claim 1 or 2, wherein the convolutional network comprises at least one convolutional network layer.
4. The method of claim 1, wherein the target detection model is trained based on:
acquiring a training sample set, wherein training samples in the training sample set comprise sample images and corresponding sample information labels;
and taking the sample image as the input of the target detection model, taking the sample information label as the output of the target detection model, and training an initial model to obtain the target detection model.
5. The method of claim 4, wherein the sample information label comprises: a sample image category label and/or a sample location information label.
6. An image recognition apparatus comprising:
an image acquisition module configured to acquire a target image;
a first obtaining module configured to input the target image into a backbone network layer of a pre-trained target detection model to obtain an image feature of the target image;
a second obtaining module configured to input the image features into a fully-connected network layer of the target detection model to obtain an image category corresponding to the target image;
a third obtaining module configured to input the image feature into a convolutional network layer of the target detection model, and obtain position information of an object in the target image thereon.
7. The apparatus of claim 6, wherein the fully connected network comprises at least one fully connected layer.
8. The apparatus of claim 6 or 7, wherein the convolutional network comprises at least one convolutional network layer.
9. The apparatus of claim 6, the apparatus further comprising:
a sample acquisition module configured to acquire a training sample set, wherein training samples in the training sample set include sample images and corresponding sample information labels;
and the model training module is configured to train an initial model by taking the sample image as the input of the target detection model and the sample information label as the output of the target detection model, so as to obtain the target detection model.
10. The apparatus of claim 9, wherein the sample information tag comprises: a sample image category label and/or a sample location information label.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202110277135.2A 2021-03-15 2021-03-15 Image recognition method, apparatus, device, medium, and program product Pending CN113011309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110277135.2A CN113011309A (en) 2021-03-15 2021-03-15 Image recognition method, apparatus, device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110277135.2A CN113011309A (en) 2021-03-15 2021-03-15 Image recognition method, apparatus, device, medium, and program product

Publications (1)

Publication Number Publication Date
CN113011309A true CN113011309A (en) 2021-06-22

Family

ID=76407402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110277135.2A Pending CN113011309A (en) 2021-03-15 2021-03-15 Image recognition method, apparatus, device, medium, and program product

Country Status (1)

Country Link
CN (1) CN113011309A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN111160379A (en) * 2018-11-07 2020-05-15 北京嘀嘀无限科技发展有限公司 Training method and device of image detection model and target detection method and device
CN110096964A (en) * 2019-04-08 2019-08-06 厦门美图之家科技有限公司 A method of generating image recognition model
CN111862067A (en) * 2020-07-28 2020-10-30 中山佳维电子有限公司 Welding defect detection method and device, electronic equipment and storage medium
CN112164038A (en) * 2020-09-16 2021-01-01 上海电力大学 Photovoltaic hot spot detection method based on deep convolutional neural network
CN112200187A (en) * 2020-10-16 2021-01-08 广州云从凯风科技有限公司 Target detection method, device, machine readable medium and equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361524A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Image processing method and device
CN113361524B (en) * 2021-06-29 2024-05-03 北京百度网讯科技有限公司 Image processing method and device
CN113963148A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Object detection method, and training method and device of object detection model
CN113963148B (en) * 2021-10-29 2023-08-08 北京百度网讯科技有限公司 Object detection method, object detection model training method and device
CN116206131A (en) * 2023-03-16 2023-06-02 北京百度网讯科技有限公司 Image processing method, training method and device for deep learning model
CN116206131B (en) * 2023-03-16 2023-09-19 北京百度网讯科技有限公司 Image processing method, training method and device for deep learning model

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
CN108171260B (en) Picture identification method and system
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN112560874A (en) Training method, device, equipment and medium for image recognition model
CN111598164A (en) Method and device for identifying attribute of target object, electronic equipment and storage medium
CN112862005B (en) Video classification method, device, electronic equipment and storage medium
CN113313053B (en) Image processing method, device, apparatus, medium, and program product
CN113657269A (en) Training method and device for face recognition model and computer program product
CN113177449B (en) Face recognition method, device, computer equipment and storage medium
CN113011309A (en) Image recognition method, apparatus, device, medium, and program product
CN112861885A (en) Image recognition method and device, electronic equipment and storage medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN112926308A (en) Method, apparatus, device, storage medium and program product for matching text
CN112507833A (en) Face recognition and model training method, device, equipment and storage medium
CN112650885A (en) Video classification method, device, equipment and medium
CN113627361B (en) Training method and device for face recognition model and computer program product
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN112949433B (en) Method, device and equipment for generating video classification model and storage medium
CN112837466B (en) Bill recognition method, device, equipment and storage medium
CN113592932A (en) Training method and device for deep completion network, electronic equipment and storage medium
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN115457329A (en) Training method of image classification model, image classification method and device
CN111259698A (en) Method and device for acquiring image
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination