CN111738280A - Image identification method, device, equipment and readable storage medium - Google Patents

Image identification method, device, equipment and readable storage medium

Info

Publication number
CN111738280A
Authority
CN
China
Prior art keywords
convolution
image
feature
pixel
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010604125.0A
Other languages
Chinese (zh)
Inventor
诸加丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Wuhan Co Ltd filed Critical Tencent Technology Wuhan Co Ltd
Priority to CN202010604125.0A
Publication of CN111738280A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an image recognition method, apparatus, device and readable storage medium. The method includes: acquiring a target image, inputting the target image into an image recognition model, and extracting image features of the target image in the image recognition model; respectively inputting the image features into an image classification component and a position prediction component in the image recognition model; in the image classification component, performing convolution processing on the image features with a first convolution parameter to obtain a classification convolution feature map; in the position prediction component, performing convolution processing on the image features with a second convolution parameter to obtain a position convolution feature map; and outputting the object class of the target object in the target image according to the classification convolution feature map, and outputting the position information of the target object in the target image according to the position convolution feature map. With the method and apparatus of the present application, image recognition efficiency can be improved.

Description

Image identification method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image recognition method, apparatus, device and readable storage medium.
Background
Currently, determining the class of an article contained in an image and framing that article as a target must be implemented by different recognition networks: class determination of the article can be performed by a classification network, while target framing can be realized by an object detection network.
In the prior art, the method of determining the article class and framing the target is implemented by two different recognition networks; the image features of the image must be extracted separately in each of the two recognition networks before each network can perform its own convolution processing to obtain, respectively, the article classification result and the predicted article boundary. That is to say, for the same image, feature extraction must be performed twice or more across the two recognition networks, which brings a large amount of computation and affects recognition efficiency.
Disclosure of Invention
The embodiments of the present application provide an image recognition method, apparatus, device and readable storage medium, which can improve image recognition efficiency.
An embodiment of the present application provides an image recognition method, including:
acquiring a target image, inputting the target image into an image recognition model, and extracting image features of the target image in the image recognition model;
respectively inputting the image features into an image classification component and a position prediction component in the image recognition model; the image classification component comprises a first convolution parameter that focuses on object classification, and the position prediction component comprises a second convolution parameter that focuses on position information prediction;
in the image classification component, performing convolution processing on the image features with the first convolution parameter to obtain a classification convolution feature map;
in the position prediction component, performing convolution processing on the image features with the second convolution parameter to obtain a position convolution feature map;
and outputting the object class of the target object in the target image according to the classification convolution feature map, and outputting the position information of the target object in the target image according to the position convolution feature map.
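For ease of understanding, the following is a minimal, illustrative PyTorch-style sketch of the structure described above: the image features are extracted once by a shared backbone and then fed to an image classification component and a position prediction component that hold independent convolution parameters. The layer sizes, channel counts and module names are assumptions chosen for the example and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class ImageRecognitionModel(nn.Module):
    """Shared feature extraction plus two independent convolution heads (sketch)."""
    def __init__(self, num_classes=3, boxes_per_cell=2):
        super().__init__()
        # Shared backbone: the image features are computed only once.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Image classification component: first convolution parameter, predicting
        # [object existence probability, class prediction probabilities] per prediction box.
        self.cls_head = nn.Conv2d(64, boxes_per_cell * (1 + num_classes), 1)
        # Position prediction component: second convolution parameter, predicting
        # [position offsets, prediction frame width, prediction frame height] per prediction box.
        self.loc_head = nn.Conv2d(64, boxes_per_cell * 4, 1)

    def forward(self, target_image):
        image_features = self.backbone(target_image)        # extracted once, then shared
        classification_map = self.cls_head(image_features)  # classification convolution feature map
        position_map = self.loc_head(image_features)        # position convolution feature map
        return classification_map, position_map

model = ImageRecognitionModel()
cls_map, loc_map = model(torch.randn(1, 3, 416, 416))
```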
An aspect of an embodiment of the present application provides an image recognition apparatus, including:
the image acquisition module is used for acquiring a target image and inputting the target image into the image recognition model;
the feature extraction module is used for extracting the image features of the target image in the image recognition model;
the feature input module is used for respectively inputting the image features into an image classification component and a position prediction component in the image recognition model; the image classification component comprises a first convolution parameter that focuses on object classification, and the position prediction component comprises a second convolution parameter that focuses on position information prediction;
the first feature convolution module is used for performing convolution processing on the image features with the first convolution parameter in the image classification component to obtain a classification convolution feature map;
the second feature convolution module is used for performing convolution processing on the image features with the second convolution parameter in the position prediction component to obtain a position convolution feature map;
the class output module is used for outputting the object class of the target object in the target image according to the classification convolution feature map;
and the position output module is used for outputting the position information of the target object in the target image according to the position convolution feature map.
The target image comprises K grid pixels obtained by image division, the image recognition model comprises a residual network, and the residual network comprises a first feature extraction unit, a second feature extraction unit, a third feature extraction unit and a convolution layer; K is an integer greater than 1;
the feature extraction module includes:
the first pixel feature output unit is used for inputting the target image into the first feature extraction unit, and extracting the pixel feature of each grid pixel in the target image in the first feature extraction unit to obtain K first pixel features;
the second pixel feature output unit is used for inputting the K first pixel features into the second feature extraction unit, and performing convolution processing on each first pixel feature in the second feature extraction unit to obtain K second pixel features;
the third pixel feature output unit is used for inputting the K second pixel features into the third feature extraction unit, and performing convolution processing on each second pixel feature in the third feature extraction unit to obtain K third pixel features;
the target pixel feature selection unit is used for selecting K target pixel features from the K first pixel features, the K second pixel features and the K third pixel features;
the feature convolution unit is used for inputting the K target pixel features into the convolution layer, and performing convolution processing on the K target pixel features in the convolution layer to obtain convolution pixel features;
and the image feature generation unit is used for generating the image features of the target image according to the convolution pixel features.
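For ease of understanding, a minimal sketch of the residual backbone described above, with three feature extraction units producing the first, second and third pixel features at successively smaller resolutions. The block structure, strides and channel counts are assumptions made for the example, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block; the exact block structure is assumed for this sketch."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class ResidualBackbone(nn.Module):
    """Three feature extraction units producing the first, second and third pixel features."""
    def __init__(self):
        super().__init__()
        self.unit1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), ResidualBlock(32))
        self.unit2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), ResidualBlock(64))
        self.unit3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), ResidualBlock(128))

    def forward(self, target_image):
        first_pixel_features = self.unit1(target_image)           # finest resolution
        second_pixel_features = self.unit2(first_pixel_features)  # further convolved
        third_pixel_features = self.unit3(second_pixel_features)  # coarsest resolution
        return first_pixel_features, second_pixel_features, third_pixel_features

backbone = ResidualBackbone()
f1, f2, f3 = backbone(torch.randn(1, 3, 416, 416))
```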
Wherein, the apparatus further includes:
the scale acquisition module is used for acquiring an original image and acquiring an image scaling ratio;
the image scaling module is used for scaling the original image according to the image scaling ratio to obtain a transition image;
and the image division module is used for performing image division on the transition image to obtain a target image comprising K grid pixels.
Wherein the K target pixel features are K third pixel features; the convolution layer includes a first convolution layer;
the feature convolution unit includes:
the first convolution subunit is used for inputting the K third pixel characteristics into the first convolution layer, and performing convolution processing on each third pixel characteristic in the first convolution layer to obtain K initial convolution pixel characteristics;
and the first downsampling subunit is used for acquiring a first downsampling multiple corresponding to the third feature extraction unit, and downsampling the K initial convolution pixel features according to the first downsampling multiple to obtain the convolution pixel features.
Wherein the convolution layer comprises a first convolution layer and a second convolution layer;
the feature convolution unit includes:
the second convolution subunit is used for inputting the K third pixel characteristics into the first convolution layer, and performing convolution processing on each third pixel characteristic in the first convolution layer to obtain K initial convolution pixel characteristics;
the second downsampling subunit is used for acquiring a first downsampling multiple corresponding to the third feature extraction unit, and downsampling the K initial convolution pixel features according to the first downsampling multiple to obtain first downsampled pixel features;
the third downsampling subunit is used for acquiring a second downsampling multiple corresponding to the second feature extraction unit, and downsampling the K second pixel features according to the second downsampling multiple to obtain second downsampled pixel features;
the up-sampling sub-unit is used for determining a first image up-sampling multiple according to the first down-sampling multiple and the second down-sampling multiple, and up-sampling the first down-sampling pixel characteristic according to the first image up-sampling multiple to obtain an up-sampling pixel characteristic;
and the feature splicing subunit is used for splicing the second down-sampling pixel features and the up-sampling pixel features to obtain first splicing features if the K target pixel features are K second pixel features, inputting the first splicing features into a second convolutional layer, and performing convolution processing on the first splicing features in the second convolutional layer to obtain convolutional pixel features.
Wherein the convolutional layer further comprises a third convolutional layer;
the apparatus further includes:
the first feature splicing module is used for splicing the second down-sampling pixel feature and the up-sampling pixel feature to obtain a first splicing feature if the K target pixel features are K first pixel features, inputting the first splicing feature to a second convolution layer, and performing convolution processing on the first splicing feature in the second convolution layer to obtain a convolution splicing feature;
the feature downsampling module is used for acquiring a third downsampling multiple corresponding to the first feature extraction unit, and downsampling the K first pixel features according to the third downsampling multiple to obtain a third downsampling pixel feature;
the characteristic up-sampling module is used for determining a second image up-sampling multiple according to a third down-sampling multiple and a second down-sampling multiple, and up-sampling the convolution splicing characteristic according to the second image up-sampling multiple to obtain an up-sampling splicing characteristic;
and the second feature splicing module is used for splicing the third down-sampling pixel feature and the up-sampling splicing feature to obtain a second splicing feature, inputting the second splicing feature into a third convolution layer, and performing convolution processing on the second splicing feature in the third convolution layer to obtain a convolution pixel feature.
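For ease of understanding, the following simplified sketch shows how the convolution layers, down-sampling and up-sampling described above can be combined to fuse the three pixel-feature levels into the convolution pixel features. It is a schematic top-down fusion under an assumed scale factor of 2 between adjacent levels; the exact down-sampling multiples and layer shapes are not taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Top-down fusion of the three pixel-feature levels (illustrative sketch)."""
    def __init__(self, c1=32, c2=64, c3=128):
        super().__init__()
        self.conv_layer_1 = nn.Conv2d(c3, c2, 1)      # "first convolution layer"
        self.conv_layer_2 = nn.Conv2d(c2 * 2, c1, 1)  # "second convolution layer"
        self.conv_layer_3 = nn.Conv2d(c1 * 2, c1, 1)  # "third convolution layer"

    def forward(self, first_feat, second_feat, third_feat):
        # Convolve the third pixel features, then upsample by the ratio between
        # the down-sampling multiples so the result matches the second level.
        initial = self.conv_layer_1(third_feat)
        up = F.interpolate(initial, scale_factor=2, mode='nearest')
        spliced_1 = self.conv_layer_2(torch.cat([second_feat, up], dim=1))
        # Upsample the spliced features again and fuse them with the first level.
        up2 = F.interpolate(spliced_1, scale_factor=2, mode='nearest')
        spliced_2 = self.conv_layer_3(torch.cat([first_feat, up2], dim=1))
        return spliced_2  # convolution pixel features used as the shared image features

fusion = FeatureFusion()
out = fusion(torch.randn(1, 32, 208, 208),   # first pixel features
             torch.randn(1, 64, 104, 104),   # second pixel features
             torch.randn(1, 128, 52, 52))    # third pixel features
```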
Wherein, the category output module includes:
the existence probability acquiring unit is used for acquiring classification characteristic points in the classification convolution characteristic diagram and acquiring object existence probabilities corresponding to the classification characteristic points; the object existence probability refers to the probability that the target object exists in the prediction frame to which the classification characteristic point belongs; the prediction frame is used for predicting the position of the target object in the target image;
a prediction probability obtaining unit for obtaining a category prediction probability corresponding to the classification feature point;
the class determining unit is used for acquiring the maximum class prediction probability from the class prediction probability if the object existence probability is greater than the probability threshold, and determining the class corresponding to the maximum class prediction probability as the object class predicted by the classification feature point;
the class determination unit is further configured to determine the object class predicted by the classification feature point as the object class of the prediction frame to which the classification feature point belongs.
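For ease of understanding, a small sketch of the decision rule described in this module: a classification feature point's class is only read out when the object existence probability of its prediction frame exceeds the probability threshold. The threshold value 0.5 and the example numbers follow the scenario described with reference to fig. 2a; everything else is assumed for illustration.

```python
import torch

def decode_classification(objectness, class_probs, probability_threshold=0.5):
    """For one classification feature point: the class is only read out when an
    object is predicted to exist in the corresponding prediction box."""
    results = []
    for p_obj, probs in zip(objectness, class_probs):
        if p_obj > probability_threshold:
            max_prob, class_index = torch.max(probs, dim=0)
            results.append((class_index.item(), max_prob.item()))
        else:
            results.append(None)  # prediction box filtered out, class probabilities ignored
    return results

# Example values matching the fig. 2a scenario described later in the text:
# box h1 has objectness 0.10 (filtered out), box e1 has objectness 0.98 and class
# probabilities [0.01, 0.97, 0.03], so class index 1 with probability 0.97 is kept.
objectness = torch.tensor([0.10, 0.98])
class_probs = torch.tensor([[0.01, 0.03, 0.98],
                            [0.01, 0.97, 0.03]])
print(decode_classification(objectness, class_probs))
```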
Wherein, the position output module includes:
the position parameter acquisition unit is used for acquiring position feature points in the position convolution feature map and acquiring predicted position parameters corresponding to the position feature points;
the pixel block acquisition unit is used for acquiring a grid pixel block corresponding to the position feature point from the K grid pixels; the position feature points are obtained by convolving the grid pixel blocks, and the size of each grid pixel block is determined by the downsampling multiple corresponding to the K target pixel features;
a central coordinate obtaining unit for obtaining a central position coordinate of the grid pixel block;
and the position information determining unit is used for determining the position information of the target object according to the predicted position parameters and the central position coordinates.
The predicted position parameters comprise position offset, width of a predicted frame and height of the predicted frame; the prediction frame is used for predicting the position of the target object in the target image;
a location information determination unit comprising:
the central position determining subunit is used for determining central position information corresponding to the prediction frame according to the position offset and the central position coordinate;
and the position information determining subunit is used for determining the position information of the prediction frame in the target image according to the central position information, the width of the prediction frame and the height of the prediction frame, and determining the position information of the prediction frame in the target image as the position information of the target object.
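For ease of understanding, a small sketch of the position decoding described in this module: the predicted position offset is added to the centre coordinate of the grid pixel block to obtain the centre of the prediction frame, and the predicted width and height then give the frame's position in the target image. The exact parameterisation (e.g., whether offsets are normalised) is not specified in the text, so plain addition is assumed here.

```python
def decode_box(offset_x, offset_y, box_w, box_h, cell_cx, cell_cy):
    """Turn the predicted position parameters of one position feature point into
    box coordinates in the target image (illustrative parameterisation)."""
    center_x = cell_cx + offset_x      # centre position information of the prediction frame
    center_y = cell_cy + offset_y
    x_min = center_x - box_w / 2       # convert centre/width/height to corner coordinates
    y_min = center_y - box_h / 2
    x_max = center_x + box_w / 2
    y_max = center_y + box_h / 2
    return x_min, y_min, x_max, y_max

# A 4 x 4 grid pixel block whose top-left corner is at (8, 0) has centre (10, 2).
print(decode_box(0.5, -0.3, 3.0, 2.0, cell_cx=10.0, cell_cy=2.0))
```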
Wherein, the apparatus further includes:
the sample acquisition module is used for acquiring an image sample, inputting the image sample into the sample image recognition model, and extracting the sample features of the image sample in the sample image recognition model;
the sample input module is used for respectively inputting the sample features into a sample image classification component and a sample position prediction component of the sample image recognition model; the sample image classification component comprises a first sample convolution parameter that focuses on object classification, and the sample position prediction component comprises a second sample convolution parameter that focuses on position information prediction;
the classification convolution module is used for performing convolution processing on the sample features with the first sample convolution parameter in the sample image classification component to obtain a sample classification convolution feature map;
the position convolution module is used for performing convolution processing on the sample features with the second sample convolution parameter in the sample position prediction component to obtain a sample position convolution feature map;
the result determination module is used for determining the predicted object class of the sample object in the image sample according to the sample classification convolution feature map, and determining the predicted position information of the sample object according to the sample position convolution feature map;
the loss value determining module is used for acquiring an object class label of the sample object, acquiring position label information of the sample object, and determining a total loss value according to the predicted object class, the object class label, the predicted position information and the position label information;
the parameter adjusting module is used for respectively adjusting the first sample convolution parameter and the second sample convolution parameter according to the total loss value to obtain a first convolution parameter corresponding to the first sample convolution parameter and a second convolution parameter corresponding to the second sample convolution parameter;
and the model determining module is used for determining a sample image classification component comprising the first convolution parameter as an image classification component, determining a sample position prediction component comprising the second convolution parameter as a position prediction component, and determining a sample image recognition model comprising the image classification component and the position prediction component as an image recognition model.
Wherein the loss value determination module comprises:
a classification loss determination unit for determining a classification loss value according to the predicted object class and the object class label;
a position loss determination unit for determining a position loss value based on the position tag information and the predicted position information;
and the loss value generating unit is used for generating a total loss value according to the classification loss value and the position loss value.
Wherein, the parameter adjustment module includes:
the first derivation unit is used for determining a first partial derivative between the total loss value and the prediction object category, and adjusting the first sample convolution parameter according to the first partial derivative to obtain a first convolution parameter;
and the second derivation unit is used for determining a second partial derivative between the total loss value and the predicted position information, and adjusting the second sample convolution parameter according to the second partial derivative to obtain a second convolution parameter.
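For ease of understanding, a minimal training-step sketch matching the flow above: a classification loss value and a position loss value are combined into a total loss value, and back-propagation of that single total loss adjusts the first and second sample convolution parameters together. The concrete loss functions (cross-entropy and mean squared error), the omission of the object-existence term, and the tensor layout are assumptions made for the example.

```python
import torch.nn as nn

# Assumed loss choices for this sketch; the text only states that a classification
# loss value and a position loss value are combined into a total loss value.
classification_loss_fn = nn.CrossEntropyLoss()
position_loss_fn = nn.MSELoss()

def training_step(model, optimizer, image_sample, class_labels, box_labels):
    """One update step: both heads are adjusted from the same total loss.

    Assumed shapes: the model returns a classification map (N, C, H, W) of class
    logits and a position map (N, 4, H, W); class_labels is (N*H*W,) long and
    box_labels is (N*H*W, 4). The object-existence term is omitted for brevity.
    """
    cls_map, loc_map = model(image_sample)
    pred_classes = cls_map.permute(0, 2, 3, 1).reshape(-1, cls_map.shape[1])
    pred_boxes = loc_map.permute(0, 2, 3, 1).reshape(-1, loc_map.shape[1])

    classification_loss = classification_loss_fn(pred_classes, class_labels)
    position_loss = position_loss_fn(pred_boxes, box_labels)
    total_loss = classification_loss + position_loss   # total loss value

    optimizer.zero_grad()
    total_loss.backward()   # partial derivatives w.r.t. both heads' parameters
    optimizer.step()        # adjusts the first and second sample convolution parameters
    return total_loss.item()
```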
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing data communication functions, the memory is used for storing computer programs, and the processor is used for calling the computer programs to execute the method in one aspect of the embodiment of the application.
An aspect of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform a method in an aspect of an embodiment of the present application.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
In the embodiment of the application, after the image features of a target image are extracted, the image features are respectively input into an image classification component and a position prediction component, convolution processing is performed on the image features through a first convolution parameter in the image classification component, a classification convolution feature map can be obtained, and then the object class of a target object in the target image can be output according to the classification convolution feature map; meanwhile, the image features are convoluted through the second convolution parameters in the position prediction component, a position convolution feature map can be obtained, and the position information of the target object in the target image can be output according to the position convolution feature map. Because the first convolution parameter in the image classification component focuses on object classification, the second convolution parameter in the position prediction component focuses on position information prediction, and the first convolution parameter and the second convolution parameter are independent and have different respective focus points, the classification accuracy can be improved when the first convolution parameter is used for carrying out image classification on image features, and the accuracy of position information prediction can be improved when the second convolution parameter is used for carrying out position information prediction on the image features; meanwhile, because the image classification component and the position prediction component are both contained in the image recognition model, the image recognition model extracts the image features of the target image once, and the image features are respectively input into the image classification component and the position prediction component, so that the sharing of the image features can be realized, the calculated amount can be reduced, and the recognition efficiency can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a diagram of a network architecture provided by an embodiment of the present application;
fig. 2a is a schematic view of a scenario provided by an embodiment of the present application;
FIG. 2b is a diagram illustrating a correspondence between grid pixel blocks and classification feature points according to an embodiment of the present application;
fig. 2c is a schematic diagram of a grid pixel block and a corresponding position feature point according to an embodiment of the present application;
fig. 2d is an application scenario diagram provided in the embodiment of the present application;
fig. 2e is an application scenario diagram provided in the embodiment of the present application;
fig. 3 is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of image feature extraction provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of determining a classification convolution feature map and a location convolution feature map according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of model training provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The scheme provided by the embodiments of the present application relates to computer vision (CV) and machine learning (ML) technologies in the field of artificial intelligence.
Computer Vision (CV) is a science that studies how to make a machine "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets, and performs further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
Referring to fig. 1, fig. 1 is a diagram of a network architecture provided by an embodiment of the present application. As shown in fig. 1, the network architecture may include a service server 1000 and a background server cluster, where the background server cluster may include a plurality of background servers; specifically, the network architecture may include a background server 100a, a background server 100b, a background server 100c, ..., and a background server 100n. As shown in fig. 1, the background server 100a, the background server 100b, the background server 100c, ..., and the background server 100n may each be connected to the service server 1000 through a network, so that each background server may perform data interaction with the service server 1000 through the network connection, and the service server 1000 may receive service data from each background server.
Each background server shown in fig. 1 corresponds to a user terminal, and may be configured to store service data of the corresponding user terminal. Each user terminal may be integrally installed with a target application, and when the target application runs in each user terminal, the background server corresponding to each user terminal may store service data in the application and perform data interaction with the service server 1000 shown in fig. 1. The target application may include an application having a function of displaying data information such as text, images, audio, and video. For example, the application may be an image recognition application, such as a watermark recognition application, which may be used for a user to upload a picture and view a sensitive watermark or violation marker of the picture; the application may also be an image classification application that may be used for a user to upload a picture and view a class of objects (e.g., elephant, tiger, lark, etc.) to which objects (e.g., people, animals, etc.) contained in the picture belong; the application can also be a video detection application, and can be used for uploading videos by users and detecting whether violation marks or sensitive watermarks exist in video frames of the videos. The service server 1000 in the present application may collect service data from the background of the applications (such as the background server cluster described above), for example, the service data may be images uploaded by users or videos uploaded by users. Based on the collected service data, the service server 1000 may determine the object class to which the object included in the service data belongs and the position information of the object in the service data (image or video). Further, the service server 1000 may send the object type and the location information to a background server, so that the user may determine, through a user terminal corresponding to the background server, the object type to which the object in the video frame of the image or the video belongs, and view the location information of the object in the video frame of the image or the video, and the user may perform subsequent processing according to the viewed object type and the location information. If the user uploads the video a, the service server 1000 detects that a sensitive watermark exists at a certain position of the video frame a of the video a, the service server 1000 may return the detection result to a background server corresponding to the user terminal of the user, and the user can view the detection result on a display page of the user terminal, and then the user may delete the sensitive watermark in the video a through the position information (at the position of the video frame a in the video a) and the object type (sensitive watermark) returned by the service server 1000, so as to ensure that the video a is legal.
In the embodiment of the present application, one user terminal may be selected from a plurality of user terminals as a target user terminal, and the target user terminal may include smart terminals with data information display and playing functions, such as smart phones, tablet computers and desktop computers. For example, in the embodiment of the present application, the user terminal corresponding to the background server 100a shown in fig. 1 may be used as the target user terminal, and the target application may be integrated in the target user terminal; at this time, the background server 100a corresponding to the target user terminal may perform data interaction with the service server 1000.
For example, when a user uses a target application (e.g., a video detection application) in a user terminal, the service server 1000 may detect and collect a video uploaded by the user through a background server corresponding to the user terminal, the service server 1000 may determine whether a target object (e.g., a person, an animal, a sensitive watermark, etc.) exists in the video, and if a target object exists in the video, the service server 1000 determines an object type to which the target object belongs and location information of the target object in the video, and returns the object type and the location information to the background server, so that the user may view the object type and the location information on a display page of the user terminal corresponding to the background server, and perform subsequent processing according to the object type and the location information.
Alternatively, it is understood that the backend server may detect the service data (such as images or videos) collected to the respective corresponding user terminals, and determine the object class to which the object included in the service data belongs, and the position information of the object in the service data (images or videos). The user can check the object type and the position information determined by the background server on the display page of the user terminal corresponding to the background server.
It is understood that the method provided by the embodiment of the present application can be executed by a computer device, including but not limited to a user terminal or a service server. The user terminal and the service server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
For easy understanding, please refer to fig. 2a, which is a schematic view of a scenario provided by an embodiment of the present application. The service server shown in fig. 2a may be the service server 1000, and the terminal a shown in fig. 2a may be any one user terminal selected from the user terminal cluster in the embodiment corresponding to fig. 1, for example, the user terminal 100a.
As shown in fig. 2a, a user a uploads an image 20a through a user terminal a, and the user terminal a can send the image 20a to the service server. After receiving the image 20a, the service server can perform image scaling on the image 20a and then perform image division. The specific method is that the service server can obtain an image scaling size, and according to the image scaling size, the service server can scale the image 20a to obtain a transition image; subsequently, the service server may divide the transition image to obtain a target image including K grid pixels. The image scaling size may be an artificially set value, for example, 416 × 416, 318 × 318 or 256 × 256; the specific form of the image scaling size is not limited in the present application and is not exhaustively illustrated here. K is an integer greater than 1, and the value of K is associated with the image scaling size; for example, if the image scaling size is 416 × 416, K may be 416 × 416 = 173056. As shown in fig. 2a, the size of the image 20a is 16 × 8 (length 16, width 8), the image scaling size obtained by the service server is 8 × 8 (length 8, width 8), and the service server needs to scale the image 20a to an image of size 8 × 8. As shown in fig. 2a, the service server may scale the length (16) of the image 20a to 8, with a scaling factor of 1/2 between 16 and 8; then, in order to keep the shape of the image 20a unchanged, the service server may scale the width (8) of the image 20a to 4 (8 × 1/2 = 4). As shown in fig. 2a, after scaling the image 20a in length and width, the size of the image 20a becomes 8 × 4, which does not reach 8 × 8; in order to make the size reach 8 × 8, the difference between the size 8 × 8 and the size 8 × 4 (e.g., the region P and the region Q in fig. 2a) may be filled.
As shown in fig. 2a, after filling the difference portion, i.e., the region P and the region Q, a transition image 20b can be obtained, where the size of the image 20b is 8 × 8. Subsequently, the service server may perform image division on the transition image 20b; for example, because the image scaling size is 8 × 8, the service server may divide the transition image 20b into 64 (8 × 8 = 64) grid pixels, where each grid pixel has the same size. In this way, the target image 20c (including 64 grid pixels) is obtained. The service server may then input the target image comprising 64 grid pixels into the image recognition model.
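For ease of understanding, a small sketch of the scaling and division just described: the image is scaled by a single factor so its shape is preserved, the remaining area (the regions P and Q) is filled, and the resulting transition image is treated as a grid of K unit pixels. The use of the Pillow library, the fill colour and the placement of the padding are assumptions made for the example.

```python
from PIL import Image

def letterbox_and_divide(original, target_size=8, fill=(128, 128, 128)):
    """Scale an image with its aspect ratio preserved, fill the remaining area,
    and report the number K of grid pixels in the resulting target image."""
    w, h = original.size
    scale = target_size / max(w, h)                 # e.g. 8 / 16 = 1/2
    new_w, new_h = int(w * scale), int(h * scale)   # 16 x 8 -> 8 x 4
    resized = original.resize((new_w, new_h))
    transition = Image.new('RGB', (target_size, target_size), fill)
    transition.paste(resized, (0, (target_size - new_h) // 2))  # fill regions P and Q
    k = target_size * target_size                   # number of grid pixels, e.g. 8 * 8 = 64
    return transition, k

target_image, k = letterbox_and_divide(Image.new('RGB', (16, 8)), target_size=8)
print(k)  # 64
```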
Further, in the image recognition model, the image features of the target image 20c may be extracted, the image features may be input to an image classification component in the image recognition model, and the convolution processing may be performed on the image features through the first convolution parameter in the image classification component, so as to obtain a classification convolution feature map; the image features are input to a position prediction component in the image recognition model, and convolution processing can be performed on the image features through a second convolution parameter in the position prediction component, so that a position convolution feature map can be obtained.
As shown in fig. 2a, the classification convolution feature map output by the image classification component includes classification feature point m1, classification feature point m2, classification feature point m3 and classification feature point m4. Each classification feature point in the classification convolution feature map corresponds to one grid pixel block among the 64 grid pixels in the target image 20c, and each classification feature point is obtained by convolving the grid pixel block corresponding to it; the size of each grid pixel block is determined by the downsampling multiple in the image recognition model, for example, if the downsampling multiple is 1/4, the size of each grid pixel block may be 4 × 4. Here, the downsampling multiple may be used to downsample the 64 grid pixels, which changes the number of grid pixels in the target image 20c and thus the size of the target image 20c. For easy understanding, please refer to fig. 2b, which is a schematic diagram illustrating the correspondence between grid pixel blocks and classification feature points according to an embodiment of the present application. As shown in fig. 2b, classification feature point m1 corresponds to grid pixel block g1 in the target image 20c, classification feature point m2 corresponds to grid pixel block g2, classification feature point m3 corresponds to grid pixel block g3, and classification feature point m4 corresponds to grid pixel block g4.
As shown in fig. 2b, the sizes of the grid pixel block g1, the grid pixel block g2, the grid pixel block g3 and the grid pixel block g4 are all 4 × 4 (each containing 16 grid pixels); that is, the downsampling multiple in the image recognition model is 1/4, and the 64 grid pixels in the target image 20c are divided into 4 grid pixel blocks (g1, g2, g3 and g4) by the downsampling multiple 1/4. The classification feature point m1 is obtained by performing multiple convolutions on the grid pixel block g1, the classification feature point m2 is obtained by performing multiple convolutions on the grid pixel block g2, the classification feature point m3 is obtained by performing multiple convolutions on the grid pixel block g3, and the classification feature point m4 is obtained by performing multiple convolutions on the grid pixel block g4.
As shown in fig. 2a, the position convolution feature map output by the position prediction component includes 4 position feature points, namely position feature point d1, position feature point d2, position feature point d3 and position feature point d4. Each position feature point in the position convolution feature map corresponds to one grid pixel block among the 64 grid pixels in the target image 20c, and each position feature point is obtained by performing multiple convolutions on the grid pixel block corresponding to it; the size of each grid pixel block is determined by the downsampling multiple in the image recognition model. For easy understanding, please refer to fig. 2c, which is a schematic diagram illustrating the correspondence between grid pixel blocks and position feature points according to an embodiment of the present application. As shown in fig. 2c, position feature point d1 corresponds to grid pixel block g1 in the target image 20c, position feature point d2 corresponds to grid pixel block g2, position feature point d3 corresponds to grid pixel block g3, and position feature point d4 corresponds to grid pixel block g4. As shown in fig. 2c, the sizes of the grid pixel blocks g1, g2, g3 and g4 are all 4 × 4 (each containing 16 grid pixels); that is, the downsampling multiple in the image recognition model is 1/4, and the 64 grid pixels in the target image 20c are divided into 4 grid pixel blocks (g1, g2, g3 and g4) by the downsampling multiple 1/4. The position feature point d1 is obtained by performing multiple convolutions on the grid pixel block g1, the position feature point d2 is obtained by performing multiple convolutions on the grid pixel block g2, the position feature point d3 is obtained by performing multiple convolutions on the grid pixel block g3, and the position feature point d4 is obtained by performing multiple convolutions on the grid pixel block g4.
With reference to fig. 2b and 2c, it can be seen that, in the image recognition model, 64 grid pixels of the target image 20c can be divided by the downsampling multiple to obtain a plurality of grid pixel blocks, and classification feature points or position feature points corresponding to each grid pixel block can be obtained by convolving the grid pixel blocks, so that a classification convolution feature map or a position convolution feature map can be obtained. The convolution parameters adopted when the grid pixel blocks are convolved to obtain the classification characteristic points are different from the convolution parameters adopted when the grid pixel blocks are convolved to obtain the position characteristic points.
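For ease of understanding, the arithmetic behind the correspondence described above, written out as a small sketch: with a downsampling multiple of 1/4 (a reduction factor of 4), an 8 × 8 target image yields a 2 × 2 feature map, i.e., 4 feature points, each backed by a 4 × 4 grid pixel block.

```python
def feature_map_layout(image_side=8, downsample_factor=4):
    """With a downsampling multiple of 1/downsample_factor, an image_side x image_side
    target image yields this many feature points, each covering one grid pixel block."""
    points_per_side = image_side // downsample_factor   # 8 // 4 = 2
    num_feature_points = points_per_side ** 2            # 2 * 2 = 4 feature points
    block_side = downsample_factor                        # each block is 4 x 4 grid pixels
    return num_feature_points, block_side

print(feature_map_layout())  # (4, 4): four feature points, each backed by a 4 x 4 block
```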
With reference to fig. 2b and 2c, it can be seen that the classification feature point m1 and the position feature point d1 in fig. 2a correspond to the same grid pixel block g1, the classification feature point m2 and the position feature point d2 correspond to the same grid pixel block g2, the classification feature point m3 and the position feature point d3 correspond to the same grid pixel block g3, and the classification feature point m4 and the position feature point d4 correspond to the same grid pixel block g4. The output parameters corresponding to each classification feature point are [object existence probability: class prediction probabilities], and the output parameters corresponding to each position feature point are [position offset, prediction frame width, prediction frame height]. When a classification feature point and a position feature point correspond to the same grid pixel block, the output parameters [object existence probability: class prediction probabilities] of the classification feature point correspond to the output parameters [position offset, prediction frame width, prediction frame height] of the position feature point. By combining the output parameters of the classification feature point with the output parameters of the position feature point, it can be determined whether the prediction frame corresponding to the position feature point contains the target object; the object class of the object in the prediction frame can be determined from the class prediction probabilities of the classification feature point, and the position information of the prediction frame in the target image can be determined from the position offset, the width of the prediction frame and the height of the prediction frame. For example, as shown in fig. 2a, the prediction frames corresponding to the position feature point d1 include a prediction frame h1 and a prediction frame h2, and the output parameters corresponding to the classification feature point m1 include the output parameters corresponding to the prediction frame h1 and the output parameters corresponding to the prediction frame h2 (i.e., the object existence probability and the class prediction probabilities corresponding to the prediction frame h1 and the prediction frame h2 respectively). For example, the output parameters corresponding to the classification feature point m1 are [0.10: 0.01, 0.03, 0.98] and [0.23: 0.02, 0.01, 0.94]. The output parameters [0.10: 0.01, 0.03, 0.98] correspond to the prediction box h1: the value 0.10 means that the object existence probability of the prediction box h1 is 0.10, and the values 0.01, 0.03 and 0.98 are prediction class probabilities, where the prediction class corresponding to the value 0.01 is elephant, the prediction class corresponding to the value 0.03 is wolf, and the prediction class corresponding to the value 0.98 is monkey. Since the object existence probability 0.10 is less than the probability threshold 0.5, the service server can determine that no object exists in the prediction box h1, directly filter out the prediction class probabilities 0.01, 0.03 and 0.98, and no longer determine the object class of the object in the prediction box h1 according to these prediction class probabilities.
Similarly, for the output parameters [0.23: 0.02, 0.01, 0.94] corresponding to the prediction box h2, the object existence probability 0.23 is less than the probability threshold 0.5, so the service server can likewise directly filter out the prediction class probabilities 0.02, 0.01 and 0.94, and no longer determine the object class of the object in the prediction box h2 according to these prediction class probabilities.
As shown in fig. 2a, the prediction boxes corresponding to the position feature point d2 include a prediction box e1 and a prediction box e2, and the output parameters corresponding to the classification feature point m2 are the output parameters corresponding to the prediction box e1 and the output parameters corresponding to the prediction box e2 (i.e., the object existence probability and the class prediction probabilities corresponding to the prediction boxes e1 and e2, respectively). For example, the output parameters corresponding to the classification feature point m2 are [0.98: 0.01, 0.97, 0.03] and [0.99: 0.02, 0.76, 0.56]. The output parameters [0.98: 0.01, 0.97, 0.03] correspond to the prediction box e1: the value 0.98 is the object existence probability of the prediction box e1, and the values 0.01, 0.97 and 0.03 are prediction class probabilities, where the prediction class corresponding to the value 0.01 is cartoon, the prediction class corresponding to the value 0.97 is illegal watermark, and the prediction class corresponding to the value 0.03 is letter. Since the object existence probability 0.98 is greater than the probability threshold 0.5, the service server may determine that an object exists in the prediction box e1; and since the maximum of the prediction class probabilities 0.01, 0.97 and 0.03 is 0.97, the service server may determine the prediction class "illegal watermark" corresponding to the maximum prediction class probability 0.97 as the object class of the object in the prediction box e1. Similarly, for the output parameters [0.99: 0.02, 0.76, 0.56], since the object existence probability 0.99 is greater than the probability threshold 0.5, the service server may determine that an object exists in the prediction box e2; the service server may determine the maximum prediction class probability 0.76 from the prediction class probabilities 0.02, 0.76 and 0.56, and determine the prediction class "illegal watermark" corresponding to that maximum prediction class probability as the object class of the object in the prediction box e2.
Similarly, according to the output parameters of the classification feature point m3, the service server can determine that the prediction box r1 corresponding to the position feature point d3 contains an object of "monkey", and that the prediction box r2 also contains an object of "monkey". Likewise, according to the output parameters of the classification feature point m4, the service server may determine that no object exists in the prediction box f1 corresponding to the position feature point d4 and that no object exists in the prediction box f2.
Further, the position information of the prediction frame h1, the prediction frame h2, the prediction frame e1, the prediction frame e2, the prediction frame r1, the prediction frame r2, the prediction frame f1 and the prediction frame f2 in the target image 20c is determined based on the output parameters [position offset, width of the prediction frame, height of the prediction frame] of the position feature point d1, the position feature point d2, the position feature point d3 and the position feature point d4; the position information of the prediction frames in the target image 20c may be as shown in fig. 2a.
It can be seen that the object in the prediction box e1 and the prediction box e2 is the violation watermark "www … … com". Because the prediction box e1 contains the entire content of the violation watermark "www … … com" while the prediction box e2 does not, the score of the prediction box e1 (which can be understood as the coverage of the object) is higher than that of the prediction box e2; and because the overlap ratio between the prediction box e1 and the prediction box e2 is high, the service server may delete the prediction box e2 and determine the prediction box e1 as the position information of the violation watermark "www … … com" in the grid pixel block g2 (i.e., in the target image 20c).
Similarly, according to the output parameters of the classification feature point m3, the service server may determine that the prediction box r1 corresponding to the position feature point d3 contains an object of "monkey", and the prediction box r1 is the position information of the "monkey" object in the grid pixel block g3 (i.e., in the target image 20c). Further, the service server may determine that the image 20a contains the sensitive watermark "www … … com", and determine the position information of the sensitive watermark "www … … com" in the image 20a and the position information of the "monkey" in the image 20a. Meanwhile, the service server may delete the prediction boxes that do not contain an object (e.g., the prediction box h1, the prediction box h2, the prediction box f1 and the prediction box f2), so that the service server can obtain the recognition result of the target image 20c (the prediction box r1 contains an object of "monkey", and the prediction box e1 contains the violation-sensitive watermark "www … … com"). The service server may return the recognition result to the background server of the user terminal a, and the user a may view the result on the display page of the user terminal a and perform subsequent processing; for example, the user a may delete the illegal watermark "www … … com" to ensure that the image 20a is a legal image.
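For ease of understanding, a small sketch of the post-processing just described: prediction boxes whose object existence probability is below the probability threshold are discarded, and when two remaining boxes overlap heavily, the lower-scoring one is deleted. An intersection-over-union criterion (non-maximum-suppression style) is assumed here; the text does not specify the exact overlap measure, and the example boxes and scores are invented for illustration.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def filter_predictions(predictions, prob_threshold=0.5, overlap_threshold=0.5):
    """Drop boxes with low object existence probability, then drop lower-scoring
    boxes that overlap heavily with a box already kept (NMS-style suppression)."""
    kept = []
    candidates = sorted(
        (p for p in predictions if p['objectness'] > prob_threshold),
        key=lambda p: p['score'], reverse=True)
    for cand in candidates:
        if all(overlap_ratio(cand['box'], k['box']) < overlap_threshold for k in kept):
            kept.append(cand)
    return kept

# Invented example: two heavily overlapping watermark boxes and one low-objectness box.
boxes = [
    {'box': (8, 0, 12, 3), 'objectness': 0.98, 'score': 0.97, 'label': 'illegal watermark'},
    {'box': (9, 0, 12, 2), 'objectness': 0.99, 'score': 0.76, 'label': 'illegal watermark'},
    {'box': (0, 0, 3, 3),  'objectness': 0.10, 'score': 0.98, 'label': 'monkey'},
]
print([b['label'] for b in filter_predictions(boxes)])  # only the higher-scoring watermark box remains
```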
For easy understanding, please refer to fig. 2d together, and fig. 2d is a diagram of an application scenario provided in the embodiment of the present application. The service server shown in fig. 2d may be the service server 1000, and the terminal a shown in fig. 2d may be any one user terminal selected from the user terminal cluster in the embodiment corresponding to fig. 1, for example, the user terminal may be the user terminal 100 a.
As shown in fig. 2d, the user a uploads an image 20a (such as the image 20a in the embodiment corresponding to fig. 2a) through the user terminal a, and the user terminal a may send the image 20a to the service server. After receiving the image 20a, the service server may identify the image 20a and determine the target objects included in the image 20a and the position information of the target objects in the image 20a. As shown in fig. 2d, the service server may determine that the image 20a includes the target object illegal watermark "www … … com" and the target object "monkey", and the service server frames the position of the illegal watermark "www … … com" and the position of the "monkey" in the image 20a. Then, the service server may return the identification result to the background server corresponding to the user terminal a, and the user terminal a may create an object selection interface according to the identification result returned by the service server, so that the user a may select the object to be deleted in the object selection interface. As shown in fig. 2d, in the object selection interface, the user terminal a may mark the illegal watermark "www … … com" (e.g., gray out the area S where the illegal watermark is located), so that the user a may know that the object in the area S is a violation object. As shown in fig. 2d, the user a selects the illegal watermark "www … … com" in the area S in the object selection interface, and the user terminal a may delete the illegal watermark "www … … com" from the image 20a. After the deletion succeeds, the user a can view the "deleted successfully" information in the deletion success interface. If the user a has no other object to delete, the user a can click "confirm"; through this click trigger operation, the user terminal a can jump to the image viewing interface, where the user a can view the image after the object is deleted. As shown in fig. 2d, the image 20a in the image viewing interface no longer contains the violation object, and the image 20a after the illegal watermark "www … … com" is deleted is a legal image.
For easy understanding, please refer to fig. 2e, and fig. 2e is a diagram of an application scenario provided by an embodiment of the present application. The service server shown in fig. 2e may be the service server 1000, and the terminal B shown in fig. 2e may be any one user terminal selected from the user terminal cluster in the embodiment corresponding to fig. 1, for example, the user terminal may be the user terminal 100B.
As shown in fig. 2e, the user B may be an image auditing staff member who audits the images uploaded by other users. As shown in fig. 2e, the user B may send the image 20a uploaded by another user (such as the image 20a in the embodiment corresponding to fig. 2a) to the service server through the user terminal, so as to detect whether the image 20a includes a violation object. As shown in fig. 2e, after receiving the image 20a, the service server may identify the image 20a, determine that the image 20a includes the illegal watermark "www … … com", and then determine that the image 20a is an illegal image. Subsequently, the service server may return the identification result (the image 20a is an illegal image and contains the illegal watermark "www … … com") to the background server of the user terminal B. After receiving the identification result, the user terminal B may query the information (e.g., name, user photo, home address, and the like) of the image provider (e.g., the user a) of the illegal image 20a, create a user viewing interface, and display the information of the user a on the user viewing interface, so that the user B may view, in the user viewing interface, the information of the user a who uploaded the illegal image 20a and perform subsequent processing (e.g., notify the user a that the uploaded image 20a is an illegal image and fails the review).
For easy understanding, please refer to fig. 3, and fig. 3 is a schematic flowchart of an image recognition method according to an embodiment of the present application. As shown in fig. 3, the process may include:
step S101, acquiring a target image, inputting the target image into an image recognition model, and extracting image features of the target image in the image recognition model.
In the present application, the target image includes K grid pixels obtained by image division, where K is an integer greater than 1. It can be understood that after an original image is acquired, the original image may be scaled according to an image scaling to obtain a transition image, and then the transition image is divided to obtain a target image including K grid pixels. Taking the embodiment corresponding to fig. 2a as an example, the image 20a uploaded by the user a is the original image, the size of the original image 20a is 16 × 8, and the obtained image scaling is 8 × 8. The length 16 of the original image 20a may be scaled to a length of 8 according to the image scaling, and the width 8 of the original image 20a may be scaled to a width of 4 according to the ratio 1/2 between the length 16 and the length 8. After the length and width are scaled, the obtained image size is 8 × 4, which does not yet reach the scaling size 8 × 8, so the difference portion may be filled, and the transition image 20b may be obtained. Then, the transition image 20b may be divided, so that the target image 20c including 8 × 8 = 64 grid pixels may be obtained.
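A minimal sketch of this scale-and-pad preprocessing (16 × 8 scaled to 8 × 4, then padded to 8 × 8) is shown below. It is an illustrative helper rather than the patent's implementation; the use of Pillow as the image backend, the centered placement of the scaled image, and the fill color are assumptions.

```python
from PIL import Image  # Pillow, used here only as an illustrative image backend

def scale_and_pad(original, target_w, target_h, fill=(128, 128, 128)):
    # Scale while keeping the aspect ratio so the image fits inside the
    # target size (e.g. 16x8 -> 8x4), then fill the difference portion so
    # the transition image matches the image scaling (e.g. 8x8).
    w, h = original.size
    ratio = min(target_w / w, target_h / h)
    new_w, new_h = int(w * ratio), int(h * ratio)
    resized = original.resize((new_w, new_h))
    canvas = Image.new("RGB", (target_w, target_h), fill)
    canvas.paste(resized, ((target_w - new_w) // 2, (target_h - new_h) // 2))
    return canvas
```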
Further, the target image may be input into an image recognition model, in which the image features of the target image may be extracted. For ease of understanding, please refer to fig. 4 together; fig. 4 is a schematic diagram of image feature extraction provided in an embodiment of the present application. As shown in fig. 4, the image recognition model includes a residual network (including a first feature extraction unit, a second feature extraction unit, and a third feature extraction unit), a first convolution layer, a second convolution layer, and a third convolution layer. A target image including 416 × 416 grid pixels may be input into the image recognition model, and the first feature extraction unit (including a DBL module, a res1 module, a res2 module, and a res8 module) of the image recognition model may extract the pixel feature of each grid pixel in the target image, so as to obtain 416 × 416 first pixel features. The DBL module, whose full name is Darknetconv2d_BN_Leaky, consists of a convolution layer (Darknetconv2d layer) + a batch normalization (BN) layer + an activation function (Leaky ReLU) layer; it is a basic module in the image recognition model and may be used to perform convolution processing on input data (e.g., the target image). The resn module (n may be 1, 2, 4, or 8) is composed of a zero padding layer + a DBL module + n res units; the resn module enables the image recognition model to use the residual structure of ResNet (residual network), and with the residual structure the image recognition model may extract deeper image features of the input data (e.g., the target image). A res unit consists of two DBL modules + a residual layer (add layer). It can be seen that both the DBL module and the resn module can be used for performing convolution processing on input data, and the first feature extraction unit including the DBL module and the resn modules can convolve the target image to extract the first pixel features of the target image. The image recognition model may be a YOLO (You Only Look Once) model.
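The DBL module (convolution + batch normalization + Leaky ReLU) and the res unit (two DBL modules followed by a residual addition) can be sketched in PyTorch roughly as follows; the kernel sizes, channel counts and class names are illustrative assumptions, not the patent's exact network definition.

```python
import torch.nn as nn

class DBL(nn.Module):
    # Darknetconv2d_BN_Leaky: convolution + batch normalization + Leaky ReLU.
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    # res unit: two DBL modules followed by a residual addition (add layer).
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(DBL(channels, channels // 2, kernel_size=1),
                                   DBL(channels // 2, channels, kernel_size=3))

    def forward(self, x):
        return x + self.block(x)
```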
Subsequently, the 416 × 416 first pixel features may be input to the second feature extraction unit (including res8 modules), where each first pixel feature may be subjected to convolution processing, thereby obtaining 416 × 416 second pixel features; the 416 × 416 second pixel features may then be input to the third feature extraction unit, where each second pixel feature may be subjected to convolution processing, so that 416 × 416 third pixel features may be obtained. Subsequently, the K third pixel features may be input into the first convolution layer (including a DBL module), where each third pixel feature may be convolved, so that 416 × 416 initial convolution pixel features may be obtained. According to the first downsampling multiple (e.g., 1/32 in fig. 4) corresponding to the third feature extraction unit, the 416 × 416 initial convolution pixel features may be downsampled, that is, divided into a plurality of convolution pixel feature points, where one convolution pixel feature point is obtained by convolving 32 × 32 initial convolution pixel features; through the division, the 416 × 416 initial convolution pixel features may be divided into 13 × 13 convolution pixel feature points, so that a convolution pixel feature including 13 × 13 convolution pixel feature points may be obtained. As shown in fig. 4, the convolution pixel feature including 13 × 13 convolution pixel feature points may be determined as the image feature of the target image.
Optionally, it is understood that, according to the first downsampling multiple (1/32) corresponding to the third feature extraction unit, the feature obtained by downsampling the 416 × 416 initial convolution pixel features may be determined as the first downsampled pixel feature (including 13 × 13 convolution pixel feature points). Then, the 416 × 416 second pixel features may be downsampled according to the second downsampling multiple (1/16 shown in fig. 4) corresponding to the second feature extraction unit, that is, divided into a plurality of pixel feature points, where one pixel feature point is obtained by convolving 16 × 16 second pixel features; through the division, the 416 × 416 second pixel features may be divided into 26 × 26 pixel feature points, so that a second downsampled pixel feature including 26 × 26 pixel feature points may be obtained. Subsequently, a first image upsampling multiple of 2 may be determined according to the first downsampling multiple (1/32) and the second downsampling multiple (1/16), and the first downsampled pixel feature including 13 × 13 convolution pixel feature points may be upsampled according to the first image upsampling multiple of 2, that is, enlarged by a factor of 2, so that an upsampled pixel feature including 26 × 26 convolution pixel sub-feature points may be obtained. Subsequently, the second downsampled pixel feature including 26 × 26 pixel feature points may be stitched (concatenated) with the upsampled pixel feature including 26 × 26 convolution pixel sub-feature points to obtain a first stitched feature; as shown in fig. 4, the first stitched feature may be input into the second convolution layer (including a DBL module), and after the first stitched feature is convolved in the second convolution layer, a convolution pixel feature including 26 × 26 convolution pixel feature points may be obtained, which may be determined as the image feature of the target image.
Optionally, it is understood that the feature obtained after the first stitched feature is convolved in the second convolution layer may also be determined as a convolution stitching feature (including 26 × 26 convolution pixel feature points). Then, the 416 × 416 first pixel features may be downsampled according to the third downsampling multiple (1/8 shown in fig. 4) corresponding to the first feature extraction unit, that is, divided into a plurality of pixel feature points, where one pixel feature point is obtained by convolving 8 × 8 first pixel features; through the division, the 416 × 416 first pixel features may be divided into 52 × 52 pixel feature points, so that a third downsampled pixel feature including 52 × 52 pixel feature points may be obtained. Then, a second image upsampling multiple of 2 can be determined according to the third downsampling multiple (1/8) and the second downsampling multiple (1/16), and the convolution stitching feature including 26 × 26 convolution pixel feature points can be upsampled according to the second image upsampling multiple of 2, that is, enlarged by a factor of 2, so that an upsampled stitching feature including 52 × 52 convolution pixel sub-feature points can be obtained. Subsequently, the third downsampled pixel feature including 52 × 52 pixel feature points may be stitched with the upsampled stitching feature including 52 × 52 convolution pixel sub-feature points to obtain a second stitched feature; as shown in fig. 4, the second stitched feature may be input into the third convolution layer (including a DBL module), and after the second stitched feature is convolved in the third convolution layer, a convolution pixel feature including 52 × 52 convolution pixel feature points may be obtained, which may be determined as the image feature of the target image.
In summary, the image feature of the target image may be a feature including 13 × 13 convolution pixel feature points, a feature including 26 × 26 convolution pixel feature points, or a feature including 52 × 52 convolution pixel feature points. Since the image feature with 26 × 26 convolution pixel feature points is obtained by 2× upsampling of the feature with 13 × 13 convolution pixel feature points, the image feature with 26 × 26 convolution pixel feature points is more specific than the image feature with 13 × 13 convolution pixel feature points; similarly, the image feature with 52 × 52 convolution pixel feature points is more specific than the image feature with 26 × 26 convolution pixel feature points.
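A rough sketch of how the three feature scales are combined (the finer scales obtained by 2× upsampling and stitching with shallower features) is given below. The intermediate DBL convolutions of the first, second and third convolution layers are omitted for brevity, and the channel counts and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def build_multiscale_features(p_52, p_26, p_13):
    # p_52, p_26, p_13: feature maps taken after the 1/8, 1/16 and 1/32
    # downsampling stages, with spatial sizes 52x52, 26x26 and 13x13.
    feat_13 = p_13                                  # coarsest scale, large objects
    up_26 = F.interpolate(feat_13, scale_factor=2)  # 13x13 -> 26x26 (2x upsampling)
    feat_26 = torch.cat([p_26, up_26], dim=1)       # stitch (concatenate) along channels
    up_52 = F.interpolate(feat_26, scale_factor=2)  # 26x26 -> 52x52 (2x upsampling)
    feat_52 = torch.cat([p_52, up_52], dim=1)       # finest scale, small objects
    return feat_13, feat_26, feat_52
```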
Step S102, inputting image characteristics into an image classification component and a position prediction component in an image recognition model respectively; the image classification component comprises a first convolution parameter focusing on object classification, and the position prediction component comprises a second convolution parameter focusing on position information prediction.
In the present application, both the image classification component and the position prediction component are included in the image recognition model. The image classification component contains a first convolution parameter focused on object classification, that is, the image classification component can be used to identify the object class of a target object in a target image; the position prediction component contains a second convolution parameter focused on position information prediction, that is, the position prediction component can be used to determine the position information of the target object in the target image.
Step S103, in the image classification component, performing convolution processing on the image features through the first convolution parameters to obtain a classification convolution feature map.
In this application, the first convolution parameter is a parameter in the image classification component and may be used to perform convolution processing on the image feature; the first convolution parameter may convolve each convolution pixel feature point in the image feature, so as to obtain a classification convolution feature map including a plurality of classification feature points. Each classification feature point is obtained by convolving one convolution pixel feature point; for example, if the image feature includes 13 × 13 convolution pixel feature points, after the image feature is convolved with the first convolution parameter, a classification convolution feature map including 13 × 13 classification feature points can be obtained.
And step S104, in the position prediction component, performing convolution processing on the image characteristics through the second convolution parameters to obtain a position convolution characteristic diagram.
In this application, the second convolution parameter is a parameter in the position prediction component and may be used to perform convolution processing on the image feature; each convolution pixel feature point in the image feature may be convolved with the second convolution parameter, so that a position convolution feature map including a plurality of position feature points may be obtained. Each position feature point is obtained by convolving one convolution pixel feature point; for example, if the image feature includes 13 × 13 convolution pixel feature points, after the image feature is convolved with the second convolution parameter, a position convolution feature map including 13 × 13 position feature points can be obtained.
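Under this design, the same image feature is fed to two independent convolution branches: one holds the first convolution parameter (classification) and one holds the second convolution parameter (position prediction). A minimal PyTorch sketch is given below; the single 1 × 1 convolution per branch, the number of prediction boxes per feature point, the number of classes (here 3, matching the cartoon/illegal watermark/letter example) and the dimensionality of the position offset are assumptions for illustration, not the patent's exact component structure.

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    # Two independent heads over the same shared image feature:
    # the classification branch learns the first convolution parameter,
    # the position branch learns the second convolution parameter.
    def __init__(self, in_ch, boxes_per_point=2, num_classes=3):
        super().__init__()
        # per prediction box: 1 object existence probability + num_classes class prediction probabilities
        self.cls_head = nn.Conv2d(in_ch, boxes_per_point * (1 + num_classes), kernel_size=1)
        # per prediction box: [position offset (2 values), width, height]
        self.loc_head = nn.Conv2d(in_ch, boxes_per_point * 4, kernel_size=1)

    def forward(self, image_feature):
        cls_map = self.cls_head(image_feature)  # classification convolution feature map
        loc_map = self.loc_head(image_feature)  # position convolution feature map
        return cls_map, loc_map
```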
For ease of understanding, please refer to fig. 5 together; fig. 5 is a schematic diagram of determining a classification convolution feature map and a position convolution feature map according to an embodiment of the present application. As shown in fig. 5, a target image including 416 × 416 grid pixels is input into the image recognition model, and the image recognition model may extract the image features of the target image. For a specific implementation of extracting the image features of the target image with 416 × 416 grid pixels, reference may be made to the description of extracting image features in the embodiment corresponding to fig. 4. The image features are input into the image classification component in the image recognition model, and convolution processing is performed on the image features through the first convolution parameter of the image classification component to obtain a classification convolution feature map. For example, taking the image feature including 13 × 13 convolution pixel feature points as an example, as shown in fig. 5, the image feature including 13 × 13 convolution pixel feature points is input to the image classification component (including a DBL module and a conv module), and the image classification component may convolve each of the 13 × 13 convolution pixel feature points, so as to obtain a classification convolution feature map including 13 × 13 classification feature points. The image features are also input into the position prediction component in the image recognition model, and convolution processing is performed on the image features through the second convolution parameter of the position prediction component to obtain a position convolution feature map.
For example, taking the image feature including 13 × 13 convolution pixel feature points as an example, as shown in fig. 5, the image feature including 13 × 13 convolution pixel feature points is input to the position prediction component (including a DBL module and a conv module), and each of the 13 × 13 convolution pixel feature points is convolved by the position prediction component, so that a position convolution feature map including 13 × 13 position feature points can be obtained. As shown in fig. 5, the image feature corresponding to the 52 × 52 classification convolution feature map (including 52 × 52 convolution pixel feature points) is more specific than the image feature corresponding to the 26 × 26 classification convolution feature map (including 26 × 26 convolution pixel feature points), and the image feature corresponding to the 26 × 26 classification convolution feature map is more specific than the image feature corresponding to the 13 × 13 classification convolution feature map (including 13 × 13 convolution pixel feature points). Therefore, the 13 × 13 classification convolution feature map can be used to classify large objects (e.g., objects with a large area) in the target image, the 26 × 26 classification convolution feature map can be used to classify medium objects (e.g., objects with a moderate area) in the target image, and the 52 × 52 classification convolution feature map can be used to classify small objects (e.g., objects with a small area) in the target image. Similarly, the 13 × 13 position convolution feature map may be used to determine the position information of large objects in the target image, the 26 × 26 position convolution feature map may be used to determine the position information of medium objects in the target image, and the 52 × 52 position convolution feature map may be used to determine the position information of small objects in the target image.
And step S105, outputting the object class of the target object in the target image according to the classification convolution characteristic diagram, and outputting the position information of the target object in the target image according to the position convolution characteristic diagram.
In the present application, each classification feature point in the classification convolution feature map corresponds to one grid pixel block in the target image, for example, taking the embodiment corresponding to fig. 2a as an example, the classification feature point m1 in fig. 2a corresponds to a 4 × 4 grid pixel block g1 in the target image 20c, and the classification feature point m1 is obtained by performing convolution on the 4 × 4 grid pixel block g1 multiple times through an image recognition model. When performing image classification, an object existence probability output by a classification feature point may be obtained, where the object existence probability refers to a probability that a target object (e.g., a person, an animal, a website, a letter, a text, etc.) exists in a prediction box to which the classification feature point belongs, and the prediction box is used for predicting a position of the target object in a target image. For example, taking the embodiment corresponding to fig. 2a as an example, the classification feature point m1 corresponds to a grid pixel block g1, and the 4 × 4 grid pixel block g1 is included in 64 grid pixels of the target image 20 c. For example, taking the embodiment corresponding to fig. 4 as an example, when the third feature extraction unit outputs the third pixel feature, if the third pixel feature is determined as the target pixel feature, the size of the grid pixel block (i.e., 32 × 32) may be determined according to the downsampling multiple (1/32) of the third feature extraction unit, that is, the target image including 416 × 416 grid pixels is divided into the target images including 13 × 13 grid pixel blocks (each grid pixel block is composed of 32 × 32 grid pixels).
It will be appreciated that the classification feature points can be obtained by performing multiple convolutions on the grid pixel blocks. A classification feature point can output one or more output parameters, where each output parameter includes an object existence probability and class prediction probabilities, and each output parameter corresponds to one prediction box. Further, the class prediction probabilities output by a classification feature point may be obtained; when the object existence probability output by the classification feature point is greater than the probability threshold, it may be determined that an object exists in the prediction box. Then the maximum class prediction probability may be determined from the class prediction probabilities, and the class corresponding to the maximum class prediction probability is determined as the object class predicted by the classification feature point; the object class predicted by the classification feature point may then be determined as the object class of the prediction box to which the classification feature point belongs (i.e., the object class of the object included in the prediction box).
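The decision rule in this step (compare the object existence probability with the probability threshold, then take the class with the maximum class prediction probability) can be sketched as follows, using the example values from fig. 2a. The flat-list representation of an output parameter and the function name are illustrative assumptions; the threshold 0.5 follows the description above.

```python
def classify_prediction_box(output_params, class_names, prob_threshold=0.5):
    # output_params: [object existence probability, class prediction probability 1, ...]
    object_prob, class_probs = output_params[0], output_params[1:]
    if object_prob <= prob_threshold:
        return None  # no object exists in this prediction box
    best_index = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return class_names[best_index]

classes = ["cartoon", "illegal watermark", "letter"]
print(classify_prediction_box([0.98, 0.01, 0.97, 0.03], classes))  # "illegal watermark" (box e1)
print(classify_prediction_box([0.99, 0.02, 0.76, 0.56], classes))  # "illegal watermark" (box e2)
```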
In the present application, each position feature point in the position convolution feature map corresponds to one grid pixel block in the target image, for example, taking the embodiment corresponding to fig. 2a as an example, the convolution feature point d1 in fig. 2a corresponds to the 4 × 4 grid pixel block g1 in the target image 20c, and the position feature point d1 is obtained by continuously convolving the 4 × 4 grid pixel block g1 with the image recognition model. That is, after the image recognition model convolves the same grid pixel block with different convolution parameters (e.g., the first convolution parameter and the second convolution parameter), the classification feature point and the position feature point can be obtained respectively. And the output parameters of the classification feature points correspond to the prediction frames corresponding to the position feature points. By classifying the output parameters of the feature points, it can be determined whether an object exists in the prediction frame corresponding to the location feature points, and to which object class the object belongs. The output parameter of the position feature point may determine position information of the prediction frame in the target image (that is, position information of the target object in the prediction frame in the target image), specifically, a prediction position parameter (including a position offset, a width of the prediction frame, and a height of the prediction frame) corresponding to the position feature point in the position convolution feature map may be obtained, then, a center position coordinate corresponding to the position convolution feature point may be obtained, and the position information of the target object may be determined according to the prediction position parameter and the center position coordinate. The specific method may be that, according to the position offset and the center position coordinate, the center position information corresponding to the prediction frame predicted by the position feature point may be determined; then, based on the center position information, the width of the prediction frame, and the height of the prediction frame, the position information of the prediction frame in the target image may be determined, and the position information of the prediction frame in the target image may be determined as the position information of the target object. The position offset is an offset of a vertex position (e.g., an upper left corner coordinate and an upper right corner coordinate) of the prediction frame with respect to the grid pixel block, and the central position information of the prediction frame in the grid pixel block can be obtained by adding or subtracting the position offset and the central position coordinate of the grid pixel block.
For example, taking the embodiment corresponding to fig. 2a as an example, the predicted position parameters output by the position feature point d3 in the position convolution feature map are [a, w, h], and the prediction frame corresponding to the predicted position parameters [a, w, h] is the prediction frame r1. The parameter a is the position offset; please refer to fig. 2c. Since the grid pixel block corresponding to the position feature point d3 is the grid pixel block g3, the parameter a may be used to characterize the offset of the upper left corner coordinate of the prediction frame r1 relative to the center position of the grid pixel block g3. By adding the parameter a to the center position coordinate of the grid pixel block g3, the center position information of the prediction frame r1 in the grid pixel block g3 can be obtained; based on this center position information, the parameter w (the width of the prediction frame r1) and the parameter h (the height of the prediction frame r1), the position information of the prediction frame r1 in the grid pixel block g3 is obtained. Because the prediction frame r1 has the highest score (the highest coverage of the object) among the prediction frames corresponding to the grid pixel block g3, the position information of the prediction frame r1 can be determined as the position information of the target object in the grid pixel block g3.
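The position decoding described above (add the position offset to the grid pixel block's center coordinate, then apply the predicted width and height) can be sketched as follows. Returning corner coordinates is an illustrative choice, and the two-component offset convention is an assumption based on the example of fig. 2c.

```python
def decode_prediction_box(offset_x, offset_y, box_w, box_h,
                          cell_center_x, cell_center_y):
    # Center position of the prediction frame = grid pixel block center + position offset.
    center_x = cell_center_x + offset_x
    center_y = cell_center_y + offset_y
    # Convert center / width / height into corner coordinates in the target image.
    x1, y1 = center_x - box_w / 2, center_y - box_h / 2
    x2, y2 = center_x + box_w / 2, center_y + box_h / 2
    return (x1, y1, x2, y2)
```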
In the embodiment of the application, after the image features of a target image are extracted, the image features are respectively input into an image classification component and a position prediction component, convolution processing is performed on the image features through a first convolution parameter in the image classification component, a classification convolution feature map can be obtained, and then the object class of a target object in the target image can be output according to the classification convolution feature map; meanwhile, the image features are convoluted through the second convolution parameters in the position prediction component, a position convolution feature map can be obtained, and the position information of the target object in the target image can be output according to the position convolution feature map. Because the first convolution parameter in the image classification component focuses on object classification, the second convolution parameter in the position prediction component focuses on position information prediction, and the first convolution parameter and the second convolution parameter are independent and have different respective focus points, the classification accuracy can be improved when the first convolution parameter is used for carrying out image classification on image features, and the accuracy of position information prediction can be improved when the second convolution parameter is used for carrying out position information prediction on the image features; and because the image features are shared by the image classification component and the position prediction component, the image features do not need to be calculated twice, so that the classification accuracy can be improved, and the calculation amount can be saved, thereby improving the identification efficiency.
Further, please refer to fig. 6, where fig. 6 is a schematic flowchart of a model training process provided in the embodiment of the present application. As shown in fig. 6, the process may include:
step S201, obtaining an image sample, inputting the image sample into a sample image recognition model, and extracting a sample feature of the image sample in the sample image recognition model.
In the application, in order to improve the accuracy of the object class and the position information of the target object predicted by the image recognition model, the image recognition model can be trained and adjusted, so that the trained and adjusted image recognition model is optimal. The image recognition model here may be a YOLO (You Only Look Once) model, such as a yolo_v1 model, a yolo_v2 model, or a yolo_v3 model. The yolo_v1, yolo_v2 and yolo_v3 models may extract the image features of the input image through a feature extraction network (e.g., residual network + convolution layers), and the image features are input, as shared features, to the image classification component and the position prediction component in the image recognition model respectively.
After the image sample is acquired, the image sample can be input into the sample image recognition model, and the deep image features (sample features) of the image sample can be effectively extracted through the residual network and the convolution layers in the sample image recognition model. For a specific implementation of extracting the sample features of the image sample by the sample image recognition model, reference may be made to the description of extracting the image features of the target image by the image recognition model in the embodiment corresponding to fig. 4, which will not be repeated here.
Step S202, respectively inputting sample characteristics into a sample image classification component and a sample position prediction component of a sample image recognition model; the sample image classification component includes a first sample convolution parameter focused on object classification, and the sample position prediction component includes a second sample convolution parameter focused on position information prediction.
Step S203, in the sample image classification component, performing convolution processing on the sample characteristics through the first sample convolution parameters to obtain a sample classification convolution characteristic diagram.
And step S204, in the sample position prediction component, carrying out convolution processing on the sample characteristics through the second sample convolution parameters to obtain a sample position convolution characteristic diagram.
And S205, determining the prediction object type of the sample object in the image sample according to the sample classification convolution characteristic diagram, and determining the prediction position information of the sample object according to the sample position convolution characteristic diagram.
In the present application, for a specific implementation manner of step S202 to step S205, refer to the description of step S102 to step S105 in the embodiment corresponding to fig. 3, which will not be described again here.
Step S206, obtaining the object class label of the sample object, obtaining the position label information of the sample object, and determining the total loss value according to the predicted object class, the object class label, the predicted position information and the position label information.
In the application, a classification loss value can be determined according to the object class label of the sample object and the prediction object class; from the position tag information (real position information) of the sample object and the predicted position information, a position loss value can be determined; based on the classification loss value and the position loss value, a total loss value for image classification and position prediction can be generated. For example, the classification loss value and the position loss value may be added, and the total loss value may be determined as the result of the addition.
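A sketch of this loss combination is given below, assuming a cross-entropy term for the classification loss and a mean-squared-error term for the position loss; the specific loss functions and tensor shapes are illustrative assumptions, the description above only requires that the two values be combined (e.g., added) into a total loss.

```python
import torch.nn.functional as F

def total_loss(pred_class_logits, class_labels, pred_boxes, box_labels):
    # Classification loss: predicted object class vs. object class label
    # (class_labels are integer class indices).
    classification_loss = F.cross_entropy(pred_class_logits, class_labels)
    # Position loss: predicted position information vs. position label information.
    position_loss = F.mse_loss(pred_boxes, box_labels)
    # Total loss = classification loss + position loss.
    return classification_loss + position_loss
```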
Step S207, respectively adjusting the first sample convolution parameter and the second sample convolution parameter according to the total loss value, so as to obtain a first convolution parameter corresponding to the first sample convolution parameter and a second convolution parameter corresponding to the second sample convolution parameter.
In the present application, a first sample convolution parameter in a sample image classification component may be adjusted according to the total loss value, specifically, a partial derivative between the total loss value and the prediction object class may be calculated, and the first sample convolution parameter may be adjusted according to the partial derivative, so as to obtain a first convolution parameter; the second sample convolution parameter in the sample position prediction component may also be adjusted according to the total loss value, specifically, a partial derivative between the total loss value and the predicted position information may be calculated, and the second sample convolution parameter may be adjusted according to the partial derivative, so that the second convolution parameter may be obtained.
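The adjustment of the first and second sample convolution parameters by the total loss (computing partial derivatives and updating each branch's parameters) corresponds to an ordinary gradient step. The minimal sketch below reuses the total_loss sketch above; the model interface, optimizer choice and learning rate are assumptions.

```python
import torch

def training_step(model, optimizer, images, class_labels, box_labels):
    # The sample image recognition model is assumed to return the outputs of
    # its two branches: class logits and predicted box parameters.
    pred_class_logits, pred_boxes = model(images)
    loss = total_loss(pred_class_logits, class_labels, pred_boxes, box_labels)
    optimizer.zero_grad()
    loss.backward()   # partial derivatives of the total loss w.r.t. both sets of sample convolution parameters
    optimizer.step()  # adjust the first and second sample convolution parameters
    return loss.item()

# Illustrative setup (assumed hyperparameters):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```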
Step S208, determining the sample image classification component including the first convolution parameter as the image classification component, determining the sample position prediction component including the second convolution parameter as the position prediction component, and determining the sample image recognition model including the image classification component and the position prediction component as the image recognition model.
In the application, the parameters in the residual network and the parameters in the convolution layers of the image recognition model can also be adjusted through the total loss value, so that the image features extracted by the residual network and the convolution layers express the target image more specifically. After the adjustment, the image recognition model can be used in the present application: in the image recognition model, the image classification component including the first convolution parameter can more accurately recognize the object class of the target object in the target image, and the position prediction component including the second convolution parameter can more accurately determine the position information of the target object.
In the embodiment of the application, after the image features of a target image are extracted, the image features are respectively input into an image classification component and a position prediction component, convolution processing is performed on the image features through a first convolution parameter in the image classification component, a classification convolution feature map can be obtained, and then the object class of a target object in the target image can be output according to the classification convolution feature map; meanwhile, the image features are convoluted through the second convolution parameters in the position prediction component, a position convolution feature map can be obtained, and the position information of the target object in the target image can be output according to the position convolution feature map. Because the first convolution parameter in the image classification component focuses on object classification, the second convolution parameter in the position prediction component focuses on position information prediction, and the first convolution parameter and the second convolution parameter are independent and have different respective focus points, the classification accuracy can be improved when the first convolution parameter is used for carrying out image classification on image features, and the accuracy of position information prediction can be improved when the second convolution parameter is used for carrying out position information prediction on the image features; and because the image features are shared by the image classification component and the position prediction component, the image features do not need to be calculated twice, so that the classification accuracy can be improved, and the calculation amount can be saved, thereby improving the identification efficiency.
Further, please refer to fig. 7, and fig. 7 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. The image recognition means may be a computer program (including program code) running on a computer device, for example, the image recognition means is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The image recognition apparatus 1 may include: the system comprises an image acquisition module 11, a feature extraction module 12, a feature input module 13, a first feature convolution module 14, a second feature convolution module 15, a category output module 16 and a position output module 17.
The image acquisition module 11 is used for acquiring a target image and inputting the target image into the image recognition model;
a feature extraction module 12, configured to extract image features of the target image in the image recognition model;
the characteristic input module 13 is used for respectively inputting the image characteristics into an image classification component and a position prediction component in the image recognition model; the image classification component comprises a first convolution parameter concerned with object classification, and the position prediction component comprises a second convolution parameter concerned with position information prediction;
the first feature convolution module 14 is configured to perform convolution processing on the image features through the first convolution parameter in the image classification component to obtain a classification convolution feature map;
the second feature convolution module 15 is configured to perform convolution processing on the image features through the second convolution parameters in the position prediction component to obtain a position convolution feature map;
a category output module 16, configured to output an object category of the target object in the target image according to the classified convolution feature map;
and the position output module 17 is configured to output position information of the target object in the target image according to the position convolution feature map.
For specific implementation manners of the image obtaining module 11, the feature extracting module 12, the feature input module 13, the first feature convolution module 14, the second feature convolution module 15, the category output module 16, and the position output module 17, reference may be made to the descriptions of step S101 to step S105 in the embodiment corresponding to fig. 3, and details will not be repeated here.
The target image comprises K grid pixels after image division, the image recognition model comprises a residual network, and the residual network comprises a first feature extraction unit, a second feature extraction unit, a third feature extraction unit and a convolution layer; K is an integer greater than 1;
referring to fig. 7, the feature extraction module 12 may include: a first pixel feature output unit 121, a second pixel feature output unit 122, a third pixel feature output unit 123, a target pixel feature selection unit 124, a feature convolution unit 125, and an image feature generation unit 126.
A first pixel feature output unit 121, configured to input the target image to the first feature extraction unit, extract a pixel feature of each grid pixel in the target image in the first feature extraction unit, to obtain K first pixel features;
the second pixel feature output unit 122 is configured to input the K first pixel features to the second feature extraction unit, and perform convolution processing on each first pixel feature in the second feature extraction unit to obtain K second pixel features;
the third pixel feature output unit 123 is configured to input the K second pixel features to the third feature extraction unit, and perform convolution processing on each second pixel feature in the third feature extraction unit to obtain K third pixel features;
a target pixel feature selection unit 124 configured to select K target pixel features from the K first pixel features, the K second pixel features, and the K third pixel features;
a feature convolution unit 125, configured to input the K target pixel features into the convolution layer, and perform convolution processing on the K target pixel features in the convolution layer to obtain convolution pixel features;
and an image feature generating unit 126, configured to generate an image feature of the target image according to the convolved pixel features.
For a specific implementation manner of the first pixel characteristic output unit 121, the second pixel characteristic output unit 122, the third pixel characteristic output unit 123, the target pixel characteristic selection unit 124, the characteristic convolution unit 125, and the image characteristic generation unit 126, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be repeated herein.
Referring to fig. 7, the image recognition apparatus 1 may include: the image acquisition module 11, the feature extraction module 12, the feature input module 13, the first feature convolution module 14, the second feature convolution module 15, the category output module 16 and the position output module 17 may further include a scale acquisition module 18, an image scaling module 19 and an image division module 20.
A scale obtaining module 18, configured to obtain an original image and obtain an image scaling;
the image scaling module 19 is configured to perform image scaling on the original image according to the image scaling ratio to obtain a transition image;
and an image dividing module 20, configured to perform image division on the transition image to obtain a target image including K grid pixels.
For a specific implementation manner of the scale obtaining module 18, the image scaling module 19, and the image dividing module 20, reference may be made to the description of obtaining the target image in step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
Wherein the K target pixel features are K third pixel features; the convolution layer includes a first convolution layer;
referring to fig. 7, the feature convolution unit 125 may include: a first convolution sub-unit 1251 and a first downsampling sub-unit 1252.
A first convolution subunit 1251, configured to input the K third pixel features into the first convolution layer, and perform convolution processing on each third pixel feature in the first convolution layer to obtain K initial convolution pixel features;
and the first downsampling subunit 1252 is configured to obtain a first downsampling multiple corresponding to the third feature extraction unit, and downsample the K initial convolution pixel features according to the first downsampling multiple to obtain the convolution pixel features.
For specific implementation manners of the first convolution subunit 1251 and the first downsampling subunit 1252, reference may be made to the description in step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.
Wherein the convolution layer comprises a first convolution layer and a second convolution layer;
referring to fig. 7, the feature convolution unit 125 may include: a second convolution sub-unit 1253, a second downsampling sub-unit 1254, a third downsampling sub-unit 1255, an upsampling sub-unit 1256, and a feature splicing sub-unit 1257.
A second convolution subunit 1253, configured to input the K third pixel features into the first convolution layer, and perform convolution processing on each third pixel feature in the first convolution layer to obtain K initial convolution pixel features;
the second downsampling subunit 1254 is configured to obtain a first downsampling multiple corresponding to the third feature extraction unit, and perform downsampling on the K initial convolution pixel features according to the first downsampling multiple to obtain a first downsampling pixel feature;
the third downsampling subunit 1255 is configured to obtain a second downsampling multiple corresponding to the second feature extraction unit, and perform downsampling on the K second pixel features according to the second downsampling multiple to obtain second downsampled pixel features;
the upsampling subunit 1256 is configured to determine an upsampling multiple of the first image according to the first downsampling multiple and the second downsampling multiple, and perform upsampling on the first downsampled pixel feature according to the upsampling multiple of the first image to obtain an upsampled pixel feature;
the feature splicing subunit 1257 is configured to splice the second downsampled pixel features and the upsampled pixel features to obtain first splicing features if the K target pixel features are K second pixel features, input the first splicing features to the second convolution layer, and perform convolution processing on the first splicing features in the second convolution layer to obtain convolution pixel features.
For specific implementation manners of the second convolution sub-unit 1253, the second downsampling sub-unit 1254, the third downsampling sub-unit 1255, the upsampling sub-unit 1256, and the feature splicing sub-unit 1257, reference may be made to the description in step S101 in the embodiment corresponding to fig. 3, and details will not be repeated here.
Wherein the convolutional layer further comprises a third convolutional layer;
referring to fig. 7, the image recognition apparatus 1 may include: the image obtaining module 11, the feature extracting module 12, the feature input module 13, the first feature convolution module 14, the second feature convolution module 15, the category output module 16, the position output module 17, the scale obtaining module 18, the image scaling module 19, and the image dividing module 20 may further include: a first feature stitching module 21, a feature downsampling module 22, a feature upsampling module 23, and a second feature stitching module 24.
The first feature stitching module 21 is configured to, if the K target pixel features are K first pixel features, stitch the second downsampled pixel feature with the upsampled pixel feature to obtain a first stitch feature, input the first stitch feature to the second convolution layer, and perform convolution processing on the first stitch feature in the second convolution layer to obtain a convolution stitch feature;
the feature downsampling module 22 is configured to obtain a third downsampling multiple corresponding to the first feature extraction unit, and downsample the K first pixel features according to the third downsampling multiple to obtain a third downsampled pixel feature;
the feature upsampling module 23 is configured to determine a second image upsampling multiple according to the third downsampling multiple and the second downsampling multiple, and upsample the convolution splicing feature according to the second image upsampling multiple to obtain an upsampling splicing feature;
and the second feature splicing module 24 is configured to splice the third downsampling pixel feature and the upsampling splicing feature to obtain a second splicing feature, input the second splicing feature to a third convolution layer, and perform convolution processing on the second splicing feature in the third convolution layer to obtain a convolution pixel feature.
For a specific implementation manner of the first feature splicing module 21, the feature downsampling module 22, the feature upsampling module 23, and the second feature splicing module 24, reference may be made to the description in step S101 in the embodiment corresponding to fig. 3, and details will not be described here.
Referring to fig. 7, the category output module 16 may include: an existence probability acquisition unit 161, a prediction probability acquisition unit 162, and a category determination unit 163.
An existence probability obtaining unit 161, configured to obtain classification feature points in the classification convolution feature map, and obtain object existence probabilities corresponding to the classification feature points; the object existence probability refers to the probability that the target object exists in the prediction frame to which the classification characteristic point belongs; the prediction frame is used for predicting the position of the target object in the target image;
a prediction probability obtaining unit 162 configured to obtain a category prediction probability corresponding to the classification feature point;
a category determining unit 163, configured to, if the object existence probability is greater than the probability threshold, obtain the maximum category prediction probability from the category prediction probabilities, and determine a category corresponding to the maximum category prediction probability as an object category predicted by the classification feature point;
the class determining unit 163 is further configured to determine the object class predicted by the classification feature point as the object class of the prediction frame to which the classification feature point belongs.
For a specific implementation manner of the existence probability obtaining unit 161, the prediction probability obtaining unit 162, and the category determining unit 163, reference may be made to the description in step S105 in the embodiment corresponding to fig. 3, which will not be described herein again.
Referring to fig. 7, the position output module 17 may include: a position parameter acquisition unit 171, a pixel block acquisition unit 172, a center coordinate acquisition unit 173, and a position information determination unit 174.
A position parameter obtaining unit 171, configured to obtain position feature points in the position convolution feature map, and obtain predicted position parameters corresponding to the position feature points;
a pixel block obtaining unit 172, configured to obtain a grid pixel block corresponding to the position feature point in the K grid pixels; the position characteristic points are obtained by convolution of grid pixel blocks, and the size of each grid pixel block is determined by the downsampling multiple corresponding to the K target pixel characteristics;
a center coordinate acquiring unit 173 for acquiring a center position coordinate located at the grid pixel block;
and a position information determining unit 174 configured to determine position information of the target object based on the predicted position parameter and the center position coordinates.
For specific implementation of the position parameter obtaining unit 171, the pixel block obtaining unit 172, the center coordinate obtaining unit 173, and the position information determining unit 174, reference may be made to the description in step S105 in the embodiment corresponding to fig. 3, and details will not be repeated here.
The predicted position parameters comprise position offset, width of a predicted frame and height of the predicted frame; the prediction frame is used for predicting the position of the target object in the target image;
referring to fig. 7, the location information determining unit 174 may include: a center position determining subunit 1741 and a position information determining subunit 1742.
A central position determining subunit 1741, configured to determine, according to the position offset and the central position coordinate, central position information corresponding to the prediction frame;
a position information determining subunit 1742, configured to determine, according to the center position information, the width of the prediction frame, and the height of the prediction frame, the position information of the prediction frame in the target image, and determine the position information of the prediction frame in the target image as the position information of the target object.
For specific implementation of the central position determining subunit 1741 and the position information determining subunit 1742, reference may be made to the description in step S105 in the embodiment corresponding to fig. 3, which will not be described herein again.
Referring to fig. 7, the image recognition apparatus 1 may include: the image obtaining module 11, the feature extracting module 12, the feature input module 13, the first feature convolution module 14, the second feature convolution module 15, the category output module 16, the position output module 17, the scale obtaining module 18, the image scaling module 19, the image dividing module 20, the first feature stitching module 21, the feature down-sampling module 22, the feature up-sampling module 23, and the second feature stitching module 24 may further include: a sample acquisition module 25, a sample input module 26, a classification convolution module 27, a position convolution module 28, a result determination module 29, a loss value determination module 30, a parameter adjustment module 31, and a model determination module 32.
The sample obtaining module 25 is configured to obtain an image sample, input the image sample into the sample image recognition model, and extract a sample feature of the image sample in the sample image recognition model;
a sample input module 26, configured to input the sample features into a sample image classification component and a sample position prediction component of the sample image recognition model, respectively; the sample image classification component comprises a first sample convolution parameter concerned with object classification, and the sample position prediction component comprises a second sample convolution parameter concerned with position information prediction;
the classification convolution module 27 is configured to perform convolution processing on the sample features through the first sample convolution parameter in the sample image classification component to obtain a sample classification convolution feature map;
the position convolution module 28 is configured to perform convolution processing on the sample features through the second sample convolution parameter in the sample position prediction component to obtain a sample position convolution feature map;
a result determining module 29, configured to determine a predicted object class of the sample object in the image sample according to the sample classification convolution feature map, and determine predicted position information of the sample object according to the sample position convolution feature map;
a loss value determining module 30, configured to obtain an object class label of the sample object, obtain position label information of the sample object, and determine a total loss value according to the predicted object class, the object class label, the predicted position information, and the position label information;
the parameter adjusting module 31 is configured to respectively adjust the first sample convolution parameter and the second sample convolution parameter according to the total loss value, so as to obtain a first convolution parameter corresponding to the first sample convolution parameter and a second convolution parameter corresponding to the second sample convolution parameter;
a model determining module 32, configured to determine a sample image classification component including the first convolution parameter as an image classification component, determine a sample position prediction component including the second convolution parameter as a position prediction component, and determine a sample image recognition model including the image classification component and the position prediction component as an image recognition model.
For a specific implementation manner of the sample obtaining module 25, the sample input module 26, the classification convolution module 27, the position convolution module 28, the result determining module 29, the loss value determining module 30, the parameter adjusting module 31, and the model determining module 32, reference may be made to the description of step S201 to step S208 in the embodiment corresponding to fig. 6, which will not be described again here.
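As a rough illustration of how these training modules fit together, the sketch below runs one training iteration: the sample features are computed once, both branches are applied, and a single total loss drives the adjustment of both sample convolution parameter sets. The loss helper, the optimizer, and all names are assumptions for illustration only.

```python
# Hedged sketch of one training iteration (PyTorch-style); loss_fn stands in for
# the total-loss computation of the loss value determining module 30.
def train_step(backbone, cls_branch, pos_branch, optimizer, loss_fn,
               image_sample, class_label, position_label):
    sample_features = backbone(image_sample)  # sample features, extracted once
    cls_map = cls_branch(sample_features)     # sample classification convolution feature map
    pos_map = pos_branch(sample_features)     # sample position convolution feature map
    total_loss = loss_fn(cls_map, class_label, pos_map, position_label)
    optimizer.zero_grad()
    total_loss.backward()                     # partial derivatives for both parameter sets
    optimizer.step()                          # adjusts first and second sample convolution parameters
    return total_loss.item()
```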
Referring to fig. 7, the loss value determining module 30 may include: a classification loss determining unit 301, a position loss determining unit 302, and a loss value generating unit 303.
A classification loss determining unit 301, configured to determine a classification loss value according to the predicted object class and the object class label;
a location loss determination unit 302 for determining a location loss value based on the location tag information and the predicted location information;
a loss value generating unit 303, configured to generate a total loss value according to the classification loss value and the position loss value.
For specific implementation manners of the classification loss determining unit 301, the position loss determining unit 302, and the loss value generating unit 303, reference may be made to the description in step S206 in the embodiment corresponding to fig. 6, and details will not be described here.
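A minimal sketch of the three units is given below, under the assumption that the classification loss is a cross-entropy and the position loss a smooth-L1 regression loss with equal weighting; this application does not fix the concrete loss functions, so these choices are illustrative.

```python
import torch.nn.functional as F

def classification_loss(pred_cls, class_label):       # unit 301 (assumed cross-entropy)
    return F.cross_entropy(pred_cls, class_label)

def position_loss(pred_pos, position_label):          # unit 302 (assumed smooth-L1)
    return F.smooth_l1_loss(pred_pos, position_label)

def total_loss(pred_cls, class_label, pred_pos, position_label):  # unit 303
    return classification_loss(pred_cls, class_label) + position_loss(pred_pos, position_label)
```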
Referring to fig. 7, the parameter adjusting module 31 may include: a first derivation unit 311 and a second derivation unit 312.
A first derivation unit 311, configured to determine a first partial derivative between the total loss value and the predicted object class, and adjust the first sample convolution parameter according to the first partial derivative to obtain a first convolution parameter;
the second derivation unit 312 is configured to determine a second partial derivative between the total loss value and the predicted position information, and adjust the second sample convolution parameter according to the second partial derivative to obtain a second convolution parameter.
For specific implementation of the first derivation unit 311 and the second derivation unit 312, reference may be made to the description in step S207 in the embodiment corresponding to fig. 6, and details are not repeated here.
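The two partial derivatives follow the usual chain rule of back propagation. Writing $\theta_1$, $\theta_2$ for the first and second sample convolution parameters, $\hat{c}$, $\hat{p}$ for the predicted object class scores and the predicted position information, $L$ for the total loss value, and $\eta$ for an assumed learning rate (the concrete update rule is not specified in this application), the adjustment can be summarised as:

$$\frac{\partial L}{\partial \theta_1}=\frac{\partial L}{\partial \hat{c}}\,\frac{\partial \hat{c}}{\partial \theta_1},\qquad \frac{\partial L}{\partial \theta_2}=\frac{\partial L}{\partial \hat{p}}\,\frac{\partial \hat{p}}{\partial \theta_2},\qquad \theta_1\leftarrow\theta_1-\eta\,\frac{\partial L}{\partial \theta_1},\qquad \theta_2\leftarrow\theta_2-\eta\,\frac{\partial L}{\partial \theta_2}.$$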
In the embodiment of the present application, after the image features of a target image are extracted, the image features are respectively input into an image classification component and a position prediction component. In the image classification component, the image features are convolved with a first convolution parameter to obtain a classification convolution feature map, from which the object class of a target object in the target image can be output; in the position prediction component, the image features are convolved with a second convolution parameter to obtain a position convolution feature map, from which the position information of the target object in the target image can be output. Because the first convolution parameter in the image classification component focuses on object classification, the second convolution parameter in the position prediction component focuses on position information prediction, and the two parameters are independent with different respective focuses, classification accuracy can be improved when the first convolution parameter is used to classify the image features, and the accuracy of position information prediction can be improved when the second convolution parameter is used to predict position information from the image features. Moreover, because the image features are shared by the image classification component and the position prediction component, they do not need to be computed twice; therefore, classification accuracy can be improved while the amount of computation is reduced, thereby improving recognition efficiency.
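For concreteness, a minimal sketch of such a shared-feature, two-branch head is given below; the channel widths, kernel size, number of classes, and the use of 1x1 convolutions are illustrative assumptions rather than details of this application.

```python
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Hedged sketch: shared image features feed two independent convolutions."""
    def __init__(self, in_channels=256, num_classes=20, box_params=4):
        super().__init__()
        # First convolution parameter: focuses on object classification
        self.cls_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Second convolution parameter: focuses on position information prediction
        self.pos_conv = nn.Conv2d(in_channels, box_params, kernel_size=1)

    def forward(self, image_features):
        # The image features are shared, so they only need to be computed once upstream
        cls_map = self.cls_conv(image_features)  # classification convolution feature map
        pos_map = self.pos_conv(image_features)  # position convolution feature map
        return cls_map, pos_map
```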
Further, please refer to fig. 8, which is a schematic diagram of a computer device according to an embodiment of the present application. As shown in fig. 8, the computer device 1000 may include: at least one processor 1001 (for example, a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard, and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 8, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 8, the network interface 1004 is mainly used for network communication with the service server; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement the following steps (an illustrative usage sketch follows the list):
acquiring a target image, inputting the target image into an image recognition model, and extracting image characteristics of the target image from the image recognition model;
respectively inputting the image features into an image classification component and a position prediction component in an image recognition model; the image classification component comprises a first convolution parameter concerned with object classification, and the position prediction component comprises a second convolution parameter concerned with position information prediction;
in the image classification component, carrying out convolution processing on the image characteristics through a first convolution parameter to obtain a classification convolution characteristic diagram;
in the position prediction component, carrying out convolution processing on the image characteristics through a second convolution parameter to obtain a position convolution characteristic diagram;
and outputting the object class of the target object in the target image according to the classification convolution feature map, and outputting the position information of the target object in the target image according to the position convolution feature map.
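An illustrative run of the steps above, reusing the hypothetical TwoBranchHead sketched earlier together with a toy single-convolution backbone; both are assumptions for demonstration only, not the architecture of this application.

```python
import torch
import torch.nn as nn

# Toy backbone standing in for the image recognition model's feature extractor.
backbone = nn.Sequential(nn.Conv2d(3, 256, kernel_size=3, stride=32, padding=1), nn.ReLU())
head = TwoBranchHead(in_channels=256, num_classes=20)  # sketched after the summary above

target_image = torch.randn(1, 3, 416, 416)   # one RGB target image (dummy data)
image_features = backbone(target_image)      # image features, extracted once
cls_map, pos_map = head(image_features)      # classification and position convolution feature maps
```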
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the image recognition method described in the embodiments corresponding to fig. 3 to fig. 6, and may also perform the functions of the image recognition apparatus 1 described in the embodiment corresponding to fig. 7, which will not be repeated here. In addition, the beneficial effects achieved by the same method are likewise not described again.
Further, it should be noted that an embodiment of the present application also provides a computer-readable storage medium, which stores the computer program executed by the aforementioned image recognition computer device 1000. The computer program includes program instructions which, when executed by the processor, can perform the image recognition method described in the embodiments corresponding to fig. 3 to fig. 6, which will therefore not be repeated here. In addition, the beneficial effects achieved by the same method are likewise not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of the method embodiments of the present application.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only intended to illustrate the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; the present application is not limited thereto, and all equivalent variations and modifications fall within the scope of the present application.

Claims (15)

1. An image recognition method, comprising:
acquiring a target image, inputting the target image into an image recognition model, and extracting image features of the target image in the image recognition model;
inputting the image features into an image classification component and a position prediction component in the image recognition model respectively; the image classification component comprises a first convolution parameter focusing on object classification, and the position prediction component comprises a second convolution parameter focusing on position information prediction;
in the image classification component, performing convolution processing on the image features through the first convolution parameters to obtain a classification convolution feature map;
in the position prediction component, performing convolution processing on the image features through the second convolution parameters to obtain a position convolution feature map;
and outputting the object class of the target object in the target image according to the classification convolution feature map, and outputting the position information of the target object in the target image according to the position convolution feature map.
2. The method of claim 1, wherein the target image comprises K grid pixels after image division, and the image recognition model comprises a residual network comprising a first feature extraction unit, a second feature extraction unit, a third feature extraction unit, and a convolutional layer; K is an integer greater than 1;
the extracting of the image features of the target image in the image recognition model comprises:
inputting the target image into the first feature extraction unit, and extracting the pixel features of each grid pixel in the target image in the first feature extraction unit to obtain K first pixel features;
inputting the K first pixel features into the second feature extraction unit, and performing convolution processing on each first pixel feature in the second feature extraction unit to obtain K second pixel features;
inputting the K second pixel features into the third feature extraction unit, and performing convolution processing on each second pixel feature in the third feature extraction unit to obtain K third pixel features;
selecting K target pixel features from the K first pixel features, the K second pixel features and the K third pixel features;
inputting the K target pixel features into the convolution layer, and performing convolution processing on the K target pixel features in the convolution layer to obtain convolution pixel features;
and generating the image characteristics of the target image according to the convolution pixel characteristics.
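A minimal sketch of the extraction pipeline of claim 2 follows, with plain strided convolutions standing in for the residual feature extraction units; the channel widths, strides, and the choice of the third pixel features as the target pixel features (one of the options covered by claim 4) are assumptions for illustration only.

```python
import torch.nn as nn

class FeatureExtractorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.unit1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())    # first feature extraction unit
        self.unit2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())  # second feature extraction unit
        self.unit3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU()) # third feature extraction unit
        self.conv = nn.Conv2d(256, 256, kernel_size=1)                                     # convolutional layer

    def forward(self, target_image):
        first = self.unit1(target_image)  # K first pixel features
        second = self.unit2(first)        # K second pixel features
        third = self.unit3(second)        # K third pixel features
        target = third                    # target pixel features (assumed: the third pixel features)
        return self.conv(target)          # convolution pixel features -> image features
```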
3. The method of claim 2, further comprising:
acquiring an original image and acquiring an image scaling;
according to the image scaling, carrying out image scaling on the original image to obtain a transition image;
and carrying out image division on the transition image to obtain a target image comprising K grid pixels.
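A small sketch of the scaling and division in claim 3, assuming bilinear scaling, a 32-pixel grid cell, and an (N, C, H, W) tensor layout; all three choices are illustrative assumptions.

```python
import torch.nn.functional as F

def scale_and_divide(original_image, image_scaling=0.5, cell_size=32):
    # Scale the original image by the image scaling factor to obtain the transition image
    transition_image = F.interpolate(original_image, scale_factor=image_scaling,
                                     mode="bilinear", align_corners=False)
    _, _, h, w = transition_image.shape
    k = (h // cell_size) * (w // cell_size)  # K grid pixels after image division
    return transition_image, k
```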
4. The method of claim 2, wherein the K target pixel features are the K third pixel features; the convolutional layers comprise a first convolutional layer;
the inputting the K target pixel features into the convolutional layer, and performing convolution processing on the K target pixel features in the convolutional layer to obtain convolutional pixel features, includes:
inputting the K third pixel features into the first convolution layer, and performing convolution processing on each third pixel feature in the first convolution layer to obtain K initial convolution pixel features;
and acquiring a first downsampling multiple corresponding to the third feature extraction unit, and downsampling the K initial convolution pixel features according to the first downsampling multiple to obtain the convolution pixel features.
5. The method of claim 2, wherein the convolutional layers comprise a first convolutional layer and a second convolutional layer;
the inputting the K target pixel features into the convolutional layer, and performing convolution processing on the K target pixel features in the convolutional layer to obtain convolutional pixel features, includes:
inputting the K third pixel features into the first convolution layer, and performing convolution processing on each third pixel feature in the first convolution layer to obtain K initial convolution pixel features;
acquiring a first downsampling multiple corresponding to the third feature extraction unit, and downsampling the K initial convolution pixel features according to the first downsampling multiple to obtain first downsampling pixel features;
acquiring a second downsampling multiple corresponding to the second feature extraction unit, and downsampling the K second pixel features according to the second downsampling multiple to obtain second downsampling pixel features;
determining a first image up-sampling multiple according to the first down-sampling multiple and the second down-sampling multiple, and up-sampling the first down-sampling pixel characteristic according to the first image up-sampling multiple to obtain an up-sampling pixel characteristic;
and if the K target pixel features are the K second pixel features, splicing the second down-sampling pixel features and the up-sampling pixel features to obtain first splicing features, inputting the first splicing features to the second convolution layer, and performing convolution processing on the first splicing features in the second convolution layer to obtain convolution pixel features.
6. The method of claim 5, wherein the convolutional layer further comprises a third convolutional layer;
the method further comprises the following steps:
if the K target pixel features are the K first pixel features, splicing the second downsampling pixel feature and the upsampling pixel feature to obtain a first splicing feature, inputting the first splicing feature to the second convolution layer, and performing convolution processing on the first splicing feature in the second convolution layer to obtain a convolution splicing feature;
acquiring a third down-sampling multiple corresponding to the first feature extraction unit, and performing down-sampling on the K first pixel features according to the third down-sampling multiple to obtain a third down-sampling pixel feature;
determining a second image up-sampling multiple according to the third down-sampling multiple and the second down-sampling multiple, and up-sampling the convolution splicing feature according to the second image up-sampling multiple to obtain an up-sampling splicing feature;
and splicing the third down-sampling pixel characteristic and the up-sampling splicing characteristic to obtain a second splicing characteristic, inputting the second splicing characteristic into the third convolution layer, and performing convolution processing on the second splicing characteristic in the third convolution layer to obtain a convolution pixel characteristic.
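The splicing steps of claims 5 and 6 follow a familiar upsample-concatenate-convolve pattern; the sketch below shows one such fusion, with the concrete downsampling multiples (32 and 16) and nearest-neighbour upsampling chosen purely as assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_features(deep_features, shallow_features, conv_layer,
                  deep_downsample=32, shallow_downsample=16):
    up_multiple = deep_downsample // shallow_downsample        # image up-sampling multiple
    upsampled = F.interpolate(deep_features, scale_factor=up_multiple, mode="nearest")
    spliced = torch.cat([shallow_features, upsampled], dim=1)  # splicing feature
    return conv_layer(spliced)                                 # convolution pixel features
```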
7. The method of claim 2, wherein outputting the object class of the target object in the target image according to the classified convolution feature map comprises:
acquiring classification characteristic points in the classification convolution characteristic diagram, and acquiring object existence probability corresponding to the classification characteristic points; the object existence probability refers to the probability that the target object exists in the prediction frame to which the classification feature point belongs; the prediction frame is used for predicting the position of the target object in the target image;
acquiring category prediction probability corresponding to the classification feature points;
if the object existence probability is greater than a probability threshold, obtaining a maximum class prediction probability from the class prediction probabilities, and determining a class corresponding to the maximum class prediction probability as the class of the object predicted by the classification feature point;
and determining the object class predicted by the classification feature point as the object class of the prediction frame to which the classification feature point belongs.
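A compact sketch of the decision rule in claim 7: a classification feature point contributes an object class only when its object existence probability exceeds the threshold, and the class is then the one with the maximum class prediction probability. The 0.5 threshold is an assumption.

```python
import torch

def decode_object_class(object_existence_prob, class_pred_probs, prob_threshold=0.5):
    if object_existence_prob <= prob_threshold:
        return None                                    # no target object predicted in this frame
    return int(torch.argmax(class_pred_probs).item())  # object class of the prediction frame
```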
8. The method of claim 2, wherein outputting the position information of the target object according to the position convolution feature map comprises:
acquiring position characteristic points in the position convolution characteristic diagram, and acquiring predicted position parameters corresponding to the position characteristic points;
acquiring a grid pixel block corresponding to the position characteristic point from the K grid pixels; the position feature points are obtained by convolution of the grid pixel blocks, and the size of each grid pixel block is determined by the downsampling multiple corresponding to the K target pixel features;
acquiring the coordinates of the center position of the grid pixel block;
and determining the position information of the target object according to the predicted position parameters and the central position coordinates.
9. The method of claim 8, wherein the predicted position parameters include a position offset, a width of a prediction box, and a height of the prediction box; the prediction frame is used for predicting the position of the target object in the target image;
the determining the position information of the target object according to the predicted position parameter and the central position coordinate includes:
determining central position information corresponding to the prediction frame according to the position offset and the central position coordinate;
and determining the position information of the prediction frame in the target image according to the center position information, the width of the prediction frame and the height of the prediction frame, and determining the position information of the prediction frame in the target image as the position information of the target object.
10. The method of claim 1, further comprising:
acquiring an image sample, inputting the image sample into a sample image recognition model, and extracting sample characteristics of the image sample in the sample image recognition model;
inputting the sample features into a sample image classification component and a sample position prediction component of the sample image recognition model respectively; the sample image classification component comprises a first sample convolution parameter focusing on object classification, and the sample position prediction component comprises a second sample convolution parameter focusing on position information prediction;
in the sample image classification component, carrying out convolution processing on the sample features through the first sample convolution parameters to obtain a sample classification convolution feature map;
in the sample position prediction component, performing convolution processing on the sample characteristics through the second sample convolution parameters to obtain a sample position convolution characteristic diagram;
determining a predicted object class of a sample object in the image sample according to the sample classification convolution feature map, and determining predicted position information of the sample object according to the sample position convolution feature map;
obtaining an object class label of the sample object, obtaining position label information of the sample object, and determining a total loss value according to the predicted object class, the object class label, the predicted position information and the position label information;
respectively adjusting the first sample convolution parameter and the second sample convolution parameter according to the total loss value to obtain a first convolution parameter corresponding to the first sample convolution parameter and a second convolution parameter corresponding to the second sample convolution parameter;
determining a sample image classification component comprising the first convolution parameter as an image classification component, determining a sample position prediction component comprising the second convolution parameter as a position prediction component, and determining a sample image recognition model comprising the image classification component and the position prediction component as an image recognition model.
11. The method of claim 10, wherein determining a total loss value based on the predicted object class, the object class label, the predicted location information, and the location label information comprises:
determining a classification loss value according to the predicted object class and the object class label;
determining a position loss value according to the position label information and the predicted position information;
and generating the total loss value according to the classification loss value and the position loss value.
12. The method of claim 10, wherein the adjusting the first sample convolution parameter and the second sample convolution parameter according to the total loss value to obtain a first convolution parameter corresponding to the first sample convolution parameter and a second convolution parameter corresponding to the second sample convolution parameter comprises:
determining a first partial derivative between the total loss value and the predicted object class, and adjusting the first sample convolution parameter according to the first partial derivative to obtain a first convolution parameter;
and determining a second partial derivative between the total loss value and the predicted position information, and adjusting the second sample convolution parameter according to the second partial derivative to obtain a second convolution parameter.
13. An image recognition apparatus, comprising:
the image acquisition module is used for acquiring a target image and inputting the target image to the image recognition model;
the characteristic extraction module is used for extracting the image characteristics of the target image from the image recognition model;
the characteristic input module is used for respectively inputting the image characteristics into an image classification component and a position prediction component in the image recognition model; the image classification component comprises a first convolution parameter focusing on object classification, and the position prediction component comprises a second convolution parameter focusing on position information prediction;
the first feature convolution module is used for performing convolution processing on the image features through the first convolution parameters in the image classification component to obtain a classification convolution feature map;
the second feature convolution module is used for performing convolution processing on the image features through the second convolution parameters in the position prediction component to obtain a position convolution feature map;
the class output module is used for outputting the object class of the target object in the target image according to the classified convolution feature map;
and the position output module is used for outputting the position information of the target object in the target image according to the position convolution characteristic diagram.
14. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide network communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of any of claims 1-12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-12.
CN202010604125.0A 2020-06-29 2020-06-29 Image identification method, device, equipment and readable storage medium Pending CN111738280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604125.0A CN111738280A (en) 2020-06-29 2020-06-29 Image identification method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010604125.0A CN111738280A (en) 2020-06-29 2020-06-29 Image identification method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111738280A true CN111738280A (en) 2020-10-02

Family

ID=72651617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604125.0A Pending CN111738280A (en) 2020-06-29 2020-06-29 Image identification method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111738280A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418243A (en) * 2020-10-28 2021-02-26 北京迈格威科技有限公司 Feature extraction method and device and electronic equipment
CN112270356A (en) * 2020-10-28 2021-01-26 杭州海康威视数字技术股份有限公司 Image identification method and device and electronic equipment
CN112270356B (en) * 2020-10-28 2023-10-13 杭州海康威视数字技术股份有限公司 Image recognition method and device and electronic equipment
CN112613401A (en) * 2020-12-22 2021-04-06 贝壳技术有限公司 Face detection method and device, electronic equipment and storage medium
CN112926531B (en) * 2021-04-01 2023-09-26 深圳市优必选科技股份有限公司 Feature information extraction method, model training method, device and electronic equipment
CN112926531A (en) * 2021-04-01 2021-06-08 深圳市优必选科技股份有限公司 Feature information extraction method, model training method and device and electronic equipment
CN113822314A (en) * 2021-06-10 2021-12-21 腾讯云计算(北京)有限责任公司 Image data processing method, apparatus, device and medium
CN113822314B (en) * 2021-06-10 2024-05-28 腾讯云计算(北京)有限责任公司 Image data processing method, device, equipment and medium
CN113554095A (en) * 2021-07-26 2021-10-26 湖南国科微电子股份有限公司 Feature map processing method and device and computer equipment
CN115185435A (en) * 2021-09-30 2022-10-14 红科网安(武汉)科技有限公司 Method for controlling display of key drawing plotting special elements
CN114022960A (en) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 Model training and behavior recognition method and device, electronic equipment and storage medium
CN114022960B (en) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 Model training and behavior recognition method and device, electronic equipment and storage medium
CN115845248A (en) * 2023-02-28 2023-03-28 安徽通灵仿生科技有限公司 Positioning method and device of ventricular catheter pump

Similar Documents

Publication Publication Date Title
CN111738280A (en) Image identification method, device, equipment and readable storage medium
CN109977956B (en) Image processing method and device, electronic equipment and storage medium
CN112052839B (en) Image data processing method, apparatus, device and medium
US20210334942A1 (en) Image processing method and apparatus, device, and storage medium
CN111768425B (en) Image processing method, device and equipment
CN112215171B (en) Target detection method, device, equipment and computer readable storage medium
CN113538480A (en) Image segmentation processing method and device, computer equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN113392270A (en) Video processing method, video processing device, computer equipment and storage medium
CN112954399B (en) Image processing method and device and computer equipment
CN110414335A (en) Video frequency identifying method, device and computer readable storage medium
CN114332467A (en) Image processing method and device, computer and readable storage medium
CN115131804A (en) Document identification method and device, electronic equipment and computer readable storage medium
CN114219971A (en) Data processing method, data processing equipment and computer readable storage medium
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN111741329B (en) Video processing method, device, equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN116977668A (en) Image recognition method, device, computer equipment and computer storage medium
CN113628349B (en) AR navigation method, device and readable storage medium based on scene content adaptation
CN114494302A (en) Image processing method, device, equipment and storage medium
CN115861605A (en) Image data processing method, computer equipment and readable storage medium
CN115905605A (en) Data processing method, data processing equipment and computer readable storage medium
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
CN115082873A (en) Image recognition method and device based on path fusion and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40030637
Country of ref document: HK
SE01 Entry into force of request for substantive examination