CN116310592A - Image recognition method, training device, electronic equipment and storage medium

Image recognition method, training device, electronic equipment and storage medium

Info

Publication number
CN116310592A
Authority
CN
China
Prior art keywords
image
sample
category
processing
text
Prior art date
Legal status
Pending
Application number
CN202310398208.2A
Other languages
Chinese (zh)
Inventor
汪瑜
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310398208.2A
Publication of CN116310592A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The disclosure provides an image recognition method, a training device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, deep learning and computer vision. The specific implementation scheme of the image recognition method is as follows: extracting image features of an image to be recognized; obtaining text features of a target category, wherein the target category is determined according to an object intention, and the text features are obtained by processing the target category; and processing the image features and the text features to obtain a recognition result of the image to be recognized, wherein the recognition result characterizes the matching probability between a target object in the image to be recognized and the target category, and the position information of the target object.

Description

Image recognition method, training device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the fields of image processing, deep learning, and computer vision. In particular, it relates to an image recognition method, a training device, electronic equipment, and a storage medium.
Background
In the field of computer vision technology, tasks for object detection may include category detection tasks and location detection tasks.
With the development of artificial intelligence technology, the target detection technology is widely applied in different fields, so that the number of categories required to be detected is also increasing.
Disclosure of Invention
The disclosure provides an image recognition method, a training method, an apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided an image recognition method including: extracting image features of an image to be recognized; obtaining text features of a target category, wherein the target category is determined according to an object intention, and the text features are obtained by processing the target category; and processing the image features and the text features to obtain a recognition result of the image to be recognized, wherein the recognition result characterizes the matching probability between a target object in the image to be recognized and the target category, and the position information of the target object.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: extracting sample image features of a sample image; detecting the sample image features by using a pre-trained detection network to obtain information of a sample detection frame and information of a sample category; processing the information of the sample category to obtain text features of the sample category; processing the information of the sample detection frame to obtain position features of an initial detection frame; obtaining a category recognition result of the sample image and a position recognition result of the sample image according to the sample image features, the position features of the initial detection frame, and the text features of the sample category; obtaining, based on a target loss function, a category recognition loss and a position recognition loss according to the category recognition result of the sample image, the position recognition result of the sample image, the sample category information, and the information of the sample detection frame; and, with the model parameters of the pre-trained detection network fixed, adjusting model parameters of an initial model based on the category recognition loss and the position recognition loss to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an image recognition apparatus including: the device comprises a first extraction module, an acquisition module and a first identification module. And the first extraction module is used for extracting the image characteristics of the image to be identified. And the acquisition module is used for acquiring the text characteristics of the target category. The target category is determined according to the object intention, and the text characteristic is obtained by processing the target category. The first recognition module is used for processing the image features and the text features to obtain recognition results of the images to be recognized. The recognition result characterizes the matching probability of the target object of the image to be recognized and the target category and the position information of the target object.
According to another aspect of the present disclosure, there is provided a training apparatus of a deep learning model, including: a second extraction module, a second processing module, a third processing module, a second recognition module, a loss calculation module and an adjustment module. And the second extraction module is used for extracting sample image features of the sample image. The second processing module is used for detecting and processing the features of the sample image by utilizing a pre-trained detection network to obtain information of a sample detection frame and information of a sample category; and processing the information of the sample category to obtain the text features of the sample category. And the third processing module is used for processing the information of the sample detection frame to obtain the position features of the initial detection frame. And the second recognition module is used for obtaining a category recognition result of the sample image and a position recognition result of the sample image according to the sample image features, the position features of the initial detection frame and the text features of the sample category. The loss calculation module is used for obtaining the category recognition loss and the position recognition loss according to the category recognition result of the sample image, the position recognition result of the sample image, the sample category information and the information of the sample detection frame based on the target loss function. And the adjusting module is used for adjusting the model parameters of the initial model based on the category recognition loss and the position recognition loss, with the model parameters of the pre-trained detection network fixed, to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which image recognition methods or deep learning model training methods and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of processing image features and text features to obtain a recognition result of an image to be recognized according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a training method schematic of a deep learning model according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device adapted to implement an image recognition method or a training method of a deep learning model, according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the field of computer vision technology, image recognition models are trained using fixed target categories defined in a sample dataset. However, in the actual application scenario, when a new category other than the fixed target category needs to be identified, the image recognition model lacks features corresponding to the new category, so that the recognition accuracy of the image recognition model is reduced.
Fig. 1 schematically illustrates an exemplary system architecture to which image recognition methods or deep learning model training methods and apparatuses may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the training method and apparatus of the image recognition or deep learning model may be applied may include a terminal device, but the terminal device may implement the training method and apparatus of the image recognition or deep learning model provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages or the like. Various communication client applications, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (merely an example) providing support for content browsed by the user with the first terminal apparatus 101, the second terminal apparatus 102, the third terminal apparatus 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the image recognition method or the deep learning model training method provided by the embodiments of the present disclosure may be generally performed by the first terminal device 101, the second terminal device 102, and the third terminal device 103. Accordingly, the image recognition apparatus or the deep learning model training apparatus provided in the embodiments of the present disclosure may also be provided in the first terminal device 101, the second terminal device 102, and the third terminal device 103.
Alternatively, the image recognition method or the deep learning model training method provided by the embodiments of the present disclosure may also be generally performed by the server 105. Accordingly, the image recognition device or the deep learning model training device provided by the embodiments of the present disclosure may be generally provided in the server 105. The image recognition method or the deep learning model training method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the image recognition apparatus or the deep learning model training apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
For example, the first terminal device 101, the second terminal device 102, and the third terminal device 103 may acquire text features of the image to be identified and the target class, and then the text features of the image to be identified and the target class are sent to the server 105, and the server 105 analyzes the image to be identified and the target class to determine whether there is a target object belonging to the target class and a position of the target object in the image to be identified. Or by a server or server cluster capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105, analyzing the text features of the image to be identified and the target class, and finally realizing the identification of the class and the position of the target object in the image to be identified.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 2 schematically illustrates a flowchart of an image recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 includes operations S210 to S230.
In operation S210, image features of an image to be recognized are extracted.
In operation S220, text features of the target category are acquired.
In operation S230, the image features and the text features are processed to obtain a recognition result of the image to be recognized.
According to the embodiment of the disclosure, the image to be recognized may include an image of a signboard to be recognized, an image of an obstacle to be recognized, an image of a building to be recognized, an image of a vehicle to be recognized, and the like, and may also include an image of a face to be recognized or an image of a human body to be recognized. It should be noted that the recognition of a face image or human body image to be recognized is performed after user authorization is obtained, meets the requirements of related laws and regulations, takes necessary security measures, and does not violate public order.
According to embodiments of the present disclosure, the image features may include foreground features of the image to be recognized. Before the foreground features of the image to be recognized are extracted, the image to be recognized may be subjected to data enhancement operations, for example, scaling, cropping, rotation, and color enhancement, so as to reduce overfitting between foreground features, which would otherwise affect the accuracy of feature recognition.
According to an embodiment of the present disclosure, the target category may be a category determined according to the recognition intention of the object. For example: the image to be recognized may be an obstacle image to be recognized, the recognition intention of the object may be recognition of a green belt, and the target category may be a green belt.
According to an embodiment of the disclosure, the text features of the target category may be obtained by processing the target category using an image-text detection network. For example, the image-text detection network may be a CLIP (Contrastive Language-Image Pretraining) network model. The CLIP network model structure comprises an image recognition branch network and a text recognition branch network. The text recognition branch network can be used to process the target category to obtain the text features of the target category.
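As an illustration of this step, the following minimal sketch shows how text features of a target category could be obtained with a CLIP text encoder; the Hugging Face transformers API and the checkpoint name are assumptions for illustration and are not specified by the disclosure:

```python
# Sketch only: obtain text features of a target category with a CLIP text encoder.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

target_category = "green belt"  # target category determined from the object intention
inputs = tokenizer([target_category], padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)            # shape (1, d)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```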
According to embodiments of the present disclosure, the image features may include image content features and image location features. The image content features may include a plurality of foreground content features of the image to be identified. The image position feature may be a position feature of a plurality of detection frames corresponding to the plurality of foreground content features, respectively. The position features of the detection frame may include the center point coordinates of the detection frame and the width and height of the detection frame.
According to the embodiment of the disclosure, the image content characteristics and the text characteristics are processed, so that the matching probability of the target object of the image to be identified and the target category can be obtained, and whether the image to be identified comprises the target object matched with the target category or not can be determined according to the matching probability. And processing the image position characteristics to obtain the position information of the detection frame of the target object.
For example: the image to be identified may be an obstacle image to be identified, and the target class may be a green belt. The image features of the obstacle to be identified may include image features Pa, pb, pc. By processing the image features of the obstacle image to be identified and the text features of the green belt, the matching probability of the image features Pa and the text features of the green belt can be 0.2, the matching probability of the image features Pb and the text features of the green belt can be 0.9, and the matching probability of the image features Pc and the text features of the green belt can be 0.3. It is possible to determine that a green belt exists in the obstacle image to be identified, and the position information of the green belt is the position information of the detection frame corresponding to the image feature Pb.
According to the embodiment of the disclosure, the image characteristics and the text characteristics of the image to be identified are processed by acquiring the text characteristics of the target category, so that the matching probability of the target object of the image to be identified and the target category and the position information of the target object are obtained. The method can determine the target category to be identified according to the object intention, is independent of the fixed category predefined in the data set in the model training process, and improves the accuracy of image identification.
According to an embodiment of the present disclosure, the above operation S230 may include the following operations: processing the image content characteristics and the text characteristics to obtain a category identification result of the image to be identified; processing the image position characteristics to obtain a position recognition result of the image to be recognized; and obtaining a recognition result according to the category recognition result and the position recognition result.
According to embodiments of the present disclosure, in the field of computer vision technology, a target object in an image to be identified is generally present in the foreground of the image to be identified, and thus, the image content features may be foreground content features of the image to be identified. In the actual application scenario, when the background feature in the image to be identified needs to be identified, the image content feature may also be the background feature of the image to be identified.
According to the embodiment of the disclosure, the encoding module of a DETR (Detection Transformer) network can first be used to encode the image to be recognized to obtain encoded features, and the decoding module of the DETR network can then be used to decode the encoded features based on the attention mechanism to obtain the image content features and the image position features. The image content features and the text features are then processed to obtain the category recognition result of the image to be recognized.
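The following schematic sketch illustrates a DETR-style decoding step that produces content features and box position features from encoded image features; the module sizes, query count, and head design are illustrative assumptions rather than the exact network of the disclosure:

```python
import torch
import torch.nn as nn

class ToyDetrDecoder(nn.Module):
    """Schematic DETR-style decoder: learned queries attend to the encoded image
    and are split into image content features and image position features."""
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.queries = nn.Embedding(num_queries, d_model)
        self.box_head = nn.Linear(d_model, 4)   # center x, center y, width, height

    def forward(self, encoded_features):        # (B, H*W, d_model) from the encoder
        q = self.queries.weight.unsqueeze(0).expand(encoded_features.size(0), -1, -1)
        content_features = self.decoder(q, encoded_features)           # image content features
        position_features = self.box_head(content_features).sigmoid()  # image position features
        return content_features, position_features

decoder = ToyDetrDecoder()
content, boxes = decoder(torch.randn(1, 49, 256))   # e.g. a 7x7 encoded feature map
```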
According to an embodiment of the present disclosure, processing the image content feature and the text feature to obtain a category recognition result of the image to be recognized may include the following operations: obtaining a category distribution probability matrix according to the image content characteristics and the text characteristics, wherein the category distribution probability matrix comprises elements corresponding to the image content characteristics, and the element values of the elements represent the matching probability of the image content characteristics and the target categories; and obtaining a category identification result of the image to be identified according to the matching probability.
According to embodiments of the present disclosure, the text feature is a result of processing the target category using the CLIP network. The CLIP network adopts a double-tower structure consisting of an image recognition branch network and a text branch network. Thus, the CLIP network can align the image content features of the output target object with the category text features of the target object.
Therefore, the image content features and the text features can be subjected to matrix multiplication to obtain a category distribution probability matrix. The category distribution probability matrix includes elements corresponding to the image content features, the element values of the elements characterizing the probability of matching the image content features with the target categories. Thus, whether the image content characteristics of the target object matched with the target category are included in the image to be identified can be determined according to the matching probability.
For example: in the class distribution probability matrix, element E 1 Representing image content characteristics Pe 1 Element E 1 The element value of (2) may be the image content feature Pe 1 Probability of match Pr with target class 1 . By setting a matching probability threshold Pr t . In the case where the element value in the category distribution probability matrix is greater than the matching probability threshold, it may be determined that the image content feature corresponding to the element value is an image content feature of a target object that matches the target category. I.e. the image to be identified comprises a target object matching the target class.
According to the embodiment of the disclosure, the matching probability of the image content characteristics in the image to be identified and the target category can be obtained through the category distribution probability matrix. Compared with the identification method based on the fixed category defined in the model training process in the related art, the category to be identified can be determined according to the actual application requirement, the generalization of the image identification method is improved, and the accuracy of image identification is ensured.
In the field of computer recognition technology, in order to improve the accuracy of recognition, the number of regions of interest is generally much larger than the number of target detection frames. When the DETR network is applied to target detection, the image features are usually screened based on target features matched with predefined categories, and the screened image features are input into the fully connected layer to obtain the position information of the detection frame. However, in the detection process for a new category, the network model lacks target features corresponding to the new category, which makes it more difficult to screen the image position features.
Thus, according to an embodiment of the present disclosure, processing the image position feature to obtain a position recognition result of an image to be recognized may include the following operations: extracting target image position features from the image position features based on the category recognition result; and processing the position characteristics of the target image to obtain a position recognition result of the image to be recognized.
According to embodiments of the present disclosure, based on the category recognition results, image content features that match the target category may be determined. Since the image content feature corresponds to the image position feature, the image position feature matching the target category can be extracted from the image position features based on the category recognition result.
For example: in the category recognition result, the image content feature Pe 1 Is the image content feature with the highest matching probability with the target category. With image content characteristics Pe 1 The corresponding image position feature may be an image position feature Po 1 . It can be determined that the target image position feature is an image position feature Po 1
According to the embodiment of the disclosure, the processing of the target image position feature may be that the target image position feature is input into a fully connected layer of the DETR network, and a position recognition result is output. The position recognition result may characterize position information of a target object matching the target category in the image to be recognized.
According to the embodiment of the disclosure, the target image position features are extracted from the image position features based on the category recognition result, and then the target image position features are recognized, so that the data processing amount in the position recognition process is reduced, and the recognition efficiency is improved.
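The position-recognition step can be sketched as follows: the position feature belonging to the best-matching content feature is selected and passed through a fully connected layer. The single linear layer and the feature dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

match_prob = torch.rand(100, 1)               # matching probabilities from category recognition
position_features = torch.rand(100, 4)        # per-query position features (cx, cy, w, h)

target_idx = match_prob[:, 0].argmax()         # content feature best matching the target category
target_position_feature = position_features[target_idx]

box_head = nn.Linear(4, 4)                     # fully connected layer producing the box information
box = box_head(target_position_feature).sigmoid()   # normalized (cx, cy, w, h) of the target object
```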
Fig. 3 schematically illustrates a schematic diagram of processing image features and text features to obtain a recognition result of an image to be recognized according to an embodiment of the present disclosure.
As shown in fig. 3, in an embodiment 300, a target category 304 is processed to obtain text features 305 of the target category. The image 301 to be recognized is processed, resulting in image content features 302 and image position features 303. From the text features 305 of the target category and the image content features 302, a category distribution probability matrix 306 is obtained. The category recognition result 307 is obtained based on the category distribution probability matrix 306. Then, the target image position feature 308 is selected from the image position features 303 based on the category recognition result 307. The target image position feature 308 is processed to obtain a position recognition result 309.
The CLIP network is a multi-modal recognition model trained on a large-scale image-text dataset. In the embodiment of the disclosure, the CLIP network can be used to give the model recognition capability in the open domain, namely the capability of detecting and recognizing new categories outside the fixed categories defined by the training dataset.
According to an embodiment of the present disclosure, the above image recognition method further includes the following operations: and carrying out text detection on the target category to obtain the text characteristics of the target category.
According to the embodiment of the disclosure, text detection can be performed on the target category by using the text recognition branch of the CLIP network, so as to obtain the text characteristics of the target category.
For example: the object intent may be "need to detect traffic barrier". Inputting the statement 'traffic barrier to be detected' into a text encoder of the CLIP network to obtain text feature vectors of the target category, thereby obtaining the text features of the target category.
According to the embodiment of the disclosure, text features of the target category are obtained by performing text detection on the target category. By utilizing the multi-modal features output by the CLIP network, whether the image to be recognized includes a target object matching the target category can be determined according to the text features and the image features of the image to be recognized, so that the open-domain detection capability of image recognition is improved.
Fig. 4 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 4, the training method 400 may include operations S410 to S460.
In operation S410, sample image features of a sample image are extracted.
In operation S420, detecting the sample image features by using a pre-trained detection network to obtain information of a sample detection frame and information of a sample class; and processing the information of the sample category to obtain the text characteristics of the sample category.
In operation S430, the information of the sample detection frame is processed to obtain the position feature of the initial detection frame.
In operation S440, a category recognition result of the sample image and a location recognition result of the sample image are obtained according to the sample image features, the location features of the initial detection frame, and the text features of the sample category.
In operation S450, based on the target loss function, a class recognition loss and a position recognition loss are obtained from the class recognition result of the sample image, the position recognition result of the sample image, the sample class information, and the information of the sample detection frame.
In operation S460, with the model parameters of the pre-trained detection network fixed, model parameters of the initial model are adjusted based on the category recognition loss and the position recognition loss to obtain a trained deep learning model.
According to embodiments of the present disclosure, during target detection the target object is typically present in the foreground of the image. Thus, the sample image features may be all foreground features of the sample image.
According to the embodiment of the disclosure, the pre-trained detection network can detect the sample image characteristics to obtain the category of the foreground characteristics and the text characteristics corresponding to the category of the foreground characteristics. In the process of training the deep learning model, the category of the foreground features can be used as a true value of the category.
According to the embodiment of the disclosure, the pre-trained detection network can be used for detecting the characteristics of the sample image, so that the information of the sample detection frame can be obtained.
According to the embodiment of the disclosure, the information of the sample detection frame may be scaled according to the size of the sample image. For example: the size of the sample image may be HxW, where H represents the height of the sample image and W represents the width of the sample image. The position features of the initial detection frame can be obtained by changing the height of the sample detection frame to h/H and the width to w/W while keeping the position of the geometric center point of the sample detection frame unchanged. In embodiments of the present disclosure, h and w may represent randomly generated noise values. In the process of training the deep learning model, the information of the sample detection frame can be used as a true value of the position.
According to embodiments of the present disclosure, the location features of the initial detection frame may include features of 4 dimensions: the abscissa of the geometric center point of the detection frame, the ordinate of the geometric center point of the detection frame, the height of the detection frame and the width of the detection frame.
According to the embodiment of the disclosure, feature fusion can be performed on the foreground features and the position features of the initial detection frame to obtain a first feature matrix comprising image content features and image position features. A second feature matrix may be constructed from the text features of the sample class.
For example: the initial detection frames may be N, and the first feature matrix may be an Nxd-dimensional feature matrix. The sample class may be C and the second feature matrix may be a Cxd dimensional feature matrix. Wherein N, C, d are integers greater than 1.
According to the embodiment of the disclosure, the first feature matrix and the second feature matrix can be multiplied to obtain an NxC-dimensional feature matrix. In the NxC-dimensional feature matrix, each element corresponds to a foreground feature and a category, and each element value represents the matching probability of that foreground feature with that category. The category recognition result of the sample image is thus obtained, namely the matching probability of each foreground feature in the sample image with each sample category.
According to the embodiment of the disclosure, the position recognition result of the sample image may be position information of the initial detection frame obtained by recognizing the position of the initial detection frame.
According to an embodiment of the present disclosure, the objective loss function may include: cross entropy loss function, regression loss function. Based on the cross entropy loss function, the class identification loss can be obtained according to the sample class information and the identification result of the sample class. Based on the regression loss function, the position identification loss can be obtained according to the position identification result of the sample image and the information of the sample detection frame.
According to embodiments of the present disclosure, when the model parameters of the initial model are adjusted, the model parameters of the pre-trained detection network are fixed, so that only the model parameters of the initial model are adjusted. The initial model may include a feature extraction module, an attention module, and a recognition module. Model parameters of the feature extraction module, the attention module, and the recognition module may be adjusted based on the category recognition loss and the position recognition loss so that the category recognition loss and the position recognition loss reach a convergence condition.
According to an embodiment of the present disclosure, the convergence condition may be a preset convergence threshold, and the convergence condition is indicated to be reached when the sum of the category identification loss and the location identification loss is smaller than the convergence threshold. In calculating the sum of the category recognition loss and the position recognition loss, different weights may be set for the category recognition loss and the position recognition loss, respectively, for example: the ratio of the weights of the category identification loss and the location identification loss may be 1:5.
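A sketch of the parameter freezing, weighted loss sum, and convergence check described above is given below, using toy stand-in modules and losses; the module definitions and learning rate are assumptions, while the 1:5 weighting and the idea of a convergence threshold come from the text:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pre-trained detection network and the initial model.
pretrained_detection_network = nn.Linear(16, 16)
initial_model = nn.Linear(16, 16)

for p in pretrained_detection_network.parameters():
    p.requires_grad = False                   # model parameters of the pre-trained network are fixed

optimizer = torch.optim.AdamW(initial_model.parameters(), lr=1e-4)

w_cls, w_pos = 1.0, 5.0                       # 1:5 weighting of category and position losses
convergence_threshold = 1e-3                  # illustrative value

out = initial_model(torch.randn(8, 16))
category_recognition_loss = out.pow(2).mean()  # stand-in for the category recognition loss
position_recognition_loss = out.abs().mean()   # stand-in for the position recognition loss

total_loss = w_cls * category_recognition_loss + w_pos * position_recognition_loss
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
converged = total_loss.item() < convergence_threshold
```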
According to the embodiment of the disclosure, the initial model is trained by extracting class-agnostic image features of a sample image and utilizing the sample category information, the text features of the sample categories, and the sample detection frame information output by the pre-trained detection network. Because the recognition result of the sample category is obtained according to the sample image features and the text features, the initial model can effectively recognize new categories outside the training sample definitions, which improves the open-domain detection capability and generalization of the model.
According to an embodiment of the present disclosure, the above operation S420 may include the following operations: and carrying out category detection processing on the sample image characteristics by utilizing a pre-trained text detection network to obtain sample category information. And carrying out text recognition on the information of the sample category by utilizing the pre-trained text detection network to obtain the text characteristics of the sample category. And carrying out position detection processing on the sample image characteristics by utilizing a pre-trained position detection network to obtain information of a sample detection frame.
According to an embodiment of the present disclosure, the pre-trained image-text detection network may be the aforementioned CLIP network, which is not described in detail again here.
In accordance with embodiments of the present disclosure, the pre-trained detection network may include a pre-trained image-text detection network (CLIP network) and a pre-trained position detection network. The pre-trained position detection network is used for detecting the position information of the foreground features to obtain the information of the sample detection frame, namely the information of the foreground candidate frame.
In accordance with an embodiment of the present disclosure, the pre-trained position detection network may be an RPN (Region Proposal Network) for acquiring foreground candidate frames. Networks constructed using other algorithms are also possible, for example, a selective search algorithm. By constructing a fully connected layer for binary classification, the foreground classification probability and the background classification probability can be obtained, so that the information of the candidate frames of the foreground features can be determined. The method for acquiring the foreground candidate frame in the embodiment of the disclosure is not particularly limited.
According to the embodiment of the disclosure, the pre-trained image-text detection network may include an image recognition branch detection network and a text recognition branch detection network, and the pre-trained image-text detection network is utilized to detect the class-agnostic foreground features of the sample image, so that category information of the foreground features and text features corresponding to the category information may be obtained. Because the pre-trained image-text detection network adopts a double-tower structure, the text features corresponding to the categories and the image features corresponding to the categories can be aligned and output.
According to the embodiment of the disclosure, the detection capability of the deep learning model on the open domain can be realized by utilizing the pre-trained detection network, so that the generalization of the deep learning model is improved.
According to an embodiment of the present disclosure, the above-described operation S440 may include the following operations: based on the attention strategy, obtaining sample fusion characteristics according to the sample image characteristics and the position characteristics of the initial detection frame. And processing the sample fusion characteristics and the text characteristics of the sample categories to obtain the category recognition results of the sample images. And processing the sample fusion characteristics to obtain a position identification result of the sample image.
According to embodiments of the present disclosure, an attention strategy can be used to focus on important information with high weights, ignore unimportant information with low weights, and exchange and propagate important information by sharing it with other information. In embodiments of the present disclosure, the attention strategy allows the sample image features and the position features of the initial detection frame to exchange information with each other, so that category recognition and position recognition of the sample image can be better accomplished. The sample fusion features may include content features of the sample image and position features of the sample image.
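The fusion step can be sketched with a standard attention layer in which the lifted box position features act as queries over the sample image features; this query/key assignment and the dimensions are assumptions, since the disclosure only specifies an attention strategy:

```python
import torch
import torch.nn as nn

d_model, n_boxes, n_tokens = 256, 100, 400
sample_image_features = torch.randn(1, n_tokens, d_model)     # class-agnostic foreground features

box_embed = nn.Linear(4, d_model)                             # lift (cx, cy, w, h) to d_model
initial_box_positions = torch.rand(1, n_boxes, 4)              # position features of the initial boxes
box_queries = box_embed(initial_box_positions)

attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
sample_fusion_features, _ = attention(box_queries,
                                      sample_image_features,
                                      sample_image_features)   # (1, n_boxes, d_model)
```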
Since the pre-trained detection network can align the output image features with the text features corresponding to the categories of those image features, the sample category distribution probability can be obtained according to the content features of the sample image and the text features, so that the category recognition result of the sample image is determined. The model is then trained using the loss between the category recognition result of the sample image and the sample category information, so that the initial model has the capability of detecting new categories.
According to an embodiment of the present disclosure, processing the fusion feature and the text feature of the sample category to obtain a category recognition result of the sample image may include the following operations: obtaining a sample category distribution probability matrix according to the sample image content characteristics and the text characteristics, wherein the sample category distribution probability matrix comprises elements corresponding to the sample image content characteristics, and the element values of the elements represent the matching probability of the sample image characteristics and the sample categories; and obtaining a category identification result of the sample image according to the matching probability.
For example: the sample image content feature may be an Nxd dimensional content feature matrix and the text feature may be a Cxd class feature matrix. N may represent the number of sample detection boxes and C may represent the number of sample categories. The content feature matrix and the category feature matrix can be multiplied to obtain a sample category distribution probability matrix. The sample class distribution probability matrix may include NxC elements, each element corresponding to a sample image content feature and a sample class, the element value of each element representing a probability of matching the sample image content feature to the sample class.
According to the embodiment of the disclosure, the sample category corresponding to the element with the matching probability larger than the matching probability threshold in the sample category distribution probability matrix can be determined as the category recognition result of the sample image by setting the matching probability threshold.
According to the embodiment of the disclosure, the category recognition result of the sample image is obtained according to the image content features of the sample image and the text features of the sample category output by the pre-trained detection network, so that the model's recognition of the target category no longer depends on the fixed categories predefined in the training samples, and the category detection capability of the model can generalize to new categories outside the training sample definitions.
Because the training in the embodiment of the disclosure is performed using the class-agnostic foreground features of the sample image, all foreground candidate frames can be used as true values of the position recognition training task.
According to an embodiment of the present disclosure, processing the fusion feature to obtain a position recognition result of the sample image may include the following operations: and processing the position features of the sample image to obtain a position recognition result of the sample image.
According to the embodiment of the disclosure, the position features of the sample image can be input into the fully connected layer to obtain the position recognition result of the sample image, namely the position recognition information of the sample detection frame. The position recognition information of the sample detection frame may include information of 4 dimensions, for example: the recognized abscissa of the geometric center point of the sample detection frame, the recognized ordinate of the geometric center point of the sample detection frame, the recognized width of the sample detection frame, and the recognized height of the sample detection frame.
According to an embodiment of the present disclosure, obtaining a category recognition loss and a position recognition loss from a category recognition result of a sample image, a position recognition result of the sample image, information of a sample category, and information of a sample detection frame based on an objective loss function may include the following operations: based on the first loss function, obtaining category identification loss according to the category identification result and the information of the sample category; and obtaining the position identification loss based on the second loss function according to the position identification result and the information of the sample detection frame.
According to an embodiment of the present disclosure, the first loss function may be a cross-entropy loss function. The second loss function may be a mean absolute error loss function (L1 loss function) and/or a target localization loss function (GIoU loss function).
For example: based on the cross entropy loss function, the category identification loss can be obtained according to the category identification result of the sample image and the information of the sample category. The position recognition loss can be obtained based on the average absolute value error loss function and the target positioning loss function according to the position recognition result of the sample image and the information of the sample detection frame.
According to an embodiment of the present disclosure, the model loss of the initial model may be a sum of the first loss function and the second loss function. The second loss function may employ a combined function of the mean absolute error loss function and the target positioning loss function, and may configure different weights for the mean absolute error loss function and the target positioning loss function, for example: 1:3. It should be noted that, different weights may be configured for the first loss function and the second loss function, and the configuration weights of the loss functions in the embodiments of the present disclosure may be configured according to actual application requirements, which is not limited herein specifically.
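A sketch of the target loss is given below: cross-entropy for the category recognition loss, and an L1 plus GIoU combination weighted 1:3 (the illustrative ratio above) for the position recognition loss. The (x1, y1, x2, y2) box format expected by torchvision's generalized_box_iou and the function signature here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def target_loss(class_logits, class_targets, pred_boxes, gt_boxes, w_l1=1.0, w_giou=3.0):
    """Returns (category recognition loss, position recognition loss) for matched pairs."""
    category_loss = F.cross_entropy(class_logits, class_targets)
    l1_loss = F.l1_loss(pred_boxes, gt_boxes)
    giou = generalized_box_iou(pred_boxes, gt_boxes).diagonal()   # boxes in (x1, y1, x2, y2)
    giou_loss = (1.0 - giou).mean()
    position_loss = w_l1 * l1_loss + w_giou * giou_loss
    return category_loss, position_loss

# Example with random matched predictions and ground truth.
cls_loss, pos_loss = target_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                                 torch.tensor([[0.1, 0.1, 0.5, 0.5]] * 4),
                                 torch.tensor([[0.2, 0.2, 0.6, 0.6]] * 4))
```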
The sample detection box may be a foreground candidate box of the sample image. In the training process, some noise information needs to be added to the foreground candidate frame so as to improve the position detection precision of the deep learning model.
According to an embodiment of the present disclosure, processing information of a sample detection frame to obtain a position feature of an initial detection frame may include the following operations: determining the scaling of the sample detection frame according to the size information of the sample image; and obtaining the position characteristics of the initial detection frame according to the scaling of the sample detection frame and the information of the sample detection frame.
For example: the size of the sample image may be HxW, where H represents the height of the sample image and W represents the width of the sample image. The scale of the sample detection box may be f. Under the condition that the position of the geometric center point of the sample detection frame is unchanged, changing the height of the sample detection frame to f/H and changing the width of the sample detection frame to f/W to obtain the position characteristics of the initial detection frame.
Fig. 5 schematically illustrates a training method schematic of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 5, in embodiment 500, the initial model 501 may include a feature extraction module 5011, an attention module 5012, and a recognition module 5013. The sample image is input into the feature extraction module 5011 to obtain the sample image features 511.
The sample image features 511 are input into the pre-trained image-text detection network 502 to obtain the sample category 512. The sample category 512 is input into the pre-trained image-text detection network 502 to obtain the text features 513 of the sample category. The sample image features 511 are input into the pre-trained position detection network 503 to obtain the sample detection frame 514. The sample detection frame 514 is processed to obtain the position features 515 of the initial detection frame.
The position features 515 of the initial detection frame, the sample image features 511, and the text features 513 of the sample category are input into the attention module 5012 to obtain the fusion features. The fusion features are input into the recognition module 5013 to obtain a category recognition result 516 of the sample image and a position recognition result 517 of the sample image.
A category recognition loss 518 is obtained from the category recognition result 516 of the sample image and the sample category 512. A position recognition loss 519 is obtained from the position recognition result 517 of the sample image and the sample detection frame 514. The model parameters of the initial model 501 are adjusted based on the category recognition loss 518 and the position recognition loss 519 to obtain a trained deep learning model.
Fig. 6 schematically illustrates a block diagram of an image recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the image recognition apparatus 600 may include a first extraction module 610, an acquisition module 620, and a first recognition module 630.
The first extraction module 610 is configured to extract image features of an image to be identified. In some embodiments, the first extraction module 610 may be configured to perform operation S210 described above.
The obtaining module 620 is configured to obtain a text feature of a target class, where the target class is determined according to an object intention, and the text feature is obtained by processing the target class. In some embodiments, the acquisition module 620 may be configured to perform operation S220 described above.
The first recognition module 630 is configured to process the image features and the text features to obtain a recognition result of the image to be recognized, where the recognition result characterizes the matching probability between the target object of the image to be recognized and the target category, and the location information of the target object. In some embodiments, the first recognition module 630 may be used to perform operation S230 described above.
According to an embodiment of the present disclosure, the image features include image content features and image location features. The first recognition module may include a first processing sub-module, a second processing sub-module, and a first obtaining sub-module. The first processing sub-module is used for processing the image content features and the text features to obtain a category identification result of the image to be identified. The second processing sub-module is used for processing the image position features to obtain a position recognition result of the image to be recognized. The first obtaining sub-module is used for obtaining the identification result according to the category identification result and the position identification result.
According to an embodiment of the present disclosure, the second processing sub-module may include: an extraction unit and a processing unit. And an extraction unit for extracting a target image position feature from the image position features based on the classification result. And the processing unit is used for processing the position characteristics of the target image to obtain a position recognition result of the image to be recognized.
According to an embodiment of the present disclosure, the first processing sub-module may include: a first obtaining unit and a second obtaining unit. The first obtaining unit is used for obtaining a category distribution probability matrix according to the image content characteristics and the text characteristics, wherein the category distribution probability matrix comprises elements corresponding to the image content characteristics, and element values of the elements represent the matching probability of the image content characteristics and the target categories. And the second obtaining unit is used for obtaining a category recognition result of the image to be recognized according to the matching probability.
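As an illustration of how the category distribution probability matrix might be computed from the image content features and the text features (cosine similarity with a softmax over the target categories and the temperature value are assumed choices, not fixed by the disclosure):

```python
# Minimal sketch; the similarity measure and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def category_distribution_matrix(image_content_feats: torch.Tensor,
                                 text_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """image_content_feats: (N, D); text_feats: (C, D) for C target categories.
    Returns an (N, C) category distribution probability matrix whose element (i, j)
    characterizes the matching probability of content feature i and category j."""
    img = F.normalize(image_content_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature
    return logits.softmax(dim=-1)
```

The category identification result of the image to be identified can then be obtained from the matching probabilities, for example by taking the highest-probability target category for each image content feature.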
According to an embodiment of the disclosure, the image recognition device may further include a text recognition module, configured to perform text recognition on the target category, and obtain a text feature of the target category.
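For illustration only, the text features of a target category could be obtained with a publicly available pre-trained text encoder; the encoder, prompt template, and library used below are assumptions, since the disclosure only requires that the text features be obtained by processing the target category:

```python
# Minimal sketch, assuming a CLIP text encoder from Hugging Face transformers; the
# model choice and prompt template are hypothetical, not part of the disclosure.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def target_category_text_features(categories: list[str]) -> torch.Tensor:
    """Encode target category names (determined from the object intention) into text features."""
    prompts = [f"a photo of a {c}" for c in categories]           # hypothetical prompt template
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.text_embeds                                         # (C, D) text features
```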
Fig. 7 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the deep learning model may include: a second extraction module 710, a second processing module 720, a third processing module 730, a second identification module 740, a loss calculation module 750, and an adjustment module 760.
The second extraction module 710 is configured to extract sample image features of the sample image. In some embodiments, the second extraction module 710 may be used to perform operation S410 described above.
The second processing module 720 is configured to perform detection processing on the sample image features by using a pre-trained detection network to obtain information of a sample detection frame and information of a sample category, and to process the information of the sample category to obtain text features of the sample category. In some embodiments, the second processing module 720 may be configured to perform operation S420 described above.
And a third processing module 730, configured to process the information of the sample detection frame to obtain the position feature of the initial detection frame. In some embodiments, the third processing module 730 may be configured to perform operation S430 described above.
The second recognition module 740 is configured to obtain a category recognition result of the sample image and a location recognition result of the sample image according to the sample image feature, the location feature of the initial detection frame, and the text feature of the sample category. In some embodiments, the second identification module 740 may be used to perform operation S440 described above.
The loss calculation module 750 is configured to obtain a category identification loss and a location identification loss according to the category identification result of the sample image, the location identification result of the sample image, the sample category information, and the information of the sample detection frame based on the target loss function. In some embodiments, the penalty calculation module 750 may be used to perform operation S450 described above.
An adjustment module 760 for fixing model parameters of the pre-trained detection network based on the category recognition loss and the location recognition loss, and adjusting model parameters of the initial model to obtain a trained deep learning model. In some embodiments, the adjustment module 760 may be used to perform operation S460 described above.
According to an embodiment of the present disclosure, the second identification module may include: the device comprises a feature fusion sub-module, a third processing sub-module and a fourth processing sub-module. And the feature fusion sub-module is used for obtaining sample fusion features according to the sample image features and the position features of the initial detection frame based on the attention strategy. And the third processing sub-module is used for processing the sample fusion characteristics and the text characteristics of the sample categories to obtain the category recognition results of the sample images. And the fourth processing submodule is used for processing the sample fusion characteristics to obtain a position identification result of the sample image.
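A minimal sketch of one possible attention strategy for the feature fusion sub-module is given below (cross-attention with the position features of the initial detection frame as queries; the dimensions and this particular scheme are illustrative assumptions):

```python
# Minimal sketch of an assumed attention-based fusion of sample image features and
# the position features of the initial detection frame.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)                 # lift (cx, cy, w, h) position features to dim
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, box_pos_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        """box_pos_feats: (B, Q, 4); image_feats: (B, L, dim); returns fused features (B, Q, dim)."""
        queries = self.box_proj(box_pos_feats)
        fused, _ = self.cross_attn(queries, image_feats, image_feats)
        return fused
```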
According to an embodiment of the present disclosure, the fusion feature comprises a sample image content feature, and the third processing sub-module may comprise: a third obtaining unit and a fourth obtaining unit. The third obtaining unit is used for obtaining a sample category distribution probability matrix according to the sample image content characteristics and the text characteristics, wherein the sample category distribution probability matrix comprises elements corresponding to the sample image content characteristics, and element values of the elements represent the matching probability of the sample image characteristics and the sample categories. And the fourth obtaining unit is used for obtaining a category identification result of the sample image according to the matching probability.
According to an embodiment of the present disclosure, the fusion feature comprises a sample image location feature, and the fourth processing sub-module comprises: and the processing unit is used for processing the position characteristics of the sample image to obtain a position identification result of the sample image.
According to an embodiment of the present disclosure, the loss calculation module may include a first loss calculation sub-module and a second loss calculation sub-module. The first loss calculation sub-module is used for obtaining the category identification loss according to the category identification result and the information of the sample category based on the first loss function. The second loss calculation sub-module is used for obtaining the position identification loss according to the position identification result and the information of the sample detection frame based on the second loss function.
According to an embodiment of the present disclosure, the second processing module may include a first detection sub-module, a recognition sub-module, and a second detection sub-module. The first detection sub-module is used for performing category detection processing on the sample image features by using the pre-trained image-text detection network to obtain information of the sample category. The recognition sub-module is used for performing text recognition on the information of the sample category by using the pre-trained image-text detection network to obtain the text features of the sample category. The second detection sub-module is used for performing position detection processing on the sample image features by using the pre-trained position detection network to obtain information of the sample detection frame.
According to an embodiment of the present disclosure, the third processing module may include a determining sub-module and a second obtaining sub-module. The determining sub-module is used for determining the scaling of the sample detection frame according to the size information of the sample image. The second obtaining sub-module is used for obtaining the position features of the initial detection frame according to the scaling of the sample detection frame and the information of the sample detection frame.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 801 performs the respective methods and processes described above, for example, an image recognition method or a training method of a deep learning model. For example, in some embodiments, the image recognition method or the training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image recognition method or the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image recognition method or the training method of the deep learning model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. An image recognition method, comprising:
extracting image characteristics of an image to be identified;
obtaining text characteristics of a target class, wherein the target class is determined according to object intention, and the text characteristics are obtained by processing the target class; and
and processing the image features and the text features to obtain a recognition result of the image to be recognized, wherein the recognition result represents the matching probability of the target object of the image to be recognized and the target category and the position information of the target object.
2. The method of claim 1, wherein the image features include image content features and image location features; the processing the image features and the text features to obtain the recognition result of the image to be recognized comprises the following steps:
processing the image content characteristics and the text characteristics to obtain a category identification result of the image to be identified;
processing the image position characteristics to obtain a position recognition result of the image to be recognized; and
and obtaining the identification result according to the category identification result and the position identification result.
3. The method according to claim 2, wherein the processing the image position feature to obtain the position recognition result of the image to be recognized includes:
extracting target image position features from the image position features based on the classification result; and
and processing the position characteristics of the target image to obtain a position recognition result of the image to be recognized.
4. The method according to claim 2, wherein the processing the image content feature and the text feature to obtain the category identification result of the image to be identified includes:
Obtaining a category distribution probability matrix according to the image content characteristics and the text characteristics, wherein the category distribution probability matrix comprises elements corresponding to the image content characteristics, and element values of the elements represent the matching probability of the image content characteristics and the target categories; and
and obtaining a category identification result of the image to be identified according to the matching probability.
5. The method of claim 1, further comprising:
and carrying out text detection on the target category to obtain the text characteristics of the target category.
6. A training method of a deep learning model, comprising:
extracting sample image features of a sample image;
detecting the characteristics of the sample image by using a pre-trained detection network to obtain information of a sample detection frame and information of a sample category; processing the information of the sample category to obtain text characteristics of the sample category;
processing the information of the sample detection frame to obtain the position characteristics of an initial detection frame;
obtaining a category identification result of the sample image and a position identification result of the sample image according to the sample image characteristics, the position characteristics of the initial detection frame and the text characteristics of the sample category;
Based on a target loss function, obtaining category identification loss and position identification loss according to the category identification result of the sample image, the position identification result of the sample image, the sample category information and the information of the sample detection frame; and
based on the category recognition loss and the position recognition loss, fixing model parameters of the pre-trained detection network, and adjusting model parameters of an initial model to obtain a trained deep learning model.
7. The method of claim 6, wherein the obtaining the category recognition result of the sample image and the location recognition result of the sample image according to the sample image feature, the location feature of the initial detection frame, and the text feature of the sample category comprises:
based on an attention strategy, obtaining a sample fusion characteristic according to the sample image characteristic and the position characteristic of the initial detection frame;
processing the sample fusion characteristics and the text characteristics of the sample categories to obtain category identification results of the sample images; and
and processing the sample fusion characteristics to obtain a position identification result of the sample image.
8. The method of claim 7, wherein the fusion feature comprises a sample image content feature, the processing the fusion feature and the text feature of the sample category to obtain a category recognition result of the sample image comprises:
obtaining a sample category distribution probability matrix according to the sample image content characteristics and the text characteristics, wherein the sample category distribution probability matrix comprises elements corresponding to the sample image content characteristics, and element values of the elements represent the matching probability of the sample image characteristics and the sample categories; and
and obtaining a category identification result of the sample image according to the matching probability.
9. The method of claim 7, wherein the fusion feature comprises a sample image location feature, and the processing the fusion feature to obtain a location identification result of the sample image comprises:
and processing the position features of the sample image to obtain a position recognition result of the sample image.
10. The method of claim 6, wherein the deriving the class identification loss and the position identification loss based on the target loss function based on the class identification result of the sample image, the position identification result of the sample image, the information of the sample class, and the information of the sample detection frame, comprises:
Based on a first loss function, obtaining the category identification loss according to the category identification result and the information of the sample category; and
and based on a second loss function, obtaining the position identification loss according to the position identification result and the information of the sample detection frame.
11. The method of claim 6, wherein the detecting the sample image feature using the pre-trained detection network to obtain information of a sample detection frame and information of a sample class, and processing the information of the sample class to obtain text feature of the sample class, comprises:
performing category detection processing on the sample image characteristics by using a pre-trained image-text detection network to obtain information of the sample categories;
carrying out text recognition on the information of the sample category by utilizing the pre-trained image-text detection network to obtain text characteristics of the sample category;
and performing position detection processing on the sample image features by using a pre-trained position detection network to obtain information of the sample detection frame.
12. The method of claim 6, wherein the processing the sample detection frame information to obtain the location feature of the initial detection frame comprises:
Determining the scaling of the sample detection frame according to the size information of the sample image;
and obtaining the position characteristics of the initial detection frame according to the scaling of the sample detection frame and the information of the sample detection frame.
13. An image recognition apparatus comprising:
the first extraction module is used for extracting image features of the image to be identified;
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring text characteristics of a target class, the target class is determined according to object intention, and the text characteristics are obtained by processing the target class; and
the first recognition module is used for processing the image features and the text features to obtain a recognition result of the image to be recognized, wherein the recognition result represents the matching probability of the target object of the image to be recognized and the target category and the position information of the target object.
14. The apparatus of claim 13, wherein the image features comprise image content features and image location features; the first identification module includes:
the first processing sub-module is used for processing the image content characteristics and the text characteristics to obtain a category identification result of the image to be identified;
The second processing sub-module is used for processing the image position characteristics to obtain a position recognition result of the image to be recognized; and
and the first obtaining submodule is used for obtaining the identification result according to the category identification result and the position identification result.
15. The apparatus of claim 14, wherein the second processing sub-module comprises:
an extracting unit configured to extract a target image position feature from the image position features based on the classification result; and
and the processing unit is used for processing the position characteristics of the target image to obtain a position recognition result of the image to be recognized.
16. The apparatus of claim 14, wherein the first processing submodule comprises:
the first obtaining unit is used for obtaining a category distribution probability matrix according to the image content characteristics and the text characteristics, wherein the category distribution probability matrix comprises elements corresponding to the image content characteristics, and element values of the elements represent the matching probability of the image content characteristics and the target categories; and
and the second obtaining unit is used for obtaining the category identification result of the image to be identified according to the matching probability.
17. The apparatus of claim 13, further comprising:
and the text recognition module is used for recognizing the text of the target category to obtain the text characteristics of the target category.
18. A training device for a deep learning model, comprising:
the second extraction module is used for extracting sample image features of the sample image;
the second processing module is used for detecting the characteristics of the sample image by utilizing a pre-trained detection network to obtain information of a sample detection frame and information of a sample category; processing the information of the sample category to obtain text characteristics of the sample category;
the third processing module is used for processing the information of the sample detection frame to obtain the position characteristics of the initial detection frame;
the second recognition module is used for obtaining a category recognition result of the sample image and a position recognition result of the sample image according to the sample image characteristics, the position characteristics of the initial detection frame and the text characteristics of the sample category;
the loss calculation module is used for obtaining category identification loss and position identification loss according to the category identification result of the sample image, the position identification result of the sample image, the sample category information and the information of the sample detection frame based on the target loss function; and
And the adjustment module is used for fixing the model parameters of the pre-trained detection network based on the category recognition loss and the position recognition loss, and adjusting the model parameters of the initial model to obtain a trained deep learning model.
19. The apparatus of claim 18, wherein the second identification module comprises:
the feature fusion sub-module is used for obtaining sample fusion features according to the sample image features and the position features of the initial detection frame based on an attention strategy;
the third processing sub-module is used for processing the sample fusion characteristics and the text characteristics of the sample categories to obtain category identification results of the sample images; and
and the fourth processing submodule is used for processing the sample fusion characteristics to obtain a position identification result of the sample image.
20. The apparatus of claim 19, wherein the fusion feature comprises a sample image content feature, the third processing sub-module comprising:
a third obtaining unit, configured to obtain a sample category distribution probability matrix according to the sample image content feature and the text feature, where the sample category distribution probability matrix includes an element corresponding to the sample image content feature, and an element value of the element characterizes a matching probability of the sample image feature and the sample category; and
And a fourth obtaining unit, configured to obtain a category identification result of the sample image according to the matching probability.
21. The apparatus of claim 19, wherein the fusion feature comprises a sample image location feature, the fourth processing sub-module comprising:
and the processing unit is used for processing the position characteristics of the sample image to obtain a position identification result of the sample image.
22. The apparatus of claim 18, wherein the loss calculation module comprises:
the first loss calculation sub-module is used for obtaining the category identification loss according to the category identification result and the information of the sample category based on a first loss function; and
and the second loss calculation sub-module is used for obtaining the position identification loss according to the position identification result and the information of the sample detection frame based on a second loss function.
23. The apparatus of claim 18, wherein the second processing module comprises:
the first detection sub-module is used for carrying out category detection processing on the sample image characteristics by utilizing a pre-trained image-text detection network to obtain information of the sample categories;
the recognition sub-module is used for carrying out text recognition on the information of the sample category by utilizing the pre-trained image-text detection network to obtain the text characteristics of the sample category; and
And the second detection sub-module is used for carrying out position detection processing on the sample image characteristics by utilizing a pre-trained position detection network to obtain information of the sample detection frame.
24. The apparatus of claim 18, wherein the third processing module comprises:
the determining submodule is used for determining the scaling of the sample detection frame according to the size information of the sample image; and
and the second obtaining submodule is used for obtaining the position characteristics of the initial detection frame according to the scaling of the sample detection frame and the information of the sample detection frame.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-12.