CN116978060A - Bird identification method and system - Google Patents

Bird identification method and system

Info

Publication number: CN116978060A
Application number: CN202310885421.6A
Authority: CN (China)
Prior art keywords: image, text, description, features, identified
Legal status: Pending (an assumed status, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 马国学, 宫泽军, 朱少林
Current and original assignee: Zhicheng Xinke Beijing Technology Co ltd
Application filed by Zhicheng Xinke Beijing Technology Co ltd, with priority to CN202310885421.6A; published as CN116978060A

Classifications

    • G — Physics; G06 — Computing, calculating or counting; G06V — Image or video recognition or understanding
    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 10/74 — Image or video pattern matching; proximity measures in feature spaces; 10/761 — Proximity, similarity or dissimilarity measures
    • G06V 10/764 — Recognition using classification, e.g. of video objects; 10/765 — using rules for classification or partitioning the feature space
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level; 10/809 — fusion of classification results, e.g. where the classifiers operate on the same input data; 10/811 — the classifiers operating on different input data, e.g. multi-modal recognition

Abstract

The invention provides a bird recognition method and system. Image description texts are generated for the sample images and for the image to be recognized, and species description texts corresponding to the various bird species are also obtained. A plurality of nearest-neighbour sample images of the image to be recognized, together with their corresponding species description texts, are then determined from the similarity between the image features of the sample images and those of the image to be recognized, and from the similarity between the text features of their respective image description texts. Based on the similarity between the text features of the species description texts of the various bird species, the nearest-neighbour species description texts of those corresponding to the nearest-neighbour sample images are determined. Finally, the bird species in the image to be recognized is determined from the degree of association between the image description text of the image to be recognized and the nearest-neighbour species description texts, which improves bird recognition accuracy, including in zero-shot scenarios.

Description

Bird identification method and system
Technical Field
The invention relates to the technical field of image recognition, in particular to a bird recognition method and system.
Background
With the rapid development of deep learning, the accuracy of image recognition on specific datasets has improved continuously, and image recognition is now applied in many real-world scenarios; bird recognition is one of the most important of these. Image recognition tasks divide into coarse-grained and fine-grained recognition. Applied to species recognition, coarse-grained recognition distinguishes across species, for example cats from dogs, while fine-grained recognition distinguishes finer classes within one group, such as different bird species.
Currently, bird recognition tasks typically train an image recognition model in a supervised manner, so that the model learns the visual characteristics of particular bird species in a dataset and can then identify those species in bird images. However, sample bird images are difficult to collect, so the number of training samples is small, the supervision they provide is correspondingly insufficient, and the recognition rate is poor when the fine-grained classes are highly similar. In addition, real images of some bird species (e.g., endangered birds) are hard to collect at all, creating a zero-shot problem in the bird identification task (i.e., the test set contains images of bird species that never appear in the training set).
Disclosure of Invention
The invention provides a bird identification method and system to address the poor recognition rate of the prior art.
The invention provides a bird identification method, which comprises the following steps:
extracting image features of each sample image in the training set and image features of the images to be identified;
generating an image description text of each sample image in the training set and an image description text of the image to be identified;
extracting text features of the species description text corresponding to each bird species, and text features of the image description texts of the sample images and of the image to be identified;
determining a plurality of nearest neighbor sample images of the image to be identified and category description texts corresponding to the nearest neighbor sample images based on similarity between image features of the sample images and image features of the image to be identified and similarity between text features of image description texts of the sample images and text features of image description texts of the image to be identified;
determining a plurality of nearest neighbor type description texts of the type description texts corresponding to the nearest neighbor sample images based on the similarity among text features of the type description texts corresponding to the bird types;
and determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest-neighbour species description texts.
According to the bird recognition method provided by the invention, the determining of the bird type in the image to be recognized based on the association degree between the image description text of the image to be recognized and the nearest neighbor type description text specifically comprises the following steps:
determining the matching degree of any nearest-neighbour species description text based on the degree of association between the text features of the image description text of the image to be identified and the text features of that species description text, and on the degree of association between the image features of the image to be identified and the text features of that species description text;
and determining a category label corresponding to the nearest neighbor category description text with the highest matching degree as the category of birds in the image to be identified.
According to the bird recognition method provided by the invention, the association degree between the image characteristics of the image to be recognized and the text characteristics of any nearest type descriptive text is determined based on the product among the image characteristics of the image to be recognized, the association matrix and the text characteristics of any nearest type descriptive text;
The association matrix is learned based on the following steps:
determining the current association value of any sample image and any species description text based on the product of the image features of the sample image, the current value of the association matrix, and the text features of the species description text, together with the actual correspondence between that sample image and that species description text; and accumulating the current association values of the sample image with every species description text to obtain the association loss of that sample image;
and adjusting the value of the correlation matrix to reduce the correlation loss of each sample image until the preset target condition is reached.
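The association-matrix learning loop above can be sketched as follows. This is a hypothetical minimal implementation: the association value is taken as the bilinear product x^T M t as the claim describes, but the hinge-style margin standing in for the association loss, the learning rate, and the epoch count are assumptions, not details from the patent.

```python
# Hypothetical sketch of learning the association matrix M described above:
# the association value is the bilinear product x^T M t, and M is adjusted
# until matching image/text pairs score high and mismatched pairs score low.
# The hinge-style loss and learning rate are assumptions, not from the patent.

def bilinear_score(x, M, t):
    """Association value of image feature x and text feature t under matrix M."""
    return sum(x[i] * M[i][j] * t[j]
               for i in range(len(x)) for j in range(len(t)))

def learn_association_matrix(pairs, dim_x, dim_t, lr=0.1, epochs=100):
    """pairs: (image_feat, text_feat, y) with y = +1 for a true match, -1 otherwise."""
    M = [[0.0] * dim_t for _ in range(dim_x)]
    for _ in range(epochs):
        for x, t, y in pairs:
            if y * bilinear_score(x, M, t) < 1.0:  # margin not yet satisfied
                for i in range(dim_x):
                    for j in range(dim_t):
                        M[i][j] += lr * y * x[i] * t[j]  # reduce the association loss
    return M
```

On a toy pair of one match and one mismatch, the learned matrix separates the two scores, which is the "adjust until the target condition is reached" behaviour the claim describes.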
According to the bird recognition method provided by the invention, any text feature of the description text is extracted based on the following steps:
clustering the species description texts corresponding to the bird species to obtain a plurality of text clusters, and determining cluster vectors of the text clusters based on a word bag model;
extracting a sentence vector of each clause in any kind of description text, and determining the text characteristics of any kind of description text based on the sentence vector of each clause in any kind of description text and the cluster vector of the text class cluster to which any kind of description text belongs.
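As an illustration of the clustering step above, a cluster vector under the bag-of-words model might simply average the word-count vectors of the species description texts assigned to one cluster. The vocabulary and the plain averaging are assumptions made for this sketch; the patent does not specify them.

```python
# Illustrative bag-of-words cluster vector: average the word-count vectors of
# all species description texts assigned to one text cluster. The vocabulary
# and the plain averaging are assumptions for the sketch.
from collections import Counter

def cluster_vector(cluster_texts, vocab):
    counts = [Counter(text.lower().split()) for text in cluster_texts]
    return [sum(c[word] for c in counts) / len(cluster_texts) for word in vocab]
```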
According to the bird recognition method provided by the invention, the text characteristics of any kind of descriptive text are determined based on the sentence vector of each clause in the any kind of descriptive text and the cluster vector of the text cluster to which the any kind of descriptive text belongs, and the method specifically comprises the following steps:
extracting sentence vectors of image title texts of a plurality of bird images;
determining the average similarity between the sentence vector of any clause in the species description text and the sentence vectors of the image title texts of the bird images, as the visual score of that clause;
weighting the sentence vectors of the clauses by their visual scores to obtain the visual vector of the species description text;
and fusing the visual vector of the species description text with the cluster vector of the text cluster to which it belongs, to obtain the text features of the species description text.
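A minimal sketch of the clause-weighting scheme above: each clause of a species description is scored by its mean cosine similarity to the image title sentences, the clause vectors are averaged with those scores as weights, and the result is fused with the cluster vector. The toy vectors, the cosine metric, and concatenation as the fusion step are illustrative assumptions.

```python
# Hedged sketch of the visual-score weighting above. Clause sentence vectors,
# image-title sentence vectors, and concatenation as "fusion" are assumptions.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def species_text_feature(clause_vecs, title_vecs, cluster_vec):
    # visual score of each clause: mean similarity to the image title sentences
    scores = [sum(cosine(c, t) for t in title_vecs) / len(title_vecs)
              for c in clause_vecs]
    total = sum(scores) or 1.0
    # visual vector: score-weighted average of the clause sentence vectors
    visual = [sum(s * c[k] for s, c in zip(scores, clause_vecs)) / total
              for k in range(len(clause_vecs[0]))]
    return visual + list(cluster_vec)
```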
According to the bird recognition method provided by the invention, the method for extracting the image characteristics of each sample image and the image characteristics of the image to be recognized in the training set specifically comprises the following steps:
carrying out successive feature extraction on any image using a plurality of feature extraction layers of different sizes, to obtain a feature map for each layer; here "any image" refers to either a sample image or the image to be identified;
Classifying each feature point in the feature map corresponding to the feature extraction layer with any size based on the classification layer to obtain the probability that each feature point in the feature map corresponding to the feature extraction layer with any size belongs to each bird species;
screening out distinguishing characteristic points in the characteristic images corresponding to the characteristic extraction layers with any size based on the probability that each characteristic point in the characteristic images corresponding to the characteristic extraction layers with any size belongs to various bird species;
and fusing the distinguishing feature points in the feature graphs corresponding to the feature extraction layers with different sizes to obtain the image features of any image.
According to the bird recognition method provided by the invention, the distinguishing feature points in the feature graphs corresponding to the feature extraction layers with different sizes are fused to obtain the image features of any image, and the method specifically comprises the following steps:
splicing the distinguishing feature points in the feature graphs corresponding to the feature extraction layers with different sizes to obtain spliced feature points;
and recombining the spliced characteristic points based on the full connection layer to obtain the image characteristics of any image.
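The multi-scale selection steps above can be sketched as follows: at each feature-extraction scale, only feature points whose highest species probability is large are kept as discriminative, and the survivors from all scales are spliced together. The probability threshold is an assumption, and a plain list concatenation stands in for the fully connected recombination layer.

```python
# Hedged sketch of discriminative feature-point screening and splicing.
# Threshold value is an assumption; splicing stands in for the FC layer.

def discriminative_points(feature_points, class_probs, threshold=0.5):
    """Keep feature points whose most likely bird-species probability is high."""
    return [f for f, p in zip(feature_points, class_probs) if max(p) > threshold]

def fuse_scales(per_scale, threshold=0.5):
    """per_scale: list of (feature_points, class_probs), one entry per scale."""
    spliced = []
    for points, probs in per_scale:
        spliced.extend(discriminative_points(points, probs, threshold))
    return spliced  # a real model would recombine this via a fully connected layer
```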
According to the bird recognition method provided by the invention, the image description text of any image is generated based on the following steps:
Fusing the feature images corresponding to the feature extraction layers with different sizes to obtain a fused feature image of any image;
detecting the target positions of birds on the fusion feature map of any image based on a target detection model to obtain target detection frames of a plurality of target positions in any image;
strengthening the feature values of the detection-frame regions corresponding to each target position in the fused feature map of the image, based on the target detection frames of the plurality of target positions in the image, to obtain the region-reinforced features of the image;
and generating an image description text of any image by utilizing the region reinforcing features of any image based on the image description generation model.
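A minimal sketch of the region-reinforcement step above: feature values that fall inside a detected target box are amplified before the description generation model runs. The 2x gain and the inclusive box coordinates are assumptions for illustration only.

```python
# Hedged sketch of region reinforcement: amplify feature values inside the
# detection frames. The gain factor and box convention are assumptions.

def reinforce_regions(feature_map, boxes, gain=2.0):
    """feature_map: 2-D grid of floats; boxes: (row0, col0, row1, col1), inclusive."""
    out = [row[:] for row in feature_map]
    for r0, c0, r1, c1 in boxes:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                out[r][c] *= gain  # strengthen features inside the detection frame
    return out
```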
According to the bird recognition method provided by the invention, the image description generation model is trained based on the following steps:
acquiring a fusion feature map describing a training image, and detecting the target positions of birds on the fusion feature map describing the training image based on a target detection model to obtain target detection frames of a plurality of target positions in the describing training image;
based on target detection frames of a plurality of target positions in the descriptive training image, enhancing characteristic values of target detection frame areas corresponding to the target positions in a fusion characteristic image of the descriptive training image to obtain area enhancement characteristics of the descriptive training image;
Generating an image description text describing the training image by utilizing the region reinforcing features of the training image based on the image description generation model;
calculating a description matching loss based on the degree of alignment between the image description text of the description-training image and the target detection frames of the plurality of target positions in that image, and on the similarity between the generated image description text and the sample description text of the description-training image;
and adjusting parameters of the image description generation model based on the description matching loss.
The present invention also provides a bird recognition system comprising:
the image feature extraction unit is used for extracting the image features of each sample image in the training set and the image features of the images to be identified;
the descriptive text generation unit is used for generating image descriptive texts of all sample images in the training set and the image descriptive texts of the images to be identified;
a text feature extraction unit, configured to extract text features of category description texts corresponding to bird categories and text features of image description texts of the sample image and the image to be identified;
a first nearest neighbor determining unit, configured to determine a plurality of nearest-neighbour sample images of the image to be identified and the species description texts corresponding to those sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified, and on the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified;
A second nearest neighbor determining unit, configured to determine a plurality of nearest neighbor type description texts of the type description texts corresponding to the nearest neighbor sample images based on similarity between text features of the type description texts corresponding to the bird types;
the type determining unit is used for determining the type of birds in the image to be identified based on the association degree between the image description text of the image to be identified and the nearest type description text.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a bird identification method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements a bird identification method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of bird identification as described in any of the above.
According to the bird recognition method and system, generating image description texts for the sample images and the image to be recognized supplies additional visual features, compensating for the low-level visual features lost during image feature extraction; this helps improve subsequent recognition accuracy when finely subdivided bird species look very similar. Obtaining the species description text for each bird species, moreover, provides richer species information and, crucially, allows the characteristics of a species, especially its visual characteristics, to be learned from its description text even when no images of that species have yet been collected in the training set. The nearest-neighbour sample images of the image to be recognized and their corresponding species description texts are determined from the similarity between image features and from the similarity between the text features of the image description texts; the nearest-neighbour species description texts are then determined from the similarity between the text features of the species description texts; and the bird species is finally determined from the degree of association between the image description text of the image to be recognized and the nearest-neighbour species description texts. Bird recognition accuracy, including in zero-shot scenarios, is thereby improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a bird identification method provided by the invention;
FIG. 2 is a schematic flow chart of the bird species determination method provided by the invention;
FIG. 3 is a schematic flow chart of a text feature extraction method provided by the invention;
FIG. 4 is a schematic flow chart of an image feature extraction method provided by the invention;
FIG. 5 is a schematic diagram of a bird identification system provided by the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a schematic flow chart of a bird recognition method provided by the invention, as shown in fig. 1, the method comprises:
step 110, extracting image features of each sample image in the training set and image features of the image to be identified;
step 120, generating an image description text of each sample image in the training set and an image description text of the image to be identified;
step 130, extracting text features of a variety description text corresponding to each bird variety and text features of image description texts of the sample image and the image to be identified;
step 140, determining a plurality of nearest neighbor sample images of the image to be identified and category description texts corresponding to the nearest neighbor sample images based on similarity between image features of the sample images and image features of the image to be identified and similarity between text features of image description texts of the sample images and text features of image description texts of the image to be identified;
step 150, determining a plurality of nearest neighbor type description texts of the type description texts corresponding to the nearest neighbor sample images based on the similarity among text features of the type description texts corresponding to the bird types;
Step 160, determining the bird type in the image to be identified based on the association degree between the image description text of the image to be identified and the nearest neighbor type description text.
Specifically, a training set containing sample images of various bird species and images to be identified which need to be identified of the bird species are obtained, and image feature extraction is carried out on the images to obtain image features of each sample image and image features of the images to be identified. The extracted image features have the image semantic information of the corresponding images. Considering that more visual features are lost in the image features extracted by the multi-layer convolution layer, the differences of the visual features among birds of subdivision types are finer, and the accuracy of bird identification is reduced due to the lack of the visual features, so that the image description text of each sample image and the image description text of the image to be identified in the training set can be generated in order to make up for the lack of the visual features. The image description generation model can be utilized to carry out image understanding on the sample image and the image to be identified, and corresponding image description text is generated. Because the image description generation model carries out image understanding on the input image and visual information of the corresponding image is more intuitively expressed in the generated image description text, visual characteristics lost in image characteristic extraction can be complemented by utilizing an image description generation mode. In addition, the species descriptive text corresponding to each bird species can be obtained from a knowledge base (such as wikipedia, birds encyclopedia, etc.) to provide more and richer bird species information, and importantly, even if some specific species of bird images are not collected temporarily in the training set, the characteristics of the species, especially the visual characteristics thereof, can be learned by using the species descriptive text corresponding to the species. 
Thus, for the above-described category description text and image description text, text features thereof may be extracted in the same or similar manner as complements of image features.
And then, determining a plurality of nearest neighbor sample images of the image to be identified and category description texts corresponding to the nearest neighbor sample images based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description texts of the image to be identified. By combining the similarity between the image features and the similarity between the text features of the image description text, a plurality of sample images which are visually similar to the image to be identified, particularly highly similar to the visual features of the target (namely birds), can be screened out from the training set to serve as nearest neighbor sample images of the image to be identified, and meanwhile, the corresponding category description text of the nearest neighbor sample images can be obtained according to category marks of the nearest neighbor sample images. Here, weights may be set for the image feature similarity and the text feature similarity of the image description text, and for each sample image, the similarity between the image feature of the sample image and the image feature of the image to be identified and the similarity between the text feature of the image description text of the sample image and the text feature of the image description text of the image to be identified are weighted, so as to obtain the overall similarity between the sample image and the image to be identified, so that the sample image with the overall similarity higher than the first similarity threshold value is screened out as the nearest neighbor sample image based on the overall similarity between each sample image and the image to be identified.
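The weighted screening just described can be sketched as follows: each sample image receives an overall similarity that mixes image-feature similarity with the text similarity of its image description, and samples above the first threshold become nearest neighbours. The cosine metric, the weights, and the threshold value are illustrative assumptions; the patent specifies only that the two similarities are weighted and thresholded.

```python
# Hedged sketch of nearest-neighbour screening by weighted overall similarity.
# Metric, weights, and threshold are assumptions, not values from the patent.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def nearest_neighbour_samples(query_img, query_txt, samples,
                              w_img=0.6, w_txt=0.4, threshold=0.8):
    """samples: (species_label, image_feature, description_text_feature)."""
    hits = []
    for label, img_feat, txt_feat in samples:
        overall = (w_img * cosine(query_img, img_feat)
                   + w_txt * cosine(query_txt, txt_feat))
        if overall > threshold:  # first similarity threshold
            hits.append((label, overall))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```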
To reduce the recognition error rate as far as possible, note that some bird photographs suffer from awkward shooting angles and environmental interference, whereas the species description text describes birds of the corresponding species more reliably; moreover, in the zero-shot case the bird in the image to be recognized may never appear in the sample images, a gap the species description text can fill. Therefore, a plurality of nearest-neighbour species description texts of the species description texts corresponding to the nearest-neighbour sample images can be determined from the similarity between the text features of the species description texts of the various bird species: a nearest-neighbour species description text is one whose text features have a similarity above a second threshold to the text features of a species description text corresponding to a nearest-neighbour sample image. The resulting set contains species description texts of bird species that are visually similar to the bird in the image to be recognized. Since the bird in the image to be recognized may not have appeared in the training set, its species is determined from the degree of association between its image description text and the nearest-neighbour species description texts.
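This second nearest-neighbour step can be sketched as expanding the species labelled by the visual neighbours with every species whose description-text feature is similar above a second threshold. The cosine metric and the threshold value are illustrative assumptions.

```python
# Hedged sketch of the second nearest-neighbour expansion over species
# description texts. Metric and second threshold are assumptions.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def neighbour_species_texts(neighbour_labels, species_text_feats, threshold=0.9):
    """species_text_feats: dict species label -> description-text feature."""
    expanded = set(neighbour_labels)
    for label in list(neighbour_labels):
        for other, feat in species_text_feats.items():
            if other not in expanded and cosine(species_text_feats[label], feat) > threshold:
                expanded.add(other)  # above the second similarity threshold
    return expanded
```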
Here, the higher the degree of association between the image description text of the image to be recognized and any nearest neighbor type description text, the closer the visual features of birds in the image to be recognized are to the visual features of the bird type described by the nearest neighbor type description text. Therefore, the bird species corresponding to the nearest neighbor species description text with the highest association degree of the image description text of the image to be identified can be selected as the species of birds in the image to be identified.
Thus, with the method provided by this embodiment of the invention, generating image description texts for the sample images and the image to be identified supplies additional visual features, compensating for the low-level visual features lost during image feature extraction, which helps improve subsequent recognition accuracy when finely subdivided bird species look very similar. Obtaining the species description text for each bird species provides richer species information and, crucially, allows the characteristics of a species, especially its visual characteristics, to be learned from its description text even when no images of that species have yet been collected in the training set. The nearest-neighbour sample images and their species description texts are determined from the similarity between image features and from the similarity between the text features of the image description texts; the nearest-neighbour species description texts are determined from the similarity between the text features of the species description texts; and the bird species is determined from the degree of association between the image description text of the image to be identified and the nearest-neighbour species description texts, so that bird recognition accuracy, including in zero-shot scenarios, is improved.
Based on the above embodiment, as shown in fig. 2, determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts specifically includes:
step 210, determining the matching degree of any nearest neighbor species description text based on the degree of association between the text features of the image description text of the image to be identified and the text features of that nearest neighbor species description text, and the degree of association between the image features of the image to be identified and the text features of that nearest neighbor species description text;
step 220, determining the species label corresponding to the nearest neighbor species description text with the highest matching degree as the species of the bird in the image to be identified.
Specifically, since the image features contain rich image semantic information that helps distinguish similar bird species, the matching degree of any nearest neighbor species description text can be determined by combining the degree of association between the text features of the image description text of the image to be identified and the text features of that nearest neighbor species description text with the degree of association between the image features of the image to be identified and the text features of that nearest neighbor species description text. The matching degree of a nearest neighbor species description text characterizes how visually similar the bird species it describes is to the bird in the image to be identified. The species label corresponding to the nearest neighbor species description text with the highest matching degree is then determined as the species of the bird in the image to be identified.
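As an illustration only, the combination of the two degrees of association into a matching degree and the selection of the best-matching label might be sketched as follows; the weighted-sum scheme and the `alpha` hyperparameter are assumptions, not specified by the embodiment:

```python
# Hypothetical sketch: combine the text-text and image-text association
# degrees into one matching degree, then pick the best-matching species label.
def matching_degree(text_text_assoc, image_text_assoc, alpha=0.5):
    """Weighted combination; alpha is an assumed hyperparameter."""
    return alpha * text_text_assoc + (1 - alpha) * image_text_assoc

def predict_species(candidates):
    """candidates: list of (label, text_text_assoc, image_text_assoc)."""
    return max(candidates, key=lambda c: matching_degree(c[1], c[2]))[0]

# toy association degrees for two nearest neighbor species description texts
candidates = [("sparrow", 0.9, 0.7), ("finch", 0.6, 0.8)]
print(predict_species(candidates))  # "sparrow"
```

Any monotone combination of the two association degrees would fit the embodiment equally well; the weighted sum is simply the most common choice.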
Based on any of the above embodiments, the degree of association between the image features of the image to be identified and the text features of any nearest neighbor species description text is determined based on the product of the image features of the image to be identified, an association matrix, and the text features of that nearest neighbor species description text;
the association matrix is learned based on the following steps:
determining a current association value between any sample image and any species description text based on the product of the image features of the sample image, the current value of the association matrix, and the text features of the species description text, together with the actual correspondence between the sample image and the species description text, and accumulating the current association values of the sample image with each species description text to obtain the association loss of the sample image;
and adjusting the values of the association matrix to reduce the association loss of each sample image until a preset target condition is reached.
Specifically, the image features of the image to be identified are vectors in the image modality space, while the text features of a nearest neighbor species description text are vectors in the text modality space, so there is a semantic gap between the two and it is difficult to compute their degree of association directly (e.g., by cosine similarity). To overcome this gap, in the embodiment of the present invention an association matrix is constructed: the degree of association between the image features of the image to be identified and the text features of any nearest neighbor species description text is determined based on the product of the image features, the association matrix, and the text features.
The association loss of a sample image can be obtained as follows. For any sample image and any species description text, a current association value is determined from the product of the image features of the sample image, the current value of the association matrix, and the text features of the species description text (denoting the image features by I, the association matrix by W, and the text features by T), together with their actual correspondence (i.e., whether the bird species in the sample image is the species described by the species description text): if they do not correspond, the current association value is 0; otherwise it is I×W×T. The current association values of the sample image with each species description text are then accumulated to obtain the association loss of the sample image. The values of the association matrix are adjusted to reduce the association loss of each sample image until a preset target condition is reached, for example until the sum of the association losses of all sample images reaches a minimum or falls below a preset loss value. The values of the association matrix can be adjusted by machine learning, dynamic programming, or similar means.
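A minimal numerical sketch of the bilinear association degree I×W×T follows. The update rule (raising the association of the matched image–text pair while lowering that of the unmatched pair) is an assumed contrastive reading of the learning procedure; the embodiment itself only specifies that the association matrix is adjusted until a preset target condition is reached:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt = 4, 3
W = rng.normal(scale=0.1, size=(d_img, d_txt))  # association matrix W

def association(I, W, T):
    """Degree of association as the product I x W x T (a bilinear form)."""
    return float(I @ W @ T)

# toy image/text features: one matched pair and one unmatched pair
I1, T1 = rng.normal(size=d_img), rng.normal(size=d_txt)  # same species
I2, T2 = rng.normal(size=d_img), rng.normal(size=d_txt)  # different species

lr = 0.05
for _ in range(200):
    # the gradient of I.W.T with respect to W is the outer product of I and T;
    # push the matched association up and the unmatched association down
    W += lr * (np.outer(I1, T1) - np.outer(I2, T2))

print(association(I1, W, T1), association(I2, W, T2))
```

After training, the matched pair should score higher than the unmatched one; in practice W would be learned jointly over all sample images and species description texts rather than a single pair.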
Based on any of the above embodiments, as shown in fig. 3, the text features of any species description text are extracted based on the following steps:
step 310, clustering the species description texts corresponding to the bird species to obtain a plurality of text clusters, and determining a cluster vector for each text cluster based on a bag-of-words model;
step 320, extracting a sentence vector for each clause in the species description text, and determining the text features of the species description text based on the sentence vectors of its clauses and the cluster vector of the text cluster to which it belongs.
Specifically, since the similarity between the description texts of different species with similar appearance mostly lies in their appearance descriptions, the species description texts corresponding to the bird species can be clustered into a plurality of text clusters, and a cluster vector containing the semantics of the shared appearance description can be extracted for each cluster based on a bag-of-words model, providing additional semantic information for the species description texts. A sentence vector is then extracted for each clause in a species description text, and the text features of that text are determined from the sentence vectors of its clauses and the cluster vector of its text cluster. By integrating the two, the semantics of the shared appearance description contained in the cluster vector supply more visual semantic information to the sentence vectors of the species description text, improving the accuracy of the degree of association between the species description text and the image description text or the image features of the image to be identified.
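The clustering and cluster-vector steps might be sketched as below. The species names, description strings, single-link threshold, and mean-vector choice are all hypothetical; the embodiment only requires some clustering of the species description texts and a bag-of-words cluster vector:

```python
from collections import Counter
import math

def bow(text, vocab):
    """Bag-of-words vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical species description texts
descs = {
    "species_a": "small bird with red breast and short beak",
    "species_b": "small bird with red chest and short beak",
    "species_c": "large waterbird with long neck and white plumage",
}
vocab = sorted({w for d in descs.values() for w in d.split()})
vecs = {name: bow(d, vocab) for name, d in descs.items()}

# naive single-link clustering by cosine similarity (assumed threshold 0.5)
clusters = []
for name, v in vecs.items():
    for cl in clusters:
        if any(cosine(v, vecs[m]) > 0.5 for m in cl):
            cl.append(name)
            break
    else:
        clusters.append([name])

# cluster vector = mean bag-of-words vector of the cluster's members
cluster_vecs = [
    [sum(col) / len(cl) for col in zip(*(vecs[m] for m in cl))]
    for cl in clusters
]
print(clusters)
```

Here the two descriptions that share most appearance terms fall into one cluster, and the waterbird description forms its own, so each cluster vector concentrates the shared appearance vocabulary.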
Based on any of the above embodiments, determining the text features of any species description text based on the sentence vectors of its clauses and the cluster vector of the text cluster to which it belongs specifically includes:
extracting sentence vectors of the image caption texts of a plurality of bird images;
determining the average similarity between the sentence vector of any clause in the species description text and the sentence vectors of the image caption texts of the bird images as the visual score of that clause;
weighting the sentence vector of each clause in the species description text by its visual score to obtain the visual vector of the species description text;
and fusing the visual vector of the species description text with the cluster vector of the text cluster to which it belongs to obtain its text features.
Specifically, image caption texts of a plurality of bird images can be obtained from the internet and their sentence vectors extracted; each image caption text contains an appearance description of the bird in the corresponding image. The average similarity between the sentence vector of any clause in a species description text and the sentence vectors of these image caption texts is taken as the visual score of that clause. The higher the similarity between a clause's sentence vector and the caption sentence vectors, the more appearance-describing terms the clause contains, and thus the higher its visual score. The sentence vectors of the clauses, weighted by their visual scores, are then combined into the visual vector of the species description text, strengthening the share of semantic information corresponding to its appearance descriptions; this visual vector is fused with the cluster vector of the text cluster to which the species description text belongs to obtain its text features.
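A small sketch of the visual-score weighting, using made-up two-dimensional sentence embeddings (in practice these would come from a sentence encoder); concatenation as the fusion step is an assumption, since the embodiment does not fix the fusion operator:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical pre-computed sentence embeddings
caption_vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]  # image captions
clause_vecs = [np.array([0.8, 0.2]),                         # appearance clause
               np.array([0.0, 1.0])]                         # habitat clause

# visual score: mean similarity of a clause to all caption sentence vectors
scores = [np.mean([cos(c, cap) for cap in caption_vecs]) for c in clause_vecs]

# visual vector: visual-score-weighted sum of the clause sentence vectors
visual_vec = sum(s * c for s, c in zip(scores, clause_vecs))

# fuse with the cluster vector (simple concatenation, an assumed choice)
cluster_vec = np.array([0.5, 0.5])
text_feature = np.concatenate([visual_vec, cluster_vec])
print(scores)
```

The appearance clause, being closer to the caption embeddings, receives the higher visual score and therefore dominates the visual vector, which is the intended effect of the weighting.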
Based on any of the above embodiments, as shown in fig. 4, extracting the image features of each sample image in the training set and the image features of the image to be identified specifically includes:
step 410, performing successive feature extraction on any image based on a plurality of feature extraction layers of different sizes to obtain the feature maps corresponding to the feature extraction layers of different sizes; here, any image is a sample image or the image to be identified;
step 420, classifying each feature point in the feature map corresponding to any feature extraction layer based on a classification layer to obtain the probability that each feature point in that feature map belongs to each bird species;
step 430, screening out the distinguishing feature points in the feature map corresponding to any feature extraction layer based on the probability that each of its feature points belongs to each bird species;
step 440, fusing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain the image features of the image.
Specifically, since the appearance similarity between finely subdivided bird species is high, the image features extracted from an image contain a large amount of similar image semantics, making it difficult to distinguish the subdivided species from the image features alone. To address this, on the one hand, the image description texts and species description texts are introduced to provide more visual features, as described in the above embodiments; on the other hand, the image features can be processed to screen out the feature points of the more distinguishable regions, improving the accuracy of bird identification.
Concretely, successive feature extraction is performed on any image (a sample image or the image to be identified) based on a plurality of feature extraction layers of different sizes, yielding a feature map for each layer. Each feature point in each feature map is classified by a pre-trained classification layer (e.g., a fully connected layer) to obtain the probability that it belongs to each bird species. The closer the probabilities of a feature point are to the extremes (i.e., 0 or 1), the more distinguishing that feature point is. The distinguishing feature points of each feature map, i.e., the feature points whose species probabilities are closest to the extremes, can therefore be screened out based on these probabilities. The distinguishing feature points of the feature maps of the different-sized feature extraction layers are then fused to obtain the image features of the image.
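The screening of distinguishing feature points might be sketched as below; measuring "closeness to an extremum" by the maximum class probability, and keeping a fixed top-k per scale, are both assumed simplifications:

```python
# distinctiveness: how close a feature point's class probabilities are to an
# extremum (0 or 1); here measured by the maximum class probability
def distinctiveness(probs):
    return max(probs)

def screen(points, k):
    """points: list of (feature_vector, class_probs); keep the k most
    distinguishing feature points of one scale's feature map."""
    ranked = sorted(points, key=lambda p: distinctiveness(p[1]), reverse=True)
    return [feat for feat, _ in ranked[:k]]

scale1 = [([0.1, 0.2], [0.95, 0.03, 0.02]),   # confident -> distinguishing
          ([0.3, 0.4], [0.34, 0.33, 0.33])]   # ambiguous -> discarded
scale2 = [([0.5, 0.6], [0.05, 0.90, 0.05]),
          ([0.7, 0.8], [0.40, 0.30, 0.30])]

# fuse: collect the distinguishing points kept at each scale
fused = screen(scale1, 1) + screen(scale2, 1)
print(fused)  # [[0.1, 0.2], [0.5, 0.6]]
```

A feature point whose probability mass concentrates on one species is kept, while a point whose probabilities are near-uniform is dropped as background-like.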
Based on any of the above embodiments, fusing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain the image features of any image specifically includes:
splicing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain spliced feature points;
and recombining the spliced feature points based on a fully connected layer to obtain the image features of the image.
Specifically, when fusing the distinguishing feature points of the different-sized feature maps, the points can first be spliced together to obtain the spliced feature points, which are then recombined by one or more fully connected layers to obtain the image features of the image.
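The splice-then-recombine step amounts to concatenation followed by an affine map; the dimensions and the random (untrained) weights below are placeholders for what would be learned end to end:

```python
import numpy as np

rng = np.random.default_rng(1)

# distinguishing feature points kept at three scales (hypothetical dims)
points = [rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)]

# step 1: splice (concatenate) the points into one vector
spliced = np.concatenate(points)        # shape (12,)

# step 2: recombine with a fully connected layer (weights random here;
# in a real system they would be trained with the rest of the network)
W_fc = rng.normal(size=(8, spliced.size))
b_fc = rng.normal(size=8)
image_feature = W_fc @ spliced + b_fc   # shape (8,)
print(image_feature.shape)
```

Stacking a second fully connected layer, as the embodiment permits, would simply repeat step 2 on `image_feature`.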
Based on any of the above embodiments, the image description text of any image is generated based on the following steps:
fusing the feature maps corresponding to the feature extraction layers of different sizes to obtain a fused feature map of the image;
detecting the target parts of the bird on the fused feature map based on a target detection model to obtain target detection boxes for a plurality of target parts in the image;
enhancing, based on the target detection boxes of the plurality of target parts, the feature values of the regions of the fused feature map corresponding to the target parts to obtain the region-enhanced features of the image;
and generating the image description text of the image from its region-enhanced features based on an image description generation model.
Specifically, in order to improve the image description generation model's perception of the image and make its description of the bird more accurate and richer, the visual description can be focused on each part of the bird in the image, removing interference from other objects (such as the background). Concretely, the feature maps corresponding to the feature extraction layers of different sizes can be fused to obtain the fused feature map of the image, and the target parts of the bird (such as the back, the beak, etc.) detected on it by a target detection model to obtain target detection boxes for a plurality of target parts. Based on these boxes, the feature values of the feature points in the corresponding regions of the fused feature map can be enhanced, yielding the region-enhanced features of the image; for example, after the fused feature map is resized to the original image size, the feature sub-map corresponding to each target detection box region is cut out of the resized map and the feature values of its feature points are increased. The image description text of the image is then generated from the region-enhanced features by the image description generation model. Because the region-enhanced features highlight the image semantics of the target parts, the image description generation model can focus on describing them when perceiving the image and generating the image description text, improving the visual descriptive power of the text.
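Region enhancement on the fused feature map might be sketched as below; the multiplicative gain and the box coordinates are illustrative assumptions (the embodiment only says the feature values inside the detected part boxes are increased):

```python
import numpy as np

def enhance_regions(fmap, boxes, gain=2.0):
    """Multiply the feature values inside each detected part's box by `gain`
    (the enhancement factor is an assumed hyperparameter).
    fmap: (H, W) fused feature map; boxes: list of (y0, y1, x0, x1)."""
    out = fmap.copy()
    for y0, y1, x0, x1 in boxes:
        out[y0:y1, x0:x1] *= gain
    return out

fmap = np.ones((6, 6))          # toy fused feature map
boxes = [(0, 2, 0, 2),          # e.g. beak region
         (3, 6, 3, 6)]          # e.g. back region
enhanced = enhance_regions(fmap, boxes)
print(enhanced[0, 0], enhanced[2, 2])  # 2.0 1.0
```

Feature values inside the part boxes are doubled while the background is untouched, which is the "highlighting of target-part semantics" the paragraph describes.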
Based on any of the above embodiments, the image description generation model is trained based on the following steps:
acquiring a fusion feature map describing a training image, and detecting the target positions of birds on the fusion feature map describing the training image based on a target detection model to obtain target detection frames of a plurality of target positions in the describing training image;
based on target detection frames of a plurality of target positions in the descriptive training image, enhancing characteristic values of target detection frame areas corresponding to the target positions in a fusion characteristic image of the descriptive training image to obtain area enhancement characteristics of the descriptive training image;
generating an image description text describing the training image by utilizing the region reinforcing features of the training image based on the image description generation model;
calculating description matching loss based on the alignment degree of the image description text of the description training image and the target detection frames of a plurality of target parts in the description training image and the similarity between the image description text of the description training image and the sample description text of the description training image;
and adjusting parameters of the image description generation model based on the description matching loss.
Specifically, a large number of description training images are acquired as the training set of the image description generation model. For any description training image, its fused feature map can be obtained in the same way as for the image to be identified or a sample image; the target parts of the bird are detected on the fused feature map by the target detection model to obtain target detection boxes for a plurality of target parts, and the feature values of the corresponding regions of the fused feature map are enhanced based on these boxes to obtain the region-enhanced features of the description training image. The image description text of the description training image is generated from these region-enhanced features by the image description generation model. A description matching loss is then calculated based on the degree of alignment between the generated image description text and the target detection boxes of the target parts in the description training image, and on the similarity between the generated image description text and the sample description text of the description training image; the parameters of the image description generation model are adjusted based on this loss.
When calculating the degree of alignment between the image description text of a description training image and the target detection boxes of its target parts, the image description text can be divided into a plurality of text segments and the text features of each segment extracted. The similarity between the text features of each segment and the feature sub-map corresponding to each target detection box region is calculated (a simple and fast vector dot product can be used here); for each segment, the maximum such similarity is selected as its matching score, and the matching scores of all segments are accumulated to obtain the degree of alignment. When calculating the similarity between the image description text and the sample description text, keywords can be selected from the sample description text, and the similarity determined from the number of keywords hit in the image description text.
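The two loss terms might be sketched as below; the toy embeddings, keyword list, and the way the terms are combined into a single loss are assumptions (the embodiment specifies only the alignment degree and the keyword-hit similarity themselves):

```python
import numpy as np

def alignment_degree(segment_feats, region_feats):
    """For each text segment, take its maximum dot-product similarity over all
    part-box region features, then accumulate over the segments."""
    return sum(max(float(s @ r) for r in region_feats) for s in segment_feats)

def keyword_similarity(generated, keywords):
    """Fraction of sample-description keywords hit by the generated text."""
    words = set(generated.lower().split())
    return sum(k in words for k in keywords) / len(keywords)

segs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]     # text segment features
regions = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # part-box features
gen = "a small bird with a red breast and short beak"
kws = ["red", "breast", "beak", "crest"]                # hypothetical keywords

align = alignment_degree(segs, regions)  # 0.9 + 0.8
sim = keyword_similarity(gen, kws)       # 3 of 4 hit
loss = -(align + sim)  # assumed combination: maximize both terms
print(align, sim)
```

Each segment is credited with its best-matching part region, so a description that mentions every detected part scores a high alignment degree, while the keyword term rewards agreement with the human-written sample description.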
The bird recognition system provided by the present invention is described below; the bird recognition system described below and the bird recognition method described above may be cross-referenced with each other.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of a bird recognition system according to the present invention, as shown in fig. 5, the system includes: an image feature extraction unit 510, a descriptive text generation unit 520, a text feature extraction unit 530, a first nearest neighbor determination unit 540, a second nearest neighbor determination unit 550, and a category determination unit 560.
The image feature extraction unit 510 is configured to extract image features of each sample image in the training set and image features of an image to be identified;
the descriptive text generation unit 520 is configured to generate an image descriptive text of each sample image in the training set and an image descriptive text of the image to be identified;
the text feature extraction unit 530 is configured to extract text features of category description texts corresponding to respective bird categories, and text features of image description texts of the sample image and the image to be identified;
the first nearest neighbor determining unit 540 is configured to determine a plurality of nearest neighbor sample images of the image to be identified and category description texts corresponding to the nearest neighbor sample images based on a similarity between image features of the sample images and image features of the image description texts of the image to be identified and a similarity between text features of the image description texts of the sample images and text features of the image description texts of the image to be identified;
The second nearest neighbor determining unit 550 is configured to determine a plurality of nearest neighbor type description texts of the type description texts corresponding to the nearest neighbor sample images based on similarity between text features of the type description texts corresponding to the bird types;
the category determining unit 560 is configured to determine the category of birds in the image to be recognized based on a degree of association between the image description text of the image to be recognized and the nearest neighbor category description text.
According to the system provided by the invention, generating image description texts for the sample images and the image to be identified provides additional visual features beyond the image features, compensating for the loss of some low-level visual features during image feature extraction and helping improve subsequent bird identification accuracy when the appearance similarity among finely subdivided bird species is high. In addition, obtaining a species description text for each bird species provides more and richer species information; more importantly, even if images of some specific species have not yet been collected in the training set, the features of those species, especially their visual features, can still be learned from the corresponding species description texts. The species of the bird in the image to be identified is then determined as follows: a plurality of nearest neighbor sample images of the image to be identified, together with their corresponding species description texts, are determined based on the similarity between the image features of each sample image and those of the image to be identified and the similarity between the text features of their image description texts; a plurality of nearest neighbor species description texts of those species description texts are determined based on the similarity between the text features of the species description texts corresponding to the bird species; and the species is determined from the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts. Bird identification accuracy, including in zero-shot scenarios, can thereby be improved.
Based on any one of the above embodiments, the determining the type of the bird in the image to be identified based on the association degree between the image description text of the image to be identified and the nearest neighbor type description text specifically includes:
determining the matching degree of any nearest neighbor species description text based on the degree of association between the text features of the image description text of the image to be identified and the text features of that nearest neighbor species description text, and the degree of association between the image features of the image to be identified and the text features of that nearest neighbor species description text;
and determining a category label corresponding to the nearest neighbor category description text with the highest matching degree as the category of birds in the image to be identified.
Based on any of the above embodiments, the degree of association between the image features of the image to be identified and the text features of any of the nearest neighbor type descriptive texts is determined based on the product of the image features of the image to be identified, the association matrix and the text features of any of the nearest neighbor type descriptive texts;
the association matrix is learned based on the following steps:
determining the current association value of any sample image and any kind of descriptive text based on the product of the image characteristic of any sample image, the current value of the association matrix and the text characteristic of any kind of descriptive text and the actual corresponding relation between any sample image and any kind of descriptive text, and accumulating the current association value of any sample image and each kind of descriptive text to obtain the association loss of any sample image;
and adjusting the values of the association matrix to reduce the association loss of each sample image until a preset target condition is reached.
Based on any of the above embodiments, the text features of any of the class descriptive texts are extracted based on the following steps:
clustering the species description texts corresponding to the bird species to obtain a plurality of text clusters, and determining cluster vectors of the text clusters based on a word bag model;
extracting a sentence vector of each clause in any kind of description text, and determining the text characteristics of any kind of description text based on the sentence vector of each clause in any kind of description text and the cluster vector of the text class cluster to which any kind of description text belongs.
Based on any one of the above embodiments, the determining the text feature of any one of the class description texts based on the sentence vector of each sentence in the any one of the class description texts and the cluster vector of the text class cluster to which the any one of the class description texts belongs specifically includes:
extracting sentence vectors of image title texts of a plurality of bird images;
determining an average value of similarity between sentence vectors of any clause in the description text of any kind and sentence vectors of image title texts of the bird images as a visual score of the any clause;
weighting to obtain the visual vector of any species description text based on the sentence vector and visual score of each clause in the species description text;
and fusing the visual vector of any kind of description text and the cluster vector of the text cluster to which the any kind of description text belongs to obtain the text characteristics of the any kind of description text.
Based on any one of the foregoing embodiments, the extracting the image features of each sample image and the image features of the image to be identified in the training set specifically includes:
performing successive feature extraction on any image based on a plurality of feature extraction layers of different sizes to obtain the feature maps corresponding to the feature extraction layers of different sizes; here, any image is a sample image or the image to be identified;
classifying each feature point in the feature map corresponding to the feature extraction layer with any size based on the classification layer to obtain the probability that each feature point in the feature map corresponding to the feature extraction layer with any size belongs to each bird species;
screening out distinguishing characteristic points in the characteristic images corresponding to the characteristic extraction layers with any size based on the probability that each characteristic point in the characteristic images corresponding to the characteristic extraction layers with any size belongs to various bird species;
and fusing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain the image features of the image.
Based on any one of the above embodiments, the fusing the distinguishing feature points in the feature graphs corresponding to the feature extraction layers with different sizes to obtain the image features of any one of the images specifically includes:
splicing the distinguishing feature points in the feature graphs corresponding to the feature extraction layers with different sizes to obtain spliced feature points;
and recombining the spliced characteristic points based on the full connection layer to obtain the image characteristics of any image.
Based on any of the above embodiments, the image description text of any image is generated based on the following steps:
fusing the feature images corresponding to the feature extraction layers with different sizes to obtain a fused feature image of any image;
detecting the target positions of birds on the fusion feature map of any image based on a target detection model to obtain target detection frames of a plurality of target positions in any image;
reinforcing characteristic values of target detection frame areas corresponding to all target positions in the fusion characteristic image of any image based on target detection frames of a plurality of target positions in the any image to obtain area reinforcing characteristics of the any image;
And generating an image description text of any image by utilizing the region reinforcing features of any image based on the image description generation model.
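The region-enhancement step can be sketched as scaling the feature values inside every detected part box. The multiplicative gain is an assumption; the patent only says the values inside the detection frames are enhanced:

```python
import numpy as np

def enhance_regions(fused_map, boxes, gain=2.0):
    """Multiply the feature values inside each detected target-part box
    (y0, x0, y1, x1) by `gain`, leaving the rest of the fused feature
    map unchanged. The gain value is illustrative."""
    out = fused_map.copy()
    for (y0, x0, y1, x1) in boxes:
        out[y0:y1, x0:x1, :] *= gain
    return out

fmap = np.ones((6, 6, 4))                       # toy fused feature map
boxes = [(0, 0, 2, 2), (3, 3, 6, 6)]            # two toy part detection frames
enhanced = enhance_regions(fmap, boxes)
```

The enhanced map would then be fed to the image description generation model in place of the raw fused map.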
Based on any of the above embodiments, the image description generation model is trained based on the following steps:
acquiring a fused feature map of a description training image, and detecting the target parts of birds on the fused feature map of the description training image based on the target detection model to obtain target detection frames of a plurality of target parts in the description training image;
enhancing, based on the target detection frames of the plurality of target parts in the description training image, the feature values of the target detection frame regions corresponding to the target parts in the fused feature map of the description training image to obtain the region-enhanced features of the description training image;
generating an image description text of the description training image from the region-enhanced features of the description training image based on the image description generation model;
calculating a description matching loss based on the degree of alignment between the image description text of the description training image and the target detection frames of the plurality of target parts in the description training image, and on the similarity between the image description text and the sample description text of the description training image;
and adjusting the parameters of the image description generation model based on the description matching loss.
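The description matching loss combines two scalar signals. A minimal sketch, assuming both the caption-to-box alignment score and the caption-to-reference similarity are already computed and normalized to [0, 1], and assuming an equal weighting (the patent does not specify the combination):

```python
def description_matching_loss(alignment, similarity, w_align=0.5, w_sim=0.5):
    """Loss is low when the generated caption both aligns well with the
    detected part boxes and is similar to the sample (reference) caption.
    Inputs are assumed to be scores in [0, 1]; weights are illustrative."""
    return w_align * (1.0 - alignment) + w_sim * (1.0 - similarity)

loss_good = description_matching_loss(0.9, 0.8)  # well-matched caption
loss_bad = description_matching_loss(0.2, 0.3)   # poorly-matched caption
```

During training, this scalar would be minimized by gradient descent on the description generation model's parameters.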
Fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in Fig. 6, the electronic device may include: a processor 610, a memory 620, a communication interface (Communications Interface) 630, and a communication bus 640, wherein the processor 610, the memory 620, and the communication interface 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 620 to perform a bird identification method comprising: extracting image features of each sample image in a training set and image features of an image to be identified; generating an image description text of each sample image in the training set and an image description text of the image to be identified; extracting text features of the species description text corresponding to each bird species, and text features of the image description texts of the sample images and of the image to be identified; determining a plurality of nearest neighbor sample images of the image to be identified, and the species description texts corresponding to the nearest neighbor sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified; determining a plurality of nearest neighbor species description texts of the species description texts corresponding to the nearest neighbor sample images based on the similarity among the text features of the species description texts corresponding to the bird species; and determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts.
Further, the logic instructions in the memory 620 may be implemented as software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the bird identification method provided above, the method comprising: extracting image features of each sample image in a training set and image features of an image to be identified; generating an image description text of each sample image in the training set and an image description text of the image to be identified; extracting text features of the species description text corresponding to each bird species, and text features of the image description texts of the sample images and of the image to be identified; determining a plurality of nearest neighbor sample images of the image to be identified, and the species description texts corresponding to the nearest neighbor sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified; determining a plurality of nearest neighbor species description texts of the species description texts corresponding to the nearest neighbor sample images based on the similarity among the text features of the species description texts corresponding to the bird species; and determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the bird identification method provided above, the method comprising: extracting image features of each sample image in a training set and image features of an image to be identified; generating an image description text of each sample image in the training set and an image description text of the image to be identified; extracting text features of the species description text corresponding to each bird species, and text features of the image description texts of the sample images and of the image to be identified; determining a plurality of nearest neighbor sample images of the image to be identified, and the species description texts corresponding to the nearest neighbor sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified; determining a plurality of nearest neighbor species description texts of the species description texts corresponding to the nearest neighbor sample images based on the similarity among the text features of the species description texts corresponding to the bird species; and determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A bird identification method, comprising:
extracting image features of each sample image in a training set and image features of an image to be identified;
generating an image description text of each sample image in the training set and an image description text of the image to be identified;
extracting text features of the species description text corresponding to each bird species, and text features of the image description texts of the sample images and of the image to be identified;
determining a plurality of nearest neighbor sample images of the image to be identified, and the species description texts corresponding to the nearest neighbor sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified;
determining a plurality of nearest neighbor species description texts of the species description texts corresponding to the nearest neighbor sample images based on the similarity among the text features of the species description texts corresponding to the bird species;
and determining the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts.
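The retrieval step of the claimed method can be sketched as a combined image-and-caption nearest-neighbor search. The cosine metric, the additive score combination, and all names are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nearest_neighbor_species(query_img, query_txt, samples, k=2):
    """samples: list of (image_feature, caption_text_feature, species_id).
    Score each sample by combined image-feature and caption-feature
    similarity to the query, and return the species ids of the k nearest."""
    scores = [cosine(query_img, si) + cosine(query_txt, st)
              for si, st, _ in samples]
    order = np.argsort(scores)[::-1][:k]
    return [samples[i][2] for i in order]

rng = np.random.default_rng(2)
q_img, q_txt = rng.normal(size=8), rng.normal(size=8)
bank = [(q_img + 0.01 * rng.normal(size=8), q_txt, "sparrow"),   # near duplicate
        (-q_img, -q_txt, "crane"),                               # opposite features
        (q_img, q_txt + 0.01 * rng.normal(size=8), "magpie")]    # near duplicate
neighbors = nearest_neighbor_species(q_img, q_txt, bank, k=2)
```

The species description texts of these neighbor samples would then feed the second nearest-neighbor search over the species description texts themselves.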
2. The bird identification method according to claim 1, wherein the determining of the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts specifically comprises:
determining the matching degree of any nearest neighbor species description text based on the degree of association between the text features of the image description text of the image to be identified and the text features of the any nearest neighbor species description text, and on the degree of association between the image features of the image to be identified and the text features of the any nearest neighbor species description text;
and determining the species label corresponding to the nearest neighbor species description text with the highest matching degree as the species of the bird in the image to be identified.
3. The bird identification method according to claim 2, wherein the degree of association between the image features of the image to be identified and the text features of the any nearest neighbor species description text is determined based on the product of the image features of the image to be identified, an association matrix, and the text features of the any nearest neighbor species description text;
the association matrix is learned based on the following steps:
determining the current association value of any sample image and any species description text based on the product of the image features of the any sample image, the current values of the association matrix, and the text features of the any species description text, together with the actual correspondence between the any sample image and the any species description text, and accumulating the current association values of the any sample image and each species description text to obtain the association loss of the any sample image;
and adjusting the values of the association matrix to reduce the association loss of each sample image until a preset target condition is reached.
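The bilinear association score and the iterative adjustment of the association matrix can be sketched as a squared-error loss minimized by plain gradient descent. The squared-error form, the learning rate, and the zero initialization are assumptions; the patent only requires that the loss be reduced until a target condition is met:

```python
import numpy as np

def association_loss(M, img_feats, txt_feats, match):
    """match[i, j] = 1 if sample image i truly belongs to species text j.
    The bilinear score img @ M @ txt should track the true correspondence;
    the loss accumulates the squared disagreement over all pairs."""
    scores = img_feats @ M @ txt_feats.T            # (n_images, n_species)
    return float(((scores - match) ** 2).sum())

def learn_association(img_feats, txt_feats, match, lr=1e-4, steps=300):
    """Adjust the association matrix M by gradient descent on the loss."""
    M = np.zeros((img_feats.shape[1], txt_feats.shape[1]))
    for _ in range(steps):
        scores = img_feats @ M @ txt_feats.T
        grad = 2 * img_feats.T @ (scores - match) @ txt_feats
        M -= lr * grad                              # reduce the association loss
    return M

rng = np.random.default_rng(3)
imgs = rng.normal(size=(6, 4))                      # 6 toy sample-image features
txts = rng.normal(size=(3, 5))                      # 3 toy species-text features
truth = (rng.random(size=(6, 3)) > 0.5).astype(float)
initial_loss = association_loss(np.zeros((4, 5)), imgs, txts, truth)
M = learn_association(imgs, txts, truth)
final_loss = association_loss(M, imgs, txts, truth)
```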
4. The bird identification method according to claim 1, wherein the text features of any species description text are extracted based on the following steps:
clustering the species description texts corresponding to the bird species to obtain a plurality of text clusters, and determining the cluster vector of each text cluster based on a bag-of-words model;
extracting a sentence vector of each clause in the any species description text, and determining the text features of the any species description text based on the sentence vectors of the clauses in the any species description text and the cluster vector of the text cluster to which the any species description text belongs.
5. The bird identification method according to claim 4, wherein the determining of the text features of the any species description text based on the sentence vectors of the clauses in the any species description text and the cluster vector of the text cluster to which the any species description text belongs specifically comprises:
extracting sentence vectors of the image title texts of a plurality of bird images;
determining the average similarity between the sentence vector of any clause in the any species description text and the sentence vectors of the image title texts of the bird images as the visual score of the any clause;
weighting the sentence vectors of the clauses in the any species description text by their visual scores to obtain the visual vector of the any species description text;
and fusing the visual vector of the any species description text with the cluster vector of the text cluster to which the any species description text belongs to obtain the text features of the any species description text.
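The visual-score weighting and fusion steps above can be sketched as follows. The cosine similarity for the visual score and concatenation as the fusion operation are assumptions; the patent does not fix either choice:

```python
import numpy as np

def visual_score(sentence_vec, title_vecs):
    """Average similarity between one clause's sentence vector and the
    sentence vectors of the bird-image title texts."""
    sims = [float(sentence_vec @ t /
                  (np.linalg.norm(sentence_vec) * np.linalg.norm(t)))
            for t in title_vecs]
    return sum(sims) / len(sims)

def species_text_feature(clause_vecs, title_vecs, cluster_vec):
    """Weight each clause by its visual score to get the visual vector,
    then fuse (here: concatenate) it with the bag-of-words cluster vector."""
    scores = np.array([visual_score(c, title_vecs) for c in clause_vecs])
    visual_vec = (scores[:, None] * np.stack(clause_vecs)).sum(axis=0)
    return np.concatenate([visual_vec, cluster_vec])

rng = np.random.default_rng(4)
clauses = [rng.normal(size=6) for _ in range(3)]   # 3 toy clause vectors
titles = [rng.normal(size=6) for _ in range(4)]    # 4 toy image-title vectors
cluster = rng.normal(size=5)                       # toy cluster vector
feat = species_text_feature(clauses, titles, cluster)
```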
6. The bird identification method according to claim 1, wherein the extracting of the image features of each sample image in the training set and the image features of the image to be identified specifically comprises:
performing successive feature extraction on any image based on a plurality of feature extraction layers of different sizes to obtain feature maps corresponding to the feature extraction layers of different sizes, where the any image is either a sample image or the image to be identified;
classifying each feature point in the feature map corresponding to a feature extraction layer of any size based on a classification layer to obtain the probability that each feature point in that feature map belongs to each bird species;
screening out the distinguishing feature points in the feature map corresponding to the feature extraction layer of any size based on the probability that each feature point in that feature map belongs to each bird species;
and fusing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain the image features of the any image.
7. The bird identification method according to claim 6, wherein the fusing of the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain the image features of the any image specifically comprises:
splicing the distinguishing feature points in the feature maps corresponding to the feature extraction layers of different sizes to obtain spliced feature points;
and recombining the spliced feature points based on a fully connected layer to obtain the image features of the any image.
8. The bird identification method according to claim 6, wherein the image description text of the any image is generated based on the following steps:
fusing the feature maps corresponding to the feature extraction layers of different sizes to obtain a fused feature map of the any image;
detecting the target parts of birds on the fused feature map of the any image based on a target detection model to obtain target detection frames of a plurality of target parts in the any image;
enhancing, based on the target detection frames of the plurality of target parts in the any image, the feature values of the target detection frame regions corresponding to the target parts in the fused feature map of the any image to obtain the region-enhanced features of the any image;
and generating the image description text of the any image from the region-enhanced features of the any image based on an image description generation model.
9. The bird identification method according to claim 8, wherein the image description generation model is trained based on the following steps:
acquiring a fused feature map of a description training image, and detecting the target parts of birds on the fused feature map of the description training image based on the target detection model to obtain target detection frames of a plurality of target parts in the description training image;
enhancing, based on the target detection frames of the plurality of target parts in the description training image, the feature values of the target detection frame regions corresponding to the target parts in the fused feature map of the description training image to obtain the region-enhanced features of the description training image;
generating an image description text of the description training image from the region-enhanced features of the description training image based on the image description generation model;
calculating a description matching loss based on the degree of alignment between the image description text of the description training image and the target detection frames of the plurality of target parts in the description training image, and on the similarity between the image description text and the sample description text of the description training image;
and adjusting the parameters of the image description generation model based on the description matching loss.
10. A bird identification system, comprising:
an image feature extraction unit, configured to extract the image features of each sample image in a training set and the image features of an image to be identified;
a description text generation unit, configured to generate an image description text of each sample image in the training set and an image description text of the image to be identified;
a text feature extraction unit, configured to extract the text features of the species description text corresponding to each bird species, and the text features of the image description texts of the sample images and of the image to be identified;
a first nearest neighbor determination unit, configured to determine a plurality of nearest neighbor sample images of the image to be identified, and the species description texts corresponding to the nearest neighbor sample images, based on the similarity between the image features of the sample images and the image features of the image to be identified and the similarity between the text features of the image description texts of the sample images and the text features of the image description text of the image to be identified;
a second nearest neighbor determination unit, configured to determine a plurality of nearest neighbor species description texts of the species description texts corresponding to the nearest neighbor sample images based on the similarity among the text features of the species description texts corresponding to the bird species;
and a species determination unit, configured to determine the species of the bird in the image to be identified based on the degree of association between the image description text of the image to be identified and the nearest neighbor species description texts.
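How the claimed units might be wired together can be sketched as plain composition; every unit implementation below is a trivial stand-in supplied purely for illustration, not the patented models:

```python
class BirdIdentificationSystem:
    """Composes the six claimed units; each unit is passed in as a callable."""

    def __init__(self, extract_image, generate_caption, extract_text,
                 find_neighbor_samples, find_neighbor_texts, decide_species):
        self.extract_image = extract_image                  # image feature extraction unit
        self.generate_caption = generate_caption            # description text generation unit
        self.extract_text = extract_text                    # text feature extraction unit
        self.find_neighbor_samples = find_neighbor_samples  # first nearest-neighbor unit
        self.find_neighbor_texts = find_neighbor_texts      # second nearest-neighbor unit
        self.decide_species = decide_species                # species determination unit

    def identify(self, image):
        img_feat = self.extract_image(image)
        caption = self.generate_caption(image)
        txt_feat = self.extract_text(caption)
        neighbor_samples = self.find_neighbor_samples(img_feat, txt_feat)
        neighbor_texts = self.find_neighbor_texts(neighbor_samples)
        return self.decide_species(caption, neighbor_texts)

# Toy stand-in units, wired together exactly as the claim lists them.
system = BirdIdentificationSystem(
    extract_image=lambda img: [1.0],
    generate_caption=lambda img: "red crown, long legs",
    extract_text=lambda cap: [0.5],
    find_neighbor_samples=lambda i, t: ["crane-sample"],
    find_neighbor_texts=lambda samples: ["crane description"],
    decide_species=lambda cap, texts: "red-crowned crane",
)
species = system.identify("photo.jpg")
```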
CN202310885421.6A 2023-07-19 2023-07-19 Bird identification method and system Pending CN116978060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310885421.6A CN116978060A (en) 2023-07-19 2023-07-19 Bird identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310885421.6A CN116978060A (en) 2023-07-19 2023-07-19 Bird identification method and system

Publications (1)

Publication Number Publication Date
CN116978060A true CN116978060A (en) 2023-10-31

Family

ID=88480775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310885421.6A Pending CN116978060A (en) 2023-07-19 2023-07-19 Bird identification method and system

Country Status (1)

Country Link
CN (1) CN116978060A (en)

Similar Documents

Publication Publication Date Title
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
EP2657857A1 (en) Method for binary classification of a query image
JP6897749B2 (en) Learning methods, learning systems, and learning programs
JP4926266B2 (en) Learning data creation device, learning data creation method and program
CN111221960A (en) Text detection method, similarity calculation method, model training method and device
CN108268641A (en) Invoice information recognition methods and invoice information identification device, equipment and storage medium
He et al. Aggregating local context for accurate scene text detection
CN117409419A (en) Image detection method, device and storage medium
CN113870254B (en) Target object detection method and device, electronic equipment and storage medium
CN111242114A (en) Character recognition method and device
CN116206334A (en) Wild animal identification method and device
CN115620083A (en) Model training method, face image quality evaluation method, device and medium
CN116978060A (en) Bird identification method and system
Gonzalez-Soler et al. Semi-synthetic data generation for tattoo segmentation
CN113468936A (en) Food material identification method, device and equipment
CN110147790A (en) Scene image trade mark detection method, system, device based on adaptive threshold
CN111782874B (en) Video retrieval method, video retrieval device, electronic equipment and storage medium
CN117194818B (en) Image-text webpage generation method and device based on video
CN116052220B (en) Pedestrian re-identification method, device, equipment and medium
Corcoll Semantic Image Cropping
Licu Automatic container code identification using Machine Learning
CN117746266A (en) Tree crown detection method, device and medium based on semi-supervised interactive learning
CN115204261A (en) Small sample data set model training method, device, equipment, medium and product
CN113591931A (en) Weak supervision target positioning method, device, equipment and medium
CN112507805A (en) Scene recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination