WO2021169723A1 - Image recognition method, apparatus, electronic device, and storage medium

Info

Publication number
WO2021169723A1
Authority: WO (WIPO (PCT))
Prior art keywords: sample, image, feature information, feature extraction, feature
Application number: PCT/CN2021/074191
Other languages: English (en), French (fr)
Inventor: 颜波
Original Assignee: Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司
Publication of WO2021169723A1

Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2411 Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 Fusion techniques of extracted features

Definitions

  • This application relates to the field of image processing technology, and more specifically, to an image recognition method, device, electronic equipment, and storage medium.
  • the embodiments of the present application propose an image recognition method, device, electronic equipment, and storage medium.
  • an embodiment of the present application provides an image recognition method, which includes: obtaining an image to be recognized; obtaining first feature information and second feature information of the image to be recognized based on a trained feature extraction model, wherein, the first feature information is used to characterize the target subcategory of the image to be recognized, the second feature information is used to characterize the difference between the target subcategory and other subcategories, and the target subcategory and The other subcategories belong to the same main category; the first feature information and the second feature information are fused to obtain fused feature information; the recognition result of the image to be recognized is determined according to the fused feature information; According to the recognition result, a predetermined operation is performed.
  • an embodiment of the present application provides an image recognition device.
  • The device includes: an image acquisition module for acquiring an image to be recognized; a feature extraction module for obtaining first feature information and second feature information of the image to be recognized based on a trained feature extraction model, where the first feature information is used to characterize the target subcategory of the image to be recognized, the second feature information is used to characterize the difference between the target subcategory and other subcategories, and the target subcategory and the other subcategories belong to the same main category; a feature fusion module for fusing the first feature information and the second feature information to obtain fused feature information;
  • an image recognition module for determining the recognition result of the image to be recognized according to the fused feature information; and an operation execution module for performing a predetermined operation according to the recognition result.
  • an embodiment of the present application provides an electronic device, including: a memory; one or more processors coupled to the memory; one or more application programs, wherein one or more application programs are stored In the memory and configured to be executed by one or more processors, one or more application programs are configured to execute the image recognition method provided in the above-mentioned first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, and the computer-readable storage medium stores program code, and the program code can be invoked by a processor to execute the image recognition method provided in the first aspect.
  • Fig. 1 shows a schematic diagram of an application scenario of an image recognition method provided by an embodiment of the present application.
  • Fig. 2 shows a schematic flowchart of an image recognition method provided by an embodiment of the present application.
  • FIG. 3 shows a schematic flowchart of an image recognition method provided by another embodiment of the present application.
  • Fig. 4 shows a schematic flowchart of step S240 in Fig. 3 in an exemplary embodiment of the present application.
  • FIG. 5 shows a schematic flowchart of an image recognition method provided by another embodiment of the present application.
  • Fig. 6 shows a schematic diagram of the bottleneck structure of MobileNetV2 in an exemplary embodiment of the present application.
  • Fig. 7 shows a schematic flowchart of step S330 in Fig. 5 in an exemplary embodiment of the present application.
  • Fig. 8 shows a schematic diagram of the training process of the first feature extraction network in an exemplary embodiment of the present application.
  • Fig. 9 shows a schematic diagram of an image recognition process based on a feature extraction model in an exemplary embodiment of the present application.
  • Fig. 10 shows a module block diagram of an image recognition device provided by an embodiment of the present application.
  • Fig. 11 shows a structural block diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 12 shows a storage unit provided by an embodiment of the present application for storing or carrying program code for implementing the image recognition method according to the embodiment of the present application.
  • Standard Deviation: describes the average deviation between the value of a random variable and its arithmetic mean, denoted by the Greek letter σ.
  • Adaptive Moment Estimation (ADAM): an optimization algorithm that iteratively updates the weights of a neural network based on training data, and designs an independent adaptive learning rate for each parameter by computing first-order and second-order moment estimates of the gradient.
  • Current image recognition methods are mostly aimed at specific fields and are used in relatively complex systems.
  • Meanwhile, image recognition on the mobile terminal has attracted more and more attention, and the related technology has developed accordingly. For example, users can identify unknown items or find similar items in real time through the terminal, which not only expands their knowledge and satisfies their curiosity, but also enhances the experience of using the terminal.
  • However, current image recognition methods can hardly meet the performance requirements for recognizing general objects on the mobile terminal.
  • the embodiments of the present application provide an image recognition method, device, electronic equipment, and computer readable storage medium.
  • In the embodiments of the present application, the features extracted by the trained feature extraction model take into account both the features themselves and the differences between the features of subcategories under the same main category, so that the finally fused feature information can reflect not only the differences between features of objects of different categories but also the differences between features of objects of the same category, which can significantly improve the accuracy of image recognition and gives the method a wider range of applications.
  • FIG. 1 shows a schematic diagram of an application scenario of an image recognition method provided by an embodiment of the present application.
  • the application scenario includes an image recognition system 10 provided by an embodiment of the present application.
  • The image recognition system 10 includes: a terminal 100 and a server 200.
  • The terminal 100 may be, but is not limited to, a mobile phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a personal computer, a wearable electronic device, etc.
  • the embodiment of the present application does not limit the device type of a specific terminal.
  • The server 200 may be a traditional server or a cloud server; it may be a single server, a server cluster composed of several servers, or a cloud computing service center.
  • The terminal 100 can acquire images, and the device for processing the images can be set in the server 200. After the terminal 100 acquires an image, it can transmit the image to the server 200; the server 200 processes the image and returns the processing result to the terminal 100, so that the terminal can realize image recognition and the like according to the processing result.
  • the processing result may be the recognition result, or the intermediate result of the intermediate process before the recognition result, such as the extracted feature, the feature after the feature fusion, etc., which are not limited here.
  • The device for processing the image can also be set on the terminal 100, so that the terminal 100 does not need to rely on establishing communication with the server 200 and can still recognize the image to be recognized to obtain the recognition result; in that case, the image recognition system 10 may only include the terminal 100.
  • FIG. 2 shows a schematic flowchart of an image recognition method provided by an embodiment of the present application, which can be applied to the above-mentioned terminal. The following will elaborate on the process shown in FIG. 2 in detail.
  • the image recognition method may include the following steps:
  • Step S110 Obtain an image to be recognized.
  • the image to be recognized may be an image after target detection, or an original image without target detection, which is not limited in the embodiment of the present application.
  • In some embodiments, target detection may be performed on an original image containing the target object, and the target object is detected and cropped from the original image to obtain the target image for subsequent feature extraction.
  • the image to be recognized may be input by the user based on the terminal, and in this case, the terminal may obtain the image input by the user as the image to be recognized. As another way, the terminal may also obtain the image to be recognized from other terminals or servers, which is not limited in this embodiment.
  • In some embodiments, obtaining the image to be recognized may include: obtaining an original image containing the target object, performing target detection on the original image, and then cropping to obtain the image to be recognized.
  • In some embodiments, a preprocessing operation may be performed on the image to be recognized, which may include normalizing the value of each pixel in the image to be recognized, for example, by dividing the value of each pixel by 255 so that it is normalized to [0, 1].
  • the normalization process may also include scaling the cropped image to a specified size, where the size is width * height, and the specified size can be determined according to actual needs or can be preset by the program. It can also be customized by the user, which is not limited here.
  • the specified size can be 224*224, and the unit can be pixels.
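  • For illustration, a minimal preprocessing sketch of the normalization and scaling described above; the Pillow/NumPy pipeline and the function name are assumptions, not part of the original disclosure:

```python
import numpy as np
from PIL import Image

def preprocess(image_path, size=(224, 224)):
    """Scale the image to the specified size and normalize pixels to [0, 1]."""
    img = Image.open(image_path).convert("RGB")
    img = img.resize(size)                    # scale to the specified width*height
    arr = np.asarray(img, dtype=np.float32)   # H x W x 3, values in [0, 255]
    return arr / 255.0                        # normalize each pixel value to [0, 1]
```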
  • Step S120 Based on the trained feature extraction model, first feature information and second feature information of the image to be recognized are obtained.
  • the first feature information is used to characterize the target subcategory of the image to be recognized
  • the second feature information is used to characterize the difference between the target subcategory and other subcategories, where the target subcategory and other subcategories belong to The same main category.
  • For example, writing desks, office desks, and dining tables are three different subcategories, and they all belong to the main category of tables. Even though they all belong to the table category, there may still be obvious differences among writing desks, office desks, and dining tables.
  • the features extracted by current image recognition models are not enough to describe the differences within the category.
  • the main category is one level above the subcategory.
  • The main categories and subcategories can be divided in different ways according to the classification granularity. For example, Siamese cats, Garfield cats, and blue cats are three different subcategories, and they all belong to the main category of cats. If the classification granularity is coarser, cats, dogs, and pigs can be divided into three different subcategories, and they all belong to the main category of animals.
  • the second feature information that can characterize the difference between the target subcategory and other subcategories is also extracted.
  • The trained feature extraction model can be stored locally in the terminal, so that the terminal does not rely on the network environment and does not need to consider communication time; it can directly run the feature extraction model locally to obtain the first feature information and the second feature information, which is conducive to improving the efficiency of image recognition.
  • the trained feature extraction model can be stored in the server.
  • The terminal can send the image to be recognized to the server, instructing the server to obtain the first feature information and the second feature information based on the trained feature extraction model and to return the result to the terminal.
  • the result can be the first and second feature information, or the fusion feature information obtained by the server continuing to execute step 130, or the recognition result obtained by the server continuing to execute step S140, which is not limited in this embodiment.
  • the terminal can still rely on a higher network speed to realize real-time image recognition on the mobile terminal and meet the needs of users for image recognition using the mobile terminal. .
  • Step S130 The first feature information and the second feature information are fused to obtain fused feature information.
  • The fusion feature information can contain both kinds of feature information at the same time, reflecting not only the difference in features between objects in different main categories but also the difference in features between objects in the same main category, so that subsequent classification based on the fusion feature information is more precise and accurate.
  • the first feature information and the second feature information can be fused according to weights to obtain the fused feature information.
  • the first feature information can correspond to the first weight
  • the second feature information can correspond to the second weight.
  • the fusion feature information is obtained by weighted average.
  • For example, the fusion feature information A can be obtained by a weighted average based on a predetermined formula; one typical form of such a formula is A = (a1·F1 + a2·F2) / (a1 + a2), where F1 and F2 denote the first feature information and the second feature information, and a1 and a2 denote the first weight and the second weight, respectively.
  • the specific values of the first weight and the second weight can be determined according to actual needs.
  • the network used to extract the first feature information can be recorded as the first feature extraction network
  • the network used to extract the second feature information can be recorded as the second feature extraction network.
  • The evaluation parameters of the two feature extraction networks are used to determine the first weight and the second weight.
  • the evaluation parameter includes at least one of accuracy rate and recall rate.
  • The ratio of the accuracies of the first and second feature extraction networks can be used as the ratio of the first weight to the second weight, and the first weight can then be calculated from a predetermined value of the second weight. For example, the predetermined value may be 1, in which case the second weight is 1 and the first weight is the product of the second weight and the ratio of the first weight to the second weight.
  • the first weight and the second weight may also be preset by the program or customized by the user, which is not limited in this embodiment.
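  • A minimal sketch of the weighted fusion described above, assuming the weighted-average form given earlier and accuracy-derived weights; the function name and example values are illustrative:

```python
import numpy as np

def fuse_weighted(first_feat, second_feat, acc_first, acc_second):
    """Weighted-average fusion: the weight ratio follows the accuracy ratio of
    the two networks, with the second weight fixed to the predetermined value 1."""
    w2 = 1.0
    w1 = w2 * (acc_first / acc_second)     # first weight = ratio * second weight
    return (w1 * first_feat + w2 * second_feat) / (w1 + w2)

# Example with two 1280-dimensional feature vectors (dimension is an assumption)
f1 = np.random.rand(1280).astype(np.float32)
f2 = np.random.rand(1280).astype(np.float32)
fused = fuse_weighted(f1, f2, acc_first=0.92, acc_second=0.88)
```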
  • Step 140 Determine the recognition result of the image to be recognized according to the fusion feature information.
  • the image to be recognized is classified according to the fusion feature information to determine the classification result of the image to be recognized, that is, the recognition result.
  • a classifier can be connected, and the classifier can be used to classify according to the input fusion feature information.
  • the classifier can use logistic regression, Softmax regression, or support vector machine (SVM), etc., This embodiment does not limit this.
  • If the classification confidence output by the classifier is greater than a given threshold, the classification result is output; if it is less than or equal to the given threshold, it can be determined that the image does not belong to any given category.
  • the given category is the classification category of the pre-divided image, which can be determined by the sample label of the sample used in the training process of the feature extraction model, that is, the feature extraction model can be obtained by training the sample labelled with the sample label of the given category.
  • the given threshold can be determined according to actual needs, or can be customized by the user, and is not limited here.
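  • For illustration, a sketch of classification over the fused feature with the given-threshold decision described above; the classifier weights, labels, and the 0.5 threshold are assumptions:

```python
import numpy as np

def classify_with_threshold(fused_feat, weights, bias, labels, threshold=0.5):
    """Softmax classification of the fused feature; return the predicted label only
    when its probability exceeds the threshold, otherwise report no given category."""
    logits = fused_feat @ weights + bias            # shape: (num_classes,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return labels[best] if probs[best] > threshold else None
```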
  • Step 150 Perform a predetermined operation according to the recognition result.
  • The predetermined operation may be the terminal outputting the recognition result; for example, the recognition result may be output by voice, text, or other methods, so that the user can obtain the recognition result of the image to be recognized. In some embodiments, information related to the recognition result can also be output by the terminal together with the recognition result, so that the user can not only know the recognition result of the image to be recognized but also obtain related information to further expand their knowledge.
  • The predetermined operation can also be sending the recognition result to other terminals or servers to synchronize the recognition result. In some embodiments, when the terminal sends the recognition result to other terminals or servers, it can also instruct them to execute an operation corresponding to the recognition result.
  • the terminal can also determine the control command corresponding to the recognition result according to the recognition result, and send the control command to other terminals or servers (for convenience of description, it can be recorded as the opposite terminal) to instruct other terminals or The server executes the control operation corresponding to the instruction.
  • the terminal can store at least the mapping relationship between the recognition result and the control command locally to determine the corresponding control command according to the recognition result.
  • The opposite terminal can store at least the mapping relationship between the control command and the control operation, which is used to determine and execute the corresponding control operation according to the received control command.
  • In other embodiments, the local terminal and the opposite terminal may also both store the mapping relationships among the recognition result, the control command, and the control operation, which is not limited here.
  • For example, if user A's terminal obtains that the recognition result of the image to be recognized is a honey badger, it can generate and play the voice "The current animal is a honey badger"; it can also send the recognition result containing "honey badger" to user B's terminal, so that user B can also learn information related to the honey badger; it can also obtain images or videos related to the honey badger and send them to other terminals or play them locally.
  • other predetermined operations can also be performed, which are not limited here.
  • the predetermined operation may be determined by the function of an application (APP) used by the current terminal to obtain the image to be recognized.
  • the application program can be the application program that comes with the terminal system, such as camera, photo album, calendar, etc.; the application program can also be the application program downloaded and installed by the user from the application market, application store or other third-party platforms, such as Youku, Taobao, etc. . This embodiment does not limit this.
  • For example, if the application is a camera and the application has image recognition capabilities, then when the user encounters an unknown object, the user can open the camera application and capture an image of the unknown object; the application can recognize the unknown object in real time to obtain the recognition result and output it by voice, text, or other methods. For example, the recognition result of the unknown thing can be played by voice and its text information can be displayed, so that the user can learn the related information of the unknown thing in real time, which is convenient for expanding the user's knowledge and satisfying the user's curiosity.
  • the image processing strategy that matches the recognition result can also be determined based on the recognition result obtained by recognizing the captured object.
  • the image processing strategy includes filters and image processing algorithms.
  • The image processing algorithm may be an algorithm that modifies the image display effect by optimizing image parameters, where the image parameters may include but are not limited to one or more of contrast, brightness, saturation, etc., thereby achieving image processing such as one or a combination of increasing/decreasing contrast, increasing/decreasing brightness, and increasing/decreasing saturation.
  • Each recognition result can correspond to an image processing strategy, and different recognition results can correspond to the same or different image processing strategies; for example, recognition result A and recognition result B can belong to the same category and thus correspond to the same image processing strategy.
  • In some embodiments, the terminal can first determine the category of the recognition result, and then determine the matching image processing strategy according to the category, so as to automatically perform image processing on the recognized image, improve the image display effect, help users obtain more satisfactory photos, and improve the user experience.
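  • For illustration, a hypothetical mapping from recognition-result category to image processing strategy; the category names, filter names, and parameter adjustments are examples, not values from the original disclosure:

```python
# Hypothetical category-to-strategy table: each strategy names a filter and
# parameter adjustments (contrast/brightness/saturation deltas).
STRATEGY_BY_CATEGORY = {
    "food":      {"filter": "warm",  "saturation": +0.15, "contrast": +0.05},
    "landscape": {"filter": "vivid", "saturation": +0.10, "brightness": +0.05},
    "portrait":  {"filter": "soft",  "contrast": -0.05},
}

def pick_strategy(recognition_result, category_of):
    """First determine the category of the recognition result, then look up the strategy."""
    category = category_of(recognition_result)   # e.g. "dining table" -> "furniture"
    return STRATEGY_BY_CATEGORY.get(category, {"filter": "none"})
```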
  • the terminal may first capture an image containing the object to be recognized, and then recognize the image through the above method to obtain the recognition result.
  • In other embodiments, the terminal does not need to capture an image first; when the object to be recognized is within the camera's field of view, the terminal can directly recognize the image in the field of view to obtain the recognition result, which can further improve the real-time performance of image recognition and satisfy the user's need for real-time recognition. This embodiment does not limit this.
  • The terminal can use this method to obtain the recognition result of each photo in the album, so as to classify the photos according to the recognition results.
  • various types of corresponding photo albums or photo collections can be photo albums or photo collections corresponding to various subcategories, or photo albums or photo collections corresponding to various main categories, which are not limited here.
  • The image can be stored in the photo album or atlas corresponding to its recognition result; if there is no photo album or atlas corresponding to the recognition result, the terminal can create a photo album or atlas corresponding to the recognition result according to the recognition result of the image, and then store the image in it.
  • In summary, the image recognition method provided by the embodiments of the application acquires the image to be recognized, and then, based on the trained feature extraction model, obtains the first feature information that can characterize the target subcategory of the image to be recognized and the second feature information that can characterize the difference between the target subcategory and other subcategories.
  • Therefore, the features extracted by the trained feature extraction model in the embodiments of the present application take into account both the features themselves and the differences between the features of subcategories under the same main category, so that the finally fused feature information can reflect not only the difference in features between objects of different categories but also the difference in features between objects of the same category, which can significantly improve the accuracy of image recognition and gives the method a wider range of applications.
  • In some embodiments, the trained feature extraction model can be obtained by training in the following way. FIG. 3 shows a schematic flowchart of an image recognition method provided by another embodiment of the present application, which can be applied to the above-mentioned terminal; the image recognition method may include:
  • Step S210 Obtain multiple sample sets.
  • the sample set includes a plurality of sample images and sample labels corresponding to the sample images, wherein the sample labels corresponding to the sample images in the same sample set belong to the same main category.
  • the sample label is the label of the subcategory to which the sample image belongs, that is, the subcategory label.
  • a sample set corresponds to a main category, that is, the sample images in a sample set belong to the same main category, that is, the sample labels of the sample images have the same main category label .
  • For example, the sample set S includes sample image A, sample image B, and sample image C, where the sample label corresponding to sample image A is writing desk, the sample label corresponding to sample image B is office desk, and the sample label corresponding to sample image C is dining table.
  • A, B, and C all belong to the same main category, the table category.
  • sample labels of sample images in the same sample set may be the same or different, which is not limited in this embodiment.
  • image data and category labels of different objects in different scenes can be obtained, and sample images and corresponding sample labels can be obtained from this.
  • Specifically, the target object area containing the target object can be detected and cropped from the original image, the target object area can be scaled to a specified size, and then normalization is performed on the target object area to obtain the sample image. For example, the values of all pixels in the target object area can be divided by 255 to normalize the pixel values to [0, 1].
  • The category label corresponding to the original image is recorded as the sample label corresponding to the sample image. In this way, multiple sample images and the sample labels corresponding to the sample images can be obtained.
  • The object detection model can be composed of the following networks: for example, it can be a region-based convolutional neural network (Regions with CNN, RCNN) (including RCNN, Fast RCNN, and Faster RCNN), a YOLO (You Only Look Once) network, or a Single Shot MultiBox Detector (SSD) network; this embodiment does not limit the specific type of the target detection network.
  • the object detection model may use MobileNet-SSD or MobileNet-SSDLite, specifically, it may include but not limited to MobileNetV1+SSD, MobileNetV2+SSD, MobileNetV1+SSDLite, MobileNetV2+SSDLite, and so on.
  • MobileNet is an efficient model for mobile terminal visual recognition; based on the aforementioned object detection models, real-time lightweight target detection can be realized and the efficiency of target detection can be improved.
  • SSDLite modifies the SSD structure by replacing all standard convolutions in the SSD prediction layers with depthwise separable convolutions, which can greatly reduce the number of parameters and the computational cost, making the calculation more efficient.
  • the further description of MobileNet can be found in the following steps.
  • the sample image and its sample label can be stored in the sample set corresponding to the main category. Multiple sample sets are available.
  • Step S220 Based on the initial feature extraction model and the sample image, first sample feature information and second sample feature information are obtained.
  • the initial feature extraction model includes a first feature extraction network and a second feature extraction network.
  • the first feature extraction network is used to extract first sample feature information
  • The second feature extraction network is used to extract the second sample feature information.
  • The first sample feature information is a feature vector used to characterize the target subcategory of the sample image, and the second sample feature information is a feature vector used to characterize the difference between the target subcategory and other subcategories, where the target subcategory and the other subcategories belong to the same main category.
  • the first feature extraction network may be MobileNetV1 or MobileNetV2.
  • MobileNetV1 is a general-purpose computer vision neural network designed for mobile devices, which can support tasks such as image classification and detection.
  • MobileNetV2 is an upgraded version based on MobileNetV1, which can be used for image classification, target detection and semantic segmentation, and MobileNetV2 achieves faster feature extraction and higher accuracy.
  • The terminal can use the MobileNetV2 network as the backbone network of the initial feature extraction model, which can greatly reduce the size of the model and make the model more lightweight, suitable for deployment on the mobile terminal, and meet the requirements of the terminal, especially the mobile terminal, for real-time performance, light weight, and high performance.
  • the first feature extraction network can also be other networks, such as a convolutional neural network with the classification module removed.
  • For example, the first feature extraction network can be a convolutional neural network retained up to the last convolution layer, and may use a deep convolutional neural network such as ResNet101.
  • the first feature extraction network may also use other convolutional neural networks, such as Inception-Resnet-V2, NasNet, etc., which is not limited in this embodiment.
  • The initial feature extraction model uses the first feature extraction network as the backbone network to extract the first sample feature information, and adds a second feature extraction network after the first feature extraction network, which is used to obtain the second sample feature information from the feature information output by the first feature extraction network.
  • the second feature extraction network may include at least two fully connected layers (Fully Connected Layer, FC), the dimensions of which are consistent with the output dimensions of the first feature extraction network. That is, at least two fully connected layers are added after the first feature extraction network to obtain the initial feature extraction model. In an example, two fully connected layers can be added after MobileNetV2, and the dimensions are consistent with the output dimensions of the MobileNetV2 model for training.
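  • A minimal sketch of this architecture, assuming a PyTorch/torchvision MobileNetV2 backbone whose 1280-dimensional output is followed by two fully connected layers of the same dimension; the class name, layer sizes, and the ReLU between the FC layers are assumptions:

```python
import torch
import torch.nn as nn
import torchvision

class FeatureExtractionModel(nn.Module):
    """MobileNetV2 backbone (first feature extraction network) plus two fully
    connected layers (second feature extraction network) of matching dimension."""
    def __init__(self, feat_dim=1280):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(weights=None)
        self.first_net = backbone.features           # keep the convolutional part only
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.second_net = nn.Sequential(              # two FC layers, same dimension
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):
        mu = self.pool(self.first_net(x)).flatten(1)   # first feature info (logit-mu)
        sigma = self.second_net(mu)                    # second feature info (logit-sigma)
        return mu, sigma, mu + sigma                   # fused feature: element-wise sum
```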
  • Step S230 fuse the first sample feature information and the second sample feature information to obtain sample fusion feature information.
  • the sample fusion feature information can be obtained by adding the first sample feature information and the second sample feature information.
  • the elements of the first sample feature information and the second sample feature information can be correspondingly added.
  • Optionally, the first sample feature information and the second sample feature information are both feature vectors and have the same dimension. Therefore, the corresponding elements of the two feature vectors can be added to obtain the value of each element of the sample fusion feature information, thereby obtaining the sample fusion feature information that fuses the first sample feature information and the second sample feature information.
  • Step S240 Correct the network parameters of the second feature extraction network in the initial feature extraction model according to the sample fusion feature information and the sample label corresponding to the sample image.
  • the network parameters may include the weight of the network.
  • In some embodiments, before training the second feature extraction network, the first feature extraction network may be trained first; that is, when the network parameters of the second feature extraction network are corrected, the first feature extraction network has already been pre-trained. Therefore, when training the second feature extraction network, the network parameters of the first feature extraction network can be kept unchanged and only the network parameters of the second feature extraction network are modified, so that while the first sample feature information output by the first feature extraction network characterizes the target subcategory of the sample image, the second sample feature information that characterizes the intra-class feature difference can be extracted through the second feature extraction network.
  • In some embodiments, step S240 may include step S241 to step S242 to train the second feature extraction network so that it can extract features that characterize feature differences within a class, improving the precision and accuracy of subsequent classification.
  • FIG. 4 shows a schematic flowchart of step S240 in FIG. 3 in an exemplary embodiment of the present application.
  • Step S240 includes:
  • Step S241 Obtain a second loss function value corresponding to the sample image according to the sample fusion feature information and the sample label corresponding to the sample image.
  • In some embodiments, sample images of the same main category can be taken from a sample set as a training batch, so that n training batches can be obtained for training the second feature extraction network and correcting its network parameters; in this way, the main categories of the samples within a training batch are the same.
  • a sample set can be used as a training batch for training.
  • a predetermined number of sample images and corresponding sample labels from the sample set can be taken from the sample set according to the predetermined number of samples in each training batch as a training batch for training.
  • the predetermined number of samples can be determined according to actual needs, which is not limited in this embodiment. It is understandable that the higher the predetermined number of samples, the higher the number of sample images contained in a training batch, and the larger the training volume of a batch.
  • each category can be repeated, that is, the target categories of samples in different training batches can be repeated.
  • the target category may be a main category or a subcategory, which is not limited here.
  • the main category corresponding to training batch 1 is a table, that is, the main category of samples included in training batch 1 is a table, and the main category corresponding to training batch 2 can also be a table.
  • In other embodiments, the main categories corresponding to different training batches may not be repeated, which is not limited here. That is, in the foregoing example, the main category corresponding to training batch 2 may be chair instead of table, the main category corresponding to training batch 3 may be computer, and so on for training batch 4.
  • In this way, each batch contains different images of objects belonging to the same main category. The sample images of each training batch are then input into the initial feature extraction model batch by batch for training. During the training process, the network parameters of the first feature extraction network are kept unchanged, and only the network parameters of the second feature extraction network are trained. The output of the second feature extraction network and the output of the first feature extraction network are fused to obtain the final feature, that is, the sample fusion feature information. Finally, classification is performed according to the sample fusion feature information to obtain the classification result, and the second loss function value corresponding to the sample image is obtained based on the second loss function from the classification result and the sample label corresponding to the sample image.
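  • A minimal training sketch of this stage, assuming the FeatureExtractionModel sketched earlier, a data loader yielding batches that share one main category, and CrossEntropyLoss as the Softmax Loss; all names and hyperparameters other than those stated in the text are assumptions:

```python
import torch
import torch.nn as nn

def train_second_network(model, batches_by_main_category, num_classes, feat_dim=1280):
    """Train only the second feature extraction network; the first network is frozen."""
    classifier = nn.Linear(feat_dim, num_classes)
    criterion = nn.CrossEntropyLoss()                  # Softmax Loss
    for p in model.first_net.parameters():             # keep the first network unchanged
        p.requires_grad = False
    params = list(model.second_net.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))

    for images, labels in batches_by_main_category:    # one main category per batch
        _, _, fused = model(images)                    # sample fusion feature information
        loss = criterion(classifier(fused), labels)    # second loss function value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```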
  • the second loss function may be set to Softmax Loss. In other embodiments, the second loss function may also be set to L2Loss, Focal Loss, etc., which is not limited here.
  • It should be noted that the first feature extraction network may be trained in advance based on sample images of the same training batch and their corresponding sample labels; that is, during the training of both the first feature extraction network and the second feature extraction network, the sample images under the same main category are trained as a training batch. Specific implementations can be seen in the following embodiments and will not be repeated here.
  • The output of the first feature extraction network can be used to describe the average condition of the features; that is, it can be regarded as the mean of the feature (Mean), recorded as the feature mean (logit-mu). The output of the second feature extraction network can be used to describe the average deviation of the feature value from its mean; that is, it can be regarded as the standard deviation of the feature (Standard Deviation), recorded as the feature standard deviation (logit-sigma), which reflects the difference of features within the same main category, that is, the difference between subcategories under the same main category.
  • The final feature obtained by fusing the feature mean and the feature standard deviation can reflect not only the difference in features between objects of different categories, but also the difference in features between objects of the same category, so the accuracy of model recognition can be significantly improved and the model has a wider range of applications.
  • Step S242 Correct the network parameters of the second feature extraction network based on the second loss function value.
  • the network parameters of the second feature extraction network can be corrected based on a predetermined optimization algorithm, until the second loss function value meets the second convergence condition, the second feature extraction network can be stopped. Train and obtain the trained second feature extraction network, that is, the second feature extraction network containing the corrected network parameters. If the second loss function value does not satisfy the second convergence condition, the next sample image can be obtained for the next round of training.
  • the second convergence condition may be a preset threshold, and when the second loss function is less than the preset threshold, it may be determined that the second loss function meets the second convergence condition, otherwise it is not.
  • It is understandable that the smaller the preset threshold, the higher the model training requirements, and the better the effect achieved by the network once the second loss function meets the second convergence condition.
  • In other embodiments, the second convergence condition may also be that the second loss function converges to near its minimum value; a confidence range can be set with the minimum value as the center, and when the second loss function converges to within the confidence range, it can be considered to converge to near the minimum value, and it can then be determined that the second loss function satisfies the second convergence condition.
  • the predetermined optimization algorithm may be Adaptive Moment Estimation (ADAM).
  • For example, the momentum factor BETA_1 can be set to 0.9, the momentum factor BETA_2 to 0.999, and the basic learning rate (LEARNING_RATE) to 0.001, with the learning rate decreased gradually to speed up convergence. Specifically, every time the number of iterations increases by 300,000, the learning rate drops to 0.3 times its previous value: after 300,000 iterations the basic learning rate is updated to 0.0003, after 600,000 iterations it is updated to 0.00009, and so on, until the second loss function satisfies the second convergence condition. Therefore, in this embodiment, after training the second feature extraction network with a large amount of data, the corrected network parameters of the second feature extraction network can be obtained.
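  • A sketch of these optimizer settings in PyTorch (ADAM with BETA_1 = 0.9, BETA_2 = 0.999, base learning rate 0.001, multiplied by 0.3 every 300,000 iterations); `params`, `training_batches`, and `compute_second_loss` are hypothetical placeholders:

```python
import torch

optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300_000, gamma=0.3)

for step, (images, labels) in enumerate(training_batches):
    loss = compute_second_loss(images, labels)   # hypothetical helper for the second loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # lr decays at 300k, 600k, ... iterations
```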
  • Step S250 Determine the initial feature extraction model including the corrected network parameters as the trained feature extraction model.
  • the model composed of the trained second feature extraction network and the first feature extraction network is determined as the trained feature extraction model, which can be used to extract the first feature information and the second feature information from the image to be recognized, and is used to perform Recognition.
  • In summary, the image recognition method provided by this embodiment introduces the concept of feature standard deviation to express the intra-class difference; that is, for each category of objects, it is necessary to extract not only the mean of the features but also the standard deviation of the features, which expresses the difference between objects within the class. The mean and the standard deviation are then fused to obtain the final feature before classification, which can significantly improve the accuracy of the final classification.
  • In addition, in this embodiment, the network parameters of the first feature extraction network may be corrected first, so that when the second feature extraction network is corrected, on the one hand, the features input to the second feature extraction network are more accurate, and on the other hand, the fusion feature information obtained by subsequent fusion characterizes the features better, thereby improving the performance of the entire feature extraction model, which is conducive to improving the precision and accuracy of image classification.
  • FIG. 5 shows a schematic flowchart of an image recognition method provided by another embodiment of the present application. The method may include the following steps:
  • Step S310 Obtain multiple sample sets.
  • Step S320 Based on the initial first feature extraction network, first sample feature information of the sample image is obtained.
  • In some embodiments, sample images of the same main category can be taken from the sample sets as a training batch, so that n training batches can be obtained, with the main categories of the samples in each training batch being the same; then the first sample feature information of the sample images can be obtained based on the initial first feature extraction network.
  • sample images in the sample set and the sample labels corresponding to the sample images are input into the initial first feature extraction network to train the initial first feature extraction network.
  • the first sample feature information of the sample image can be obtained based on the initial first feature extraction network and the sample image.
  • the initial first feature extraction network may be various networks such as MobileNet.
  • FIG. 6 shows a schematic diagram of the bottleneck structure of MobileNetV2 in an exemplary embodiment of the present application.
  • When the stride (Stride) is 1, as shown in Figure 6(a), the input (Input) first passes through a 1×1 convolution with the rectified linear unit (ReLU) to increase the dimension, then a depthwise convolution (Depthwise, DW) extracts the features, the output (Output) is obtained through a linear pointwise convolution, and finally the shortcut structure (the curve from the input to the addition (Add) in Figure 6) adds the Input and the Output together to form a residual structure.
  • The ReLU here specifically uses ReLU6; that is, the maximum output value is limited to 6 on the basis of ordinary ReLU, so that good numerical resolution is maintained even when the mobile terminal device uses low-precision float16.
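  • For illustration, a sketch of the MobileNetV2 bottleneck described above (1×1 expansion with ReLU6, depthwise convolution with ReLU6, linear pointwise projection, and a shortcut add when the stride is 1 and the channel counts match); the BatchNorm layers and default expansion factor of 6 follow the standard MobileNetV2 design rather than the original text:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style bottleneck block corresponding to Fig. 6."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),              # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                  # depthwise (DW) conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # linear pointwise conv
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out               # shortcut (residual add)
```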
  • Step S330 Correct the network parameters of the initial first feature extraction network according to the first sample feature information and the sample label corresponding to the sample image.
  • step S330 may include step S331 to step S332 to modify the network parameters of the first feature extraction network.
  • FIG. 7 shows a schematic flowchart of step S330 in FIG. 5 in an exemplary embodiment of the present application.
  • Step S330 includes:
  • Step S331 Obtain a first loss function value corresponding to the sample image according to the first sample feature information and the sample label corresponding to the sample image.
  • Specifically, classification is performed according to the first sample feature information; for example, a Softmax classifier can be used for classification to obtain the classification result corresponding to the sample image, and then the first loss function value corresponding to the sample image can be obtained according to the classification result and the sample label corresponding to the sample image.
  • FIG. 8 shows a schematic diagram of the training process of the first feature extraction network provided by an exemplary embodiment of the present application.
  • As shown in Figure 8, the first sample feature information can be obtained through the first feature extraction network, and is then classified by a classifier such as a Softmax classifier to obtain the classification result, that is, the classification label corresponding to the sample image, which is used together with the sample label to obtain the first loss function value corresponding to the sample image.
  • the first loss function value corresponding to the sample image can be obtained according to the classification result and the sample label corresponding to the sample image.
  • the first loss function may be Softmax Loss.
  • In some embodiments, formula (1) of the Softmax Loss can be as follows:

    $L_i = -\log\dfrac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{k} e^{W_{j}^{T} x_i + b_{j}}}$    (1)
  • x_i represents the output vector of the i-th sample image through MobileNetV2, that is, the first sample feature information;
  • W is the weight vector;
  • b represents the bias;
  • y_i represents the sample label corresponding to the i-th sample image. Therefore, according to formula (1), the first loss function value corresponding to the sample image can be obtained.
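  • A numerical sketch of formula (1), assuming NumPy and batched inputs; the shapes and names are illustrative:

```python
import numpy as np

def softmax_loss(x, W, b, y):
    """Per-sample Softmax Loss: x is (N, D) with one output vector per sample,
    W is (D, k), b is (k,), and y holds the integer sample labels."""
    logits = x @ W + b                                    # W_j^T x_i + b_j for all classes j
    logits -= logits.max(axis=1, keepdims=True)           # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]               # loss value per sample image
```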
  • Step S332 Correct the network parameters of the initial first feature extraction network based on the first loss function value.
  • machine learning algorithms can be used to modify the network parameters of the initial first feature extraction network, that is, to optimize the initial first feature extraction network to obtain an initial first feature extraction network containing the corrected network parameters .
  • the machine learning algorithm may be ADAM or other algorithms, which is not limited here.
  • the parameter setting optimized based on the ADAM algorithm can be determined according to actual needs, and can also be set with reference to the parameters described in the foregoing embodiment, which will not be repeated here.
  • The initial first feature extraction network whose network parameters have been corrected, that is, the trained initial first feature extraction network, is determined as the first feature extraction network of the initial feature extraction model.
  • the network structure of the first feature extraction network may be as shown in Table 1.
  • t represents the "expansion" multiple (multiplication factor of the input channel)
  • c represents the number of output channels
  • n represents the number of repetitions
  • s represents the stride
  • k represents the total number of image categories.
  • the number of image categories may be the number of subcategories.
  • the number of image categories may also be the number of main categories.
  • Step S340 Determine the initial first feature extraction network as the first feature extraction network of the initial feature extraction model.
  • the initial first feature extraction network after training is determined as the first feature extraction network of the initial feature extraction model.
  • the first feature extraction network is used to extract the first feature information of the target image, and is used as the input of the second feature extraction network, and is fused with the output of the second feature extraction network.
  • the target image represents an image whose features are to be extracted, such as an image input to an initial feature extraction model.
  • Step S350 Based on the initial feature extraction model and the sample image, first sample feature information and second sample feature information are obtained.
  • the second feature extraction network is after the first feature extraction network, and the output of the first feature extraction network is the input of the second feature extraction network.
  • the sample image first passes through the first feature extraction network to obtain the first sample feature information, and then the first sample feature information passes through the second feature extraction network to obtain the second sample feature information.
  • the second feature extraction network includes at least two fully connected layers, and the dimension is consistent with the output dimension of the first feature extraction network.
  • Step S360 fuse the first sample feature information and the second sample feature information to obtain sample fusion feature information.
  • the first sample feature information and the second sample feature information are fused to obtain the sample fusion feature information.
  • Specifically, the fusion method of the two can be adding the corresponding elements.
  • Step S370 Correct the network parameters of the second feature extraction network in the initial feature extraction model according to the sample fusion feature information and the sample label corresponding to the sample image.
  • Step S380 Determine the initial feature extraction model including the corrected network parameters as the trained feature extraction model.
  • It should be noted that the above-mentioned embodiments only describe the algorithm for feature extraction as a model, that is, the feature extraction model; in other embodiments, the algorithm for classifying based on the fused feature information can also be added after it, so that an image recognition model is obtained.
  • For example, after target detection by a trained object detection model such as MobileNet-SSD, the image to be recognized first passes through the first feature extraction network (such as MobileNetV2) to obtain the first feature information, that is, the feature mean; it then passes through the second feature extraction network (such as two fully connected layers) to obtain the second feature information, that is, the feature standard deviation; the first feature information and the second feature information are fused to obtain the fused feature information, and classification is performed according to the fused feature information. The classification result, that is, the label corresponding to the image to be recognized, can be obtained based on a Softmax classifier to determine the recognition result of the image to be recognized. If the classification confidence is greater than a given threshold, the classification result is output; otherwise it is determined that the image does not belong to any given category.
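  • An end-to-end inference sketch of this process, assuming the FeatureExtractionModel and classifier sketched earlier; the detector callable, label list, and the 0.5 threshold are placeholders:

```python
import torch

def recognize(image_tensor, detector, model, classifier, labels, threshold=0.5):
    """Detect and crop the target, extract mean/std features, fuse, classify, and threshold."""
    crop = detector(image_tensor)                     # e.g. a MobileNet-SSD crop (hypothetical)
    mu, sigma, fused = model(crop.unsqueeze(0))       # feature mean, feature std, fused feature
    probs = torch.softmax(classifier(fused), dim=1)[0]
    conf, idx = probs.max(dim=0)
    return labels[int(idx)] if conf > threshold else None   # None: not in any given category
```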
  • the object detection model used for target detection can also be added before the feature extraction model. It is understandable that all methods for feature extraction using the feature extraction model provided in the embodiments of this application should fall within the protection scope of this application.
  • In the embodiments of the present application, the main framework of the feature extraction model is based on the MobileNetV2 network, so real-time prediction on the mobile terminal can be realized. In addition, the concept of feature standard deviation and a specific training method for it are proposed; the feature standard deviation can be used to indicate the difference within a class.
  • Therefore, the final feature obtained by fusing the feature mean and the feature standard deviation, that is, the fusion feature information, can reflect not only the difference between features of different types of objects but also the difference between features of the same type of objects, so it can significantly improve the accuracy of model recognition and has a wider range of applications.
  • FIG. 10 shows a structural block diagram of an image recognition device 1000 provided by an embodiment of the present application.
  • the image recognition device 1000 can be applied to the aforementioned terminal.
  • The image recognition device 1000 can include: an image acquisition module 1010, a feature extraction module 1020, a feature fusion module 1030, an image recognition module 1040, and an operation execution module 1050. Specifically:
The image acquisition module 1010 is used to acquire the image to be recognized.
The feature extraction module 1020 is configured to obtain first feature information and second feature information of the image to be recognized based on the trained feature extraction model, where the first feature information is used to characterize the target subcategory of the image to be recognized, the second feature information is used to characterize the difference between the target subcategory and other subcategories, and the target subcategory and the other subcategories belong to the same main category.
The feature fusion module 1030 is configured to fuse the first feature information and the second feature information to obtain fused feature information.
The image recognition module 1040 is configured to determine the recognition result of the image to be recognized according to the fused feature information.
The operation execution module 1050 is configured to execute a predetermined operation according to the recognition result.
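Purely as a sketch of how these five modules could be wired together (the interfaces below are assumptions for illustration, not interfaces defined by the application), the device might be composed as:

```python
class ImageRecognitionDevice:
    """Composition sketch of device 1000; each argument is a callable standing
    in for one of the modules 1010-1050."""
    def __init__(self, acquire, extract, fuse, classify, act):
        self.acquire = acquire      # image acquisition module 1010
        self.extract = extract      # feature extraction module 1020
        self.fuse = fuse            # feature fusion module 1030
        self.classify = classify    # image recognition module 1040
        self.act = act              # operation execution module 1050

    def run(self, source):
        image = self.acquire(source)
        first, second = self.extract(image)
        fused = self.fuse(first, second)
        result = self.classify(fused)
        self.act(result)            # execute the predetermined operation
        return result
```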
Further, the image recognition device 1000 also includes a sample set acquisition module, a sample feature extraction module, a sample feature fusion module, a second network correction module, and a model update module, where:
The sample set acquisition module is used to acquire a plurality of sample sets, each sample set including a plurality of sample images and sample labels corresponding to the sample images, where the sample labels corresponding to the sample images in the same sample set belong to the same main category.
The sample feature extraction module is used to obtain first sample feature information and second sample feature information based on the initial feature extraction model and the sample images, the initial feature extraction model including a first feature extraction network and a second feature extraction network.
The sample feature fusion module is configured to fuse the first sample feature information and the second sample feature information to obtain sample fusion feature information.
The second network correction module is configured to correct the network parameters of the second feature extraction network in the initial feature extraction model according to the sample fusion feature information and the sample labels corresponding to the sample images.
The model update module is used to determine the initial feature extraction model including the corrected network parameters as the trained feature extraction model.
Further, the sample feature fusion module includes a feature adding unit, where the feature adding unit is configured to add the first sample feature information and the second sample feature information to obtain the sample fusion feature information.
Further, the image recognition device 1000 also includes a first feature extraction module, a first network correction module, and a first network update module, where:
The first feature extraction module is configured to obtain the first sample feature information of the sample image based on the initial first feature extraction network.
The first network correction module is configured to correct the network parameters of the initial first feature extraction network according to the first sample feature information and the sample label corresponding to the sample image.
The first network update module is configured to determine the initial first feature extraction network as the first feature extraction network of the initial feature extraction model; the first feature extraction network is used to extract the first feature information of a target image, which serves as the input of the second feature extraction network and is fused with the output of the second feature extraction network.
Further, the first network correction module includes a first loss acquisition unit and a first network correction unit, where:
The first loss acquisition unit is configured to acquire a first loss function value corresponding to the sample image according to the first sample feature information and the sample label corresponding to the sample image.
The first network correction unit is configured to correct the network parameters of the first feature extraction network based on the first loss function value.
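For illustration, the first loss function value described above can be read as a per-sample Softmax loss; a sketch under that assumption (with `weight` and `bias` standing in for the classifier parameters W and b) is:

```python
import torch

def first_loss_values(features, labels, weight, bias):
    """features: first sample feature information x_i, shape (N, D);
    weight: (num_classes, D); bias: (num_classes,); labels: (N,).
    Returns the per-sample softmax loss values."""
    logits = features @ weight.t() + bias                 # W^T x_i + b for every class
    log_probs = torch.log_softmax(logits, dim=1)
    return -log_probs[torch.arange(len(labels)), labels]  # loss for each sample image
```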
Further, the second network correction module includes a second loss acquisition unit and a second network correction unit, where:
The second loss acquisition unit is configured to acquire a second loss function value corresponding to the sample image according to the sample fusion feature information and the sample label corresponding to the sample image.
The second network correction unit is configured to correct the network parameters of the second feature extraction network based on the second loss function value.
Further, the second feature extraction network includes at least two fully connected layers. Further, the first feature extraction network is MobileNetV2.
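A minimal sketch of obtaining the 1280-dimensional feature (treated here as the feature mean) from torchvision's MobileNetV2 is shown below; pretrained-weight loading and any fine-tuning are omitted, and using torchvision's implementation is an assumption of this sketch rather than a requirement of the application.

```python
import torch
import torch.nn.functional as F
import torchvision

def build_first_network():
    """Returns MobileNetV2 and a function mapping a (N, 3, 224, 224) batch to
    (N, 1280) first feature information."""
    net = torchvision.models.mobilenet_v2()

    def extract(images: torch.Tensor) -> torch.Tensor:
        x = net.features(images)                 # (N, 1280, 7, 7) for 224x224 input
        x = F.adaptive_avg_pool2d(x, 1)          # global average pooling
        return torch.flatten(x, 1)               # (N, 1280) feature mean
    return net, extract
```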
The image recognition device provided in the embodiments of the present application is used to implement the corresponding image recognition method in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
In the several embodiments provided in this application, the coupling between the modules may be electrical, mechanical, or in other forms. In addition, the functional modules in the various embodiments of the present application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software functional modules.
FIG. 11 shows a structural block diagram of an electronic device provided by an embodiment of the present application. The electronic device 1100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, an e-book, a notebook computer, or a personal computer. The electronic device 1100 in this application may include one or more of the following components: a processor 1110, a memory 1120, and one or more application programs, where the one or more application programs may be stored in the memory 1120 and configured to be executed by the one or more processors 1110, and the one or more programs are configured to execute the method described in the foregoing method embodiments.
The processor 1110 may include one or more processing cores. The processor 1110 uses various interfaces and lines to connect the various parts of the entire electronic device 1100 and, by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1120 and by calling data stored in the memory 1120, executes the various functions of the electronic device 1100 and processes data.
Optionally, the processor 1110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understandable that the above-mentioned modem may also not be integrated into the processor 1110 and may instead be implemented by a separate communication chip.
The memory 1120 may include random access memory (RAM) or read-only memory (ROM). The memory 1120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the following method embodiments, and the like. The data storage area may also store data created by the electronic device 1100 during use (such as a phone book, audio and video data, and chat record data) and the like.
FIG. 12 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application. The computer-readable storage medium 1200 stores program code, and the program code can be invoked by a processor to execute the methods described in the foregoing embodiments. The computer-readable storage medium 1200 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1200 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1200 has storage space for the program code 1210 that performs any of the method steps in the above methods. The program code can be read from or written into one or more computer program products. The program code 1210 may, for example, be compressed in an appropriate form.


Abstract

This application discloses an image recognition method, device, electronic equipment, and storage medium, relating to the field of image processing. The method includes: obtaining an image to be recognized; obtaining first feature information and second feature information of the image to be recognized based on a trained feature extraction model, where the first feature information is used to characterize the target subcategory of the image to be recognized, the second feature information is used to characterize the difference between the target subcategory and other subcategories, and the target subcategory and the other subcategories belong to the same main category; fusing the first feature information and the second feature information to obtain fused feature information; determining the recognition result of the image to be recognized according to the fused feature information; and performing a predetermined operation according to the recognition result. By obtaining the first and second feature information through the trained feature extraction model and performing image recognition on the fused feature information, this application can take into account both the features themselves and the intra-class differences among subcategories under the same main category, thereby improving image recognition accuracy.

Description

图像识别方法、装置、电子设备及存储介质
相关申请的交叉引用
本申请要求于2020年2月27日提交的申请号为202010124982.0的中国申请的优先权,其在此出于所有目的通过引用将其全部内容并入本文
技术领域
本申请涉及图像处理技术领域,更具体地,涉及一种图像识别方法、装置、电子设备及存储介质。
背景技术
随着终端的普及和终端技术的发展,用户对终端图像识别的精度要求越来越高。例如,用户期望通过终端实时识别各种物体,但是,目前的图像识别方法多是针对特定领域,并且应用在相对复杂的系统中,很难满足终端对通用物体进行识别的精度要求,即目前终端图像识别的精度不高。
发明内容
本申请实施例提出了一种图像识别方法、装置、电子设备及存储介质。
第一方面,本申请实施例提供了一种图像识别方法,该方法包括:获取待识别图像;基于训练好的特征提取模型,得到所述待识别图像的第一特征信息和第二特征信息,其中,所述第一特征信息用于表征所述待识别图像的目标子类别,所述第二特征信息用于表征所述目标子类别与其他子类别之间的差异,所述目标子类别和所述其他子类别属于同一个主类别;将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息;根据所述融合特征信息确定所述待识别图像的识别结果;根据所述识别结果,执行预定操作。
第二方面,本申请实施例提供了一种图像识别装置,该装置包括:图像获取模块,用于获取待识别图像;特征提取模块,用于基于训练好的特征提取模型,得到所述待识别图像的第一特征信息和第二特征信息,其中,所述第一特征信息用于表征所述待识别图像的目标子类别,所述第二特征信息用于表征所述目标子类别与其他子类别之间的差异,所述目标子类别和所述其他子类别属于同一个主类别;特征融合模块,用于将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息;图像识别模块,用于根据所述融合特征信息确定所述待识别图像的识别结果;操作执行模块,用于根据所述识别结果,执行预定操作。
第三方面,本申请实施例提供了一种电子设备,包括:存储器;一个或多个处理器,与所述存储器耦接;一个或多个应用程序,其中,一个或多个应用程序被存储在存储器中并被配置为由一个或多个处理器执行,一个或多个应用程序配置用于执行上述第一方面提供的图像识别方法。
第四方面,本申请实施例提供了一种计算机可读取存储介质,计算机可读取存储介质中存储有程序代码,程序代码可被处理器调用执行上述第一方面提供的图像识别方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了本申请实施例提供的一种图像识别方法的应用场景示意图。
图2示出了本申请一个实施例提供的图像识别方法的流程示意图。
图3示出了本申请另一个实施例提供的图像识别方法的流程示意图。
图4示出了本申请一个示例性实施例中图3内步骤S240的流程示意图。
图5示出了本申请又一个实施例提供的图像识别方法的流程示意图。
图6示出了本申请一个示例性实施例中MobileNetV2的瓶颈结构示意图。
图7示出了本申请一个示例性实施例中图5内步骤S330的流程示意图。
图8示出了本申请一个示例性实施例中第一特征提取网络的训练过程示意图。
图9示出了本申请一个示例性实施例中基于特征提取模型的图像识别过程示意图。
图10示出了本申请实施例提供的图像识别装置的模块框图。
图11示出了本申请实施例提供的电子设备的结构框图。
图12示出了本申请实施例提供的用于保存或者携带实现根据本申请实施例的图像识别方法的程序代码的存储单元。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
术语定义
总体均值(Mean):描述随机变量取值平均状况的数字特征,用希腊字母表示为μ。
总体标准差(Standard Deviation):描述随机变量取值与其算术平均数之间的平均离差,用希腊字母表示为σ。
自适应时刻估计法(Adaptive Moment Estimation):是一种优化算法,它能基于训练数据迭代地更新神经网络权重,通过计算梯度的一阶矩估计和二阶矩估计而为不同的参数设计独立的自适应性学习率。
目前的图像识别方法多是针对特定领域,并且应用在相对复杂的系统中,但是目前由于智能手机和平板电脑等终端的普及、相机像素的提高,移动端的图像识别愈发收到关注,且相关技术也得到了相应发展,例如,用户可以通过终端实时识别未知的物品、或寻找相似的物品,不仅可以扩展自身的知识,满足自身的好奇心,而且能够提升用户使用终端的体验。但是,目前的图像识别方法很难满足移动端对通用物体进行识别的性能要求。
同时对于图像识别任务而言,即使相同主类别的物体,它们之间也仍可能存在明显差异,例如,都属于桌子类别的办公桌、书桌、餐桌之间可能存在明显差异,而目前的图像识别技术往往仅可将图像中的桌子分到桌子类别,而难以再具体细分到下一级类别,也就是说,目前图像识别的精度不够高。
因此,基于上述问题,本申请实施例提供了一种图像识别方法、装置、电子设备及计算机可读取存储介质,通过在图像识别时,基于训练好的特征提取模型所提取的特征,可同时考虑特征本身和同一主类别下各子类别特征之间的差异性,从而使得最终融合得到的融合特征信息不仅能够反映不同类别物体之间特征的差异,也能够反映相同类别物体之间特征的差异,由此可显著提高图像识别的精度,具有更广的应用范围。
为了便于详细说明,下面先结合附图对本申请实施例所适用的应用场景进行示例性说明。
请参见图1,图1示出了本申请实施例提供的图像识别方法的应用场景示意图,该应用场景包括本申请实施例提供的一种图像识别系统10。该通信系统10包括:终端100和服务器200。
其中,终端100可以为但不限于为手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio LayerⅢ,动态影像压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio LayerⅣ,动态影像压缩标准音频层面4)播放器、个人计算机或可穿戴电子设备等等。本申请实施例对具体的终端的设备类型不作限定。
其中,服务器200可以是传统服务器,也可以是云端服务器,可以是一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心。
在一些可能的实施方式中,终端100可获取图像,而对图像进行处理的装置可设置于服务器200,则终端100获取图像后,可将图像传输至服务器200,通过服务器200对图像进行处理后返回处理结果至终端100,由此终端可根据处理结果实现对图像的识别等。其中,处理结果可以是识别结果,也可以是识别结果前中间过程的中间结果,例如提取的特征、特征融合后的特征等,在此不做限定。
在另一些可能的实施方式中,对图像进行处理的装置也可以设置于终端100上,使得终端100无需依赖与服务器200建立通信,也可对待识别图像进行识别得到识别结果,则此时图像识别系统10可以只包括终端100。
下面将通过具体实施例对本申请实施例提供的信息处理方法、装置、电子设备及存储介质进行详细说明。
请参阅图2,图2示出了本申请实施例提供的一种图像识别方法的流程示意图,可应用于上述终端。下面将针对图2所示的流程进行详细的阐述。该图像识别方法可以包括以下步骤:
步骤S110:获取待识别图像。
其中,待识别图像可以是经目标检测后的图像,也可以是未经目标检测的原始图像,本申请实施例对此不做限定。
在一些实施方式中,若待识别图像是待处理的原始图像,则在特征提取前,即在步骤S120之前,可先对包含目标对象的待识别图像进行目标检测,将目标对象从原始图像中检测并裁剪出来,得到目标图像,以用于后续特征提取。
其中,作为一种方式,待识别图像可以是用户基于终端输入的,此时终端可获取用户输入的图像作为待识别图像。作为另一种方式,终端也可从其他终端或服务器获取待识别图像,本实施例对此不做限定。
在另一些实施方式中,若待识别图像是可直接用于特征提取的图像,则获取待识别图像的具体实施方式可包括:获取包含目标对象的原始图像,对原始图像进行目标检测,然后裁剪得到待识别图像。
另外,在一些实施例中,在对待识别图像进行特征提取之前,还可对待识别图像执行预处理操作,可包括:对待识别图像中像素点的值进行归一化处理,例如,通过将各像素点的值除以255以将各像素点的值归一化到[0,1]。
在一些实施例中,在归一化处理前,还可包括,将裁剪得到的图像缩放到指定尺寸,其中,尺寸为宽度*高度,指定尺寸可根据实际需要确定,也可为程序预设,还可为用户自定义,在此不做限定,例如指定尺寸可以为224*224,单位可为像素。
步骤S120:基于训练好的特征提取模型,得到待识别图像的第一特征信息和第二特征信息。
本申请实施例中,第一特征信息用于表征待识别图像的目标子类别,第二特征信息用于表征目标子类别与其他子类别之间的差异,其中,目标子类别和其他子类别属于同一个主类别。例如,办公桌、书桌、餐桌分别为三种不同的子类别,它们均属于桌子这个主类别。而即便都属于桌子类别,办公桌、书桌、餐桌之间仍可能存在明显差异,但是目前的图像识别模型所提取的特征不足以描述类内差异。
需要说明的是,主类别是子类别的上一层级的类别。在一些实施例中,根据分类细粒度的不同,对主类别和子类别的划分方式也相应不同。例如,暹罗猫、加菲猫、蓝猫分别为三种不同的子类别,且均属于主类别——猫。若分类细粒度更大些,则猫、狗、猪分别为三种不同的子类别,且均属于主类别——动物。
因此本申请实施例通过在提取出用于表征待识别图像的目标子类别的特征基础上,还提取可表征目标子类别与其他子类别的差异性的第二特征信息。
在一些实施方式中,训练好的特征提取模型可存储于终端本地,则终端可不依赖网络环境,无需考虑通信时间的消耗,直接在本地运行特征提取模型得到第一特征信息和第二特征信息。有利于提高图像识别的效率。
在另一些实施方式中,训练好的特征提取模型可存储于服务器,此时,可由终端将待识别图像发送至服务器,指示服务器基于训练好的特征提取模型得到第一特征信息和第二特征信息,并返回结果至终端。其中,结果可以是第一、第二特征信息,也可以是服务器继续执行步骤130得到的融合特征信息,还可以是服务器继续执行步骤S140得到的识别结果等,本实施例对此不作限定。
由此,通过将训练好的特征提取模型存储与服务器,可不必占用终端本地过多的存储和运行空间,有利于提高终端本地的运行效率。如此,也可降低终端要实现本方法所应满足的性能要求,有利于扩展应用范围。
另外,随着通信技术的迭代发展,训练好的特征提取模型存储于服务器时,终端也仍可依赖较高的网速,实现移动端的实时的图像识别,满足用户使用移动端进行图像识别的需求。
步骤S130:将第一特征信息和第二特征信息进行融合,得到融合特征信息。
通过将第一特征信息和第二特征信息进行融合,得到融合之后的特征信息,该融合之后的特征信息命名为融合特征信息。由此,融合特征信息可同时包含两个特征信息,不仅能够反映不同主类别物体之间特征的差异,也能够反映相同主类别下物体之间特征的差异,从而使得后续基于融合特征信息进行分类的精度更高,准确率也更高。
在一些实施例中,可按权重对第一特征信息和第二特征信息进行融合,得到融合特征信息,例如,第一特征信息可对应第一权重,第二特征信息可对应第二权重,然后根据第一特征信息与第一权重以及第二特征信息与第二权重,通过加权平均得到融合特征信息。在一个示例中,若记第一特征信息为A 1,第一权重为x 1,记第二特征信息为A 2,第二权重为x 2,则可基于预定公式加权平均得到融合特征信息A,预定公式可为
$$A = \frac{x_{1}A_{1} + x_{2}A_{2}}{x_{1} + x_{2}}$$
其中,第一权重与第二权重的具体数值可根据实际需要确定。作为一种方式,可记用于提取第一特征信息的网络为第一特征提取网络,记用于提取第二特征信息的网络为第二特征提取网络,则可根据训练好的第一、第二特征提取网络的评估参数,来确定第一权重和第二权重。其中,评估参数包括准确率和召回率中的至少一个。在一个示例中,所述评估参数为准确率时,可根据第一、第二特征提取网络的准确率比值,确定第一权重和第二权重的比值,并基于预定数值,计算得到第一权重和第二权重。例如,预定数值可为1,则第二权重可为1,第一权重可为第二权重与第一权重和第二权重的比值的乘积。
在其他一些实施方式中,第一权重与第二权重也可是程序预设,还可是用户自定义的,本实施例对此不做限定。
步骤140:根据融合特征信息确定待识别图像的识别结果。
根据融合特征信息对待识别图像进行分类,以确定待识别图像的分类结果,即识别结果。在一些实施方式中,可通过连接分类器,分类器用于根据输入的融合特征信息进行分类,其中,分类器可采用逻辑回归、Softmax回归、或者是支持向量机(Support Vector Machine,SVM)等,本实施例对此不做限定。
在一种实施方式中,以基于Softmax分类器进行分类为例,如果经Softmax分类器分类后的类别概率大于给定阈值则输出分类结果,若小于或等于给定阈值则可判定图像不在给定类别中。
其中,给定类别为预先划分出的图像的分类类别,可由特征提取模型的训练过程中所用样本的样本标签确定,即特征提取模型可由标注有给定类别的样本标签的样本训练得到。其中,给定阈值可以根据实际需要确定,也可由用户自定义,在此不做限定。
步骤150:根据识别结果,执行预定操作。
其中,预定操作可以是由终端输出识别结果,例如可通过语音、文字等各种方式输出识别结果,使得用户可获知待识别图像的识别结果,甚至在一些实施方式中,也可获取识别结果对应的信息,与识别结果一同由终端输出,使得用户不仅可获知待识别图像的识别结果,还可获知相关的信息以进一步扩展知识面。预定操作也可以是发送识别结果至其他终端或服务器,以同步识别结果给其他终端或服务器,在一些实施方式中,终端发送识别结果至其他终端或服务器,还可指示其他终端或服务器执行识别结果对应的操作。
另外,在一些实施例中,终端还可根据识别结果,确定识别结果对应的控制指令,并发送该控制指令至其他终端或服务器(为方便描述,可记为对端),以指示其他终端或服务器执行与该指令对应的控制操作。其中,识别结果、控制指令与控制操作之间一一对应,终端本地可至少存储有识别结果与控制指令之间的映射关系,以根据识别结果确定对应的控制指令,对端可至少存储有控制指令与控制操作之间的映射关系,以根据接收的控制指令确定对应的控制操作并执行。在一些可能的实施方式中,终端本地和对端也可存储有识别结果、控制指令与控制操作三者之间的映射关系,在此不做限定。
在一个示例中,若用户A的终端得到待识别图像的识别结果是蜜獾,可生成并播放语音“当前动物为蜜獾”;也可将包含“蜜獾”的识别结果发送至用户B的终端,使得用户B也可获知与“蜜獾”相关的信息;还可获取与蜜獾相关的图像或视频,并发送至其他终端或在本地播放等。除此之外,还可执行其他预定操作,此处不做限定。
在一些实施例中,预定操作可以由当前终端用于获取待识别图像的应用程序(Applicatiom,APP)的功能决定。其中,应用程序可以是终端系统自带的应用程序,例如相机、相册、日历等;应用程序也可以是用户由应用市场、应用商店或其他第三方平台下载安装的应用程序,例如优酷、淘宝等。本实施例对此不做限定。
在一些实施方式中,若应用程序为相机,且该应用程序具备图像识别功能,则在用户遇到未知事物时,可通过打开相机应用程序并拍摄未知事物的图像,可对该未知事物进行实时识别得到识别结果,并通过语音或文字等各种方式输出该识别结果,例如可通过语音播放该未知事物的识别结果,再如可显示该识别结果的文字信息等,以使用户可实时获知未知事物的相关信息,便于扩充用户知识面和满足用户的好奇心。
另外,在一些实施方式中,通过相机拍摄图像的过程中,还可根据对所拍摄物体进行识别得到的识别结果,确定与识别结果匹配的图像处理策略,图像处理策略包括滤镜、图像处理算法等,图像处理算法可以是通过优化图像参数以修改图像显示效果的算法,其中,图像参数可包括但不限于对比度、亮度、饱和度等的一种或多种,由此实现例如增加/减小对比度、增加/减小亮度、增加/减小饱和度等的一种或多种组合的图像处理。
作为一种方式,可预先存储识别结果与图像处理策略的映射关系,每个识别结果可对应一个图像处理策略,不同识别结果可对应相同或不同的图像处理策略,例如,识别结果A与识别结果B可属于相同类别,则终端可先确定识别结果的类别,再根据类别确定匹配的图像处理策略,以自动对待识别图像作图像处理,以提高图像显示效果,帮助用户排除更满意的照片,提升用户体验。
需要说明的是,通过相机进行图像识别时,作为一种实施方式,终端可以先拍摄获取包含待识别对象的图像后,再通过上述方法对图像进行识别得到识别结果。作为另一种实施方式,终端也可无需获取图像,而让待识别对象处于相机视野内,终端可获取视野内的图像进行识别得到识别结果,由此可进一步提高图像识别的实时性,满足用户实时识别的需求。本实施例对此不做限定。
在另一些实施方式中,若应用程序为相册类的应用程序,即具有相册功能的应用程序,终端可对相册中的照片通过本方法得到每个照片的识别结果,以根据识别结果对照片进行分类并存储至各类对应的相簿或图集中,实现相册分类,以方便用户查看和搜索等。其中,各类对应的相簿或图集可以是各个子类别对应的相簿或图集,也可以是各个主类别对应的 相簿或图集,此处不做限定。
另外,在一些示例中,若已存在识别结果对应的相簿或图集,可根据图像的识别结果将图像存储至识别结果对应的相簿或图集;若不存在识别结果对应的相簿或图集,终端可根据图像的识别结果创建识别结果对应的相簿或图集,再将图像存储至识别结果对应的相簿或图集中。例如,若图像的识别结果为“暹罗猫”而当前的相簿仅包括人、风景,可创建新的相簿“动物”用于存储识别结果与动物对应的图像,然后将识别结果为“暹罗猫”的图像存储至“动物”相簿中。
可以理解的是,以上仅为示例,本实施例提供的方法并不局限于上述场景中,但考虑篇幅原因在此不再穷举。
本申请实施例提供的图像识别方法,通过获取待识别图像,然后基于训练好的特征提取模型,得到可表征待识别图像的目标子类别的第一特征信息、以及表征目标子类别与其他子类别之间的差异的第二特征信息,其中,目标子类别和其他子类别属于同一个主类别,接着融合第一特征信息和第二特征信息得到融合特征信息进行识别,并根据识别结果,执行预定操作。由此,本申请实施例在图像识别时,基于训练好的特征提取模型所提取的特征,可同时考虑特征本身和同一主类别下各子类别特征之间的差异性,从而使得最终融合得到的融合特征信息不仅能够反映不同类别物体之间特征的差异,也能够反映相同类别物体之间特征的差异,因此可以显著提高图像识别的精度,具有更广的应用范围。
另外,在一些实施例中,在获取待识别图像之前,训练好的特征提取模型可通过下述方法训练得到,具体地,请参阅图3,图3示出了本申请另一个实施例提供的图像识别方法的流程示意图,可应用于上述终端,该图像识别方法可以包括:
步骤S210:获取多个样本集。
本申请实施例中,样本集包括多个样本图像及样本图像对应的样本标签,其中,同一个样本集中样本图像对应的样本标签属于同一个主类别。
其中,样本标签为样本图像所属于的子类别的标签即子类别标签,一个样本集对应一个主类别,即一个样本集中样本图像属于同一个主类别,即样本图像的样本标签的主类别标签相同。例如,样本集S包括样本图像A、样本图像B及样本图像C,其中,样本图像A对应样本标签为办公桌,样本图像B对应样本标签为书桌,样本图像C对应样本标签为餐桌,样本图像A、B、C均属于同一个主类别即桌子类别。
本实施例中,同一样本集中的样本图像的样本标签可以相同,也可以不同,本实施例对此不做限定。
为了保证算法的鲁棒性和适应性,可获取不同物体在不同场景下的图像数据和类别标签,并由此得到样本图像及对应的样本标签。具体地,作为一种实施方式,可基于训练好的物体检测模型,将包含目标物体的目标物体区域从原始图像中检测并裁剪出来,并将目标物体区域缩放到指定尺寸,然后将目标物体区域作归一化处理得到样本图像,例如可将目标物体区域内所有像素点的值除以255得到以将像素点的值归一化到[0,1]。同时,将原始图像对应的类别标签记录为样本图像对应的样本标签。由此可得到多个样本图像以及样本图像对应的样本标签。
其中,物体检测模型可以是由以下网络构成,例如,可以是区域卷积神经网络(Regions with CNN,RCNN)(包括RCNN、Fast RCNN以及Faster RCNN)、YOLO(You Only Look Once)网络、单镜多核检测(Single Shot multiBox Detector,SSD)网络,本实施例并不对目标检测网络的具体类型进行限定。
在一些实施方式中,物体检测模型可以采用MobileNet-SSD或MobileNet-SSDLite,具体地,可以包括但不限于MobileNetV1+SSD、MobileNetV2+SSD、MobileNetV1+SSDLite以及MobileNetV2+SSDLite等。由于MobileNet是一个用于移动端视觉识别的高效模型,因而基于前述物体检测模型,可实现实时轻量的目标检测,提高目标检测效率的效率。其中,SSDLite是对SSD结构做了修改,将SSD的预测层中所有标准卷积替换为深度可分离 卷积,可使得参数量和计算成本大大降低,计算更高效。其中,关于MobileNet的进一步说明可见后述步骤。
另外,在一些实施方式中,针对每个样本图像以及样本图像对应的样本标签,根据样本标签所属的主类别,可将样本图像及其样本标签分别存储至该主类别对应的样本集中,由此可得到多个样本集。
另外,对于不同类别的物体,图像数量越多、类别分布越广,训练得到的特征提取模型的性能和泛化能力就越好。
步骤S220:基于初始特征提取模型和样本图像,得到第一样本特征信息和第二样本特征信息。
本实施例中,初始特征提取模型包括第一特征提取网络以及第二特征提取网络,第一特征提取网络用于提取第一样本特征信息,第二特征提取网络用于提取第二特征信息。其中,第一样本特征信息是用于表征图像的目标子类别的特征向量,第二样本特征信息用于表征目标子类别与其他子类别之间差异性的特征向量,其中,目标子类别和其他子类别属于同一个主类别。
在一些实施方式中,第一特征提取网络可以为MobileNetV1或MobileNetV2。其中,MobileNetV1是一种为移动设备设计的通用计算机视觉神经网络,能够支持图像分类和检测等任务。MobileNetV2在MobileNetV1的基础上提升后的版本,可用于图像分类、目标检测和语义分割,并MobileNetV2实现特征提取的速度更快,准确率更高。
作为一种实施方式,终端可采用MobileNetV2网络作为初始特征提取模型的主干网络,从而可以大大降低模型大小,使得模型更加轻量化,以适用于在移动端部署,满足终端尤其是移动终端对实时性、轻量化和高性能的要求。
在另一些实施方式中,第一特征提取网络还可以为其他网络,例如去除分类模块的卷积神经网络,此时第一特征提取网络可以是保留到最后一个卷积层(convolution layer)的卷积神经网络。再如,第一特征提取网络可以采用深度卷积神经网络例如ResNet101。另外,第一特征提取网络还可以采用其他卷积神经网络,例如Inception-Resnet-V2、NasNet等,本实施例对此不作限定。
本实施例中,初始特征提取模型以第一特征提取网络作为主干网络,用于提取第一样本特征信息,并在第一特征提取网络后添加第二特征提取网络,用于根据第一样本特征信息得到第二样本特征信息。
在一些实施方式中,第二特征提取网络可包括至少两层全连接层(Fully Connected Layer,FC),其维度和第一特征提取网络的输出维度保持一致。即在第一特征提取网络后添加至少两层全连接层,得到初始特征提取模型。在一个示例中,可在在MobileNetV2后添加两层全连接层,维度和MobileNetV2模型的输出维度保持一致,进行训练。
步骤S230:将第一样本特征信息和第二样本特征信息融合,得到样本融合特征信息。
在一些实施方式中,可通过将第一样本特征信息和第二样本特征信息相加得到样本融合特征信息。具体地,可将第一样本特征信息和第二样本特征信息的元素对应相加。作为一种实施方式,第一样本特征信息和第二样本特征信息均是特征向量,且维度相同,因而可将各自特征向量中每个元素对应相加得到样本融合特征信息的每个元素的值,从而得到融合第一样本特征信息和第二样本特征信息的样本融合特征信息。
步骤S240:根据样本融合特征信息和样本图像对应的样本标签,修正初始特征提取模型中第二特征提取网络的网络参数。
其中,网络参数可以包括网络的权重。
本实施例中,在训练第二特征提取网络前,可先对第一特征提取网络进行训练,即在修正第二特征提取网络的网络参数时,第一特征提取网络已预先训练好。因此,在训练第二特征提取网络时,可保持第一特征提取网络的网络参数不变,仅修正第二特征提取网络的网络参数,从而在第一特征提取网络输出的第一样本特征信息可表征样本图像的目标子 类别时,可通过第二特征提取网络提取出可表征类内特征差异性的第二样本特征信息。
在一些实施例中,步骤S240可包括步骤S241至步骤S242,以训练第二特征提取网络,使其可提取表征类内特征差异性的特征,提高后续分类的精度和准确率。具体地,请参阅图4,图4示出了本申请一个示例性实施例中图3内步骤S240的流程示意图,步骤S240包括:
步骤S241:根据样本融合特征信息和样本图像对应的样本标签,获取样本图像对应的第二损失函数值。
在一些实施方式中,可从多个样本集中取同一个主类别的样本图像作为一个训练批次,从而可得到n个训练批次,用于训练第二特征提取网络,以修正第二特征提取网络的网络参数,从而可以使得一个训练批次的样本的主类别相同。作为一种方式,可将一个样本集作为一个训练批次进行训练。作为另一种方式,也可根据样本集,按每个训练批次的预定样本数量从样本集中取预定样本数量的样本图像及对应的样本标签作为一个训练批次进行训练。其中,预定样本数量可根据实际需要确定,本实施例对此不做限定。可以理解的是,预定样本数量越高,一个训练批次所包含的样本图像数量越高,一个批次的训练量越大。
需要说明的是,每个类别可重复,即不同训练批次中样本的目标类别可以重复。其中,目标类别可以是主类别也可以是子类别,在此不做限定。在一个示例中,训练批次1对应的主类别为桌子,即训练批次1包含的样本的主类别为桌子,训练批次2对应的主类别也可以同为桌子。另外,不同训练批次对应的主类别也可以不重复,在此不做限定,即在前述示例中,训练批次2对应的主类别也可以不是桌子而是椅子、训练批次3对应的主类别可以是电脑,训练批次4…。
在一些实施方式中,若最终得到n个训练批次,每个批次内包含的都是属于同一个主类别的物体的不同图像。然后分批次将每一个训练批次的样本图像输入初始特征提取模型进行训练,训练过程中保持第一特征提取网络的网络参数不变,仅训练第二特征提取网络的网络参数,接着将第二特征提取网络的输出和第一特征提取网络的输出进行融合得到最终特征即样本融合特征信息,最后根据样本融合特征信息进行分类,得到分类结果,根据样本图像对应的分类结果与样本图像对应的样本标签,基于第二损失函数得到样本图像对应的第二损失函数值。
在一些实施方式中,第二损失函数可设置为Softmax Loss,在另一些实施方式中,第二损失函数也可设置为L2Loss、Focal Loss等,在此不做限定。
在一些实施方式中,第一特征提取网络可以是预先基于相同训练批次的样本图像即对应的样本标签进行训练的,即第一特征提取网络和第二特征提取网络的训练过程中,都是将同一主类别下的样本图像作为一个训练批次(batch)训练的。具体实施方式可见后述实施例,在此不作赘述。因此,第一特征提取网络的输出可用于描述特征平均状况,即可将第一特征提取网络的输出视为特征的均值(Mean),记为特征均值(logit-mu),第二特征提取网络的输出可用于描述特征取值与其平均数之间的平均离差,即可将第二特征提取网络的输出视为特征的标准差(Standard Deviation),记为特征标准差(logit-sigma),用于反映同一主类别类内特征的差异性,即同一主类别下子类别之间的差异性。因此基于包括第一特征提取网络和第二特征提取网络的特征提取模型进行特征提取后,再根据特征均值和特征标准差融合得到的最终特征,不仅能够反映不同类别物体之间特征的差异,也能够反映相同类别物体之间特征的差异,因此可以显著提高模型识别的精度,具有更广的应用范围。
步骤S242:基于第二损失函数值修正第二特征提取网络的网络参数。
在一些实施方式中,获取第二损失函数值后,可基于预定优化算法修正第二特征提取网络的网络参数,直到第二损失函数值满足第二收敛条件,可停止对第二特征提取网络的训练并得到训练好的第二特征提取网络,即包含修正后的网络参数的第二特征提取网络。若第二损失函数值不满足第二收敛条件,可继续获取下一个样本图像进行下一轮训练。
其中,第二收敛条件可以是一个预设阈值,当第二损失函数小于该预设阈值时,可判定第二损失函数满足第二收敛条件,否则不满足。可以理解的是,预设阈值越小,模型训练的要求越高,最终第二损失函数满足第二收敛条件的网络可实现的效果可以越好。例如,若第二损失函数收敛到最小值附近可判定满足预设收敛条件,其中最小值可以是一个数值,以该最小值为中心可设置一个置信度范围,当第二损失函数收敛到该置信度范围时,即可认为收敛到最小值附近,进一步可判定第二损失函数满足第二收敛条件。
其中,预定优化算法可以是自适应时刻估计法(Adaptive Moment Estimation,ADAM)。在一种实施方式中,基于ADAM修正第二特征提取网络的网络参数时,可设置动量因子BETA_1为0.9,动量因子BETA_2为0.999,基础学习率(LEARNING_RATE)设为0.001,并且随着迭代次数的增加逐渐下降,以加快收敛速度。具体地,迭代次数每增加300,000次,学习率下降为原来的0.3。以初始的基础学习率为0.001为例,则在完成300,000次迭代后,将基础学习率更新为0.0003,在完成600,000次迭代后,将基础学习率更新为0.00009,以此类推,直到第二损失函数满足第二收敛条件。由此,本实施例通过大量数据训练第二特征提取网络完成后即可得到修正后的第二特征提取网络的网络参数。
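A minimal sketch of the optimizer settings just described (ADAM with BETA_1 = 0.9, BETA_2 = 0.999, a base learning rate of 0.001, decayed to 0.3 of its value every 300,000 iterations) could look as follows in PyTorch; the application itself provides no reference code, so the function name and the choice of scheduler are assumptions of this sketch.

```python
import torch

def make_optimizer(parameters):
    optimizer = torch.optim.Adam(parameters, lr=0.001, betas=(0.9, 0.999))
    # StepLR multiplies the learning rate by 0.3 every 300,000 scheduler steps;
    # calling scheduler.step() once per training iteration reproduces the schedule.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300_000, gamma=0.3)
    return optimizer, scheduler
```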
步骤S250:将包含修正后的网络参数的初始特征提取模型确定为训练好的特征提取模型。
将训练好的第二特征提取网络和第一特征提取网络组成的模型确定为训练好的特征提取模型,可用于根据待识别图像提取第一特征信息和第二特征信息,用于对待识别图像进行识别。
需要说明的是,本实施例中未详细描述的部分可以参考前述实施例,在此不再赘述。
对于图像识别任务而言,即使相同类别的物体,类内也可能存在较为显著的差异,比如办公桌、书桌、餐桌都属于桌子类别但是它们之间还是具有明显的差异,即便是办公桌之间也可能存在不小的差异,因此如果仅仅通过第一特征提取网络如MobileNetV2的输出,本实施例中将其记为特征均值,是很难将各种类别的物体区分开的,也就是说目前分类模型的准确率和召回率并不能达到令人满意的效果。为此,本实施例提供的图像识别方法引入了特征标准差的概念来表示这种类内的差异性,即对于每种类别的物体而言,不仅需要提取特征的均值,而且需要提取特征的标准差用来表示类内物体的差异性,最后将均值和标准差融合得到最终的特征,再进行分类,可以显著提高最终分类的精度。
在一些实施例中,在修正第二特征提取网络的网络参数前,可先修正第一特征提取网络的网络参数,使得在修正第二特征提取网络时,一方面使输入第二特征提取网络的特征更准确,另一方面也使得后续融合得到的融合特征信息对特征的表征性能更佳,从而提升整个特征提取模型的性能,有利于提升图像分类精度和准率。具体地,请参阅图5,其示出了本申请又一个实施例提供的图像识别方法的流程示意图,该方法可包括以下步骤:
步骤S310:获取多个样本集。
步骤S320:基于初始第一特征提取网络,得到样本图像的第一样本特征信息。
本实施例中,可从多个样本集中取同一个主类别的样本图像作为一个训练批次,从而可得到n个训练批次,每个训练批次的样本的主类别相同,然后基于初始第一特征提取网络可得到样本图像的第一样本特征信息。其中如何按训练批次进行训练的具体方式可参考前述对步骤S241的描述,在此不再赘述。
进一步地,将样本集中的样本图像和样本图像对应的样本标签输入初始第一特征提取网络中以训练初始第一特征提取网络。具体地,基于初始第一特征提取网络和样本图像可得到样本图像第一样本特征信息。
其中,初始第一特征提取网络可以是MobileNet等各种网络,具体可参考前述对第一特征提取网络的描述,在此不再赘述。
为详细说明训练过程,下面以初始第一特征提取网络是MobileNetV2为例进行说明。 请参阅图6,其示出了本申请一个示例性实施例中MobileNetV2的瓶颈结构示意图。其中,步长(Stride)为1时,如图6(a),先对输入(Input)基于线性整流函数(Rectified Linear Unit,ReLU)进行1×1升维,再进行深度卷积(Depthwise,DW)提取特征,再通过线性(Linear)的逐点卷积降维得到输出(Output),最终通过捷径(Shortcut)结构(图6中从输入到相加(Add)的曲线)将Input与Output相加,形成残差结构。步长为2时,如图6(b),因为Input与Output的尺寸不符,因此不添加捷径结构,其余均一致。其中,ReLU具体采用的是ReLU6,即在普通的ReLU基础上限制最大输出值为6,这是为了在移动端设备float16的低精度的时候,也能有很好的数值分辨率。
步骤S330:根据第一样本特征信息和样本图像对应的样本标签,修正初始第一特征提取网络的网络参数。
在一些实施例中,步骤S330可包括步骤S331至步骤S332,以修正第一特征提取网络的网络参数。具体地,请参阅图7,图7示出了本申请一个示例性实施例中图5内步骤S330的流程示意图,步骤S330包括:
步骤S331:根据第一样本特征信息和样本图像对应的样本标签,获取样本图像对应的第一损失函数值。
在一些实施方式中,根据第一样本特征信息进行分类,例如可采用Softmax分类器进行分类,得到样本图像对应的分类结果,然后可根据样本图像对应的分类结果和样本标签,获取样本图像对应的第一损失函数值。
在一个示例中,请参阅图8,其示出了本申请一个示例性实施例提供的第一特征提取网络的训练过程示意图,如图8所示,基于第一特征提取网络,可得到第一样本特征信息,然后基于分类器如Softmax分类器进行分类得到分类结果即样本图像对应的分类标签,以用于和样本标签用于得到样本图像对应的第一损失函数值。
本实施例中,基于第一损失函数,根据样本图像对应的分类结果和样本标签可得到样本图像对应的第一损失函数值。
在一种实施方式中,第一损失函数可为Softmax Loss。并在一个示例中,Softmax Loss的公式(1)可如下:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{k}e^{W_{j}^{T}x_i + b_j}} \qquad (1)$$
其中,x i表征第i个样本图像经MobileNetV2的输出向量,即第一样本特征信息,W为权重向量,b表征偏置,y i表征第i个样本图像对应的样本标签。由此根据公式(1)可得到样本图像对应的第一损失函数值。
步骤S332:基于第一损失函数值修正初始第一特征提取网络的网络参数。
求出第一损失函数值后,可利用机器学习算法修正初始第一特征提取网络的网络参数,即优化初始第一特征提取网络,以可得到一个包含修正后的网络参数初始第一特征提取网络。其中,机器学习算法可以是ADAM或其他算法,此处不做限定。
在一种实施方式中,基于ADAM算法进行优化的参数设置可根据实际需要确定,也可参见前述实施例所述的参数进行设置,在此不再赘述。
将初始第一特征提取网络确定为初始特征提取模型的第一特征提取网络,其中,初始第一特征提取网络的网络参数已被修正,即将训练好的初始第一特征提取网络确定为初始特征提取模型的第一特征提取网络。
在一种实施方式中,若第一特征提取网络为MobileNetV2,第一特征提取网络的网络结构可如表1所示。
表1
输入 算子(Operator) t c n s
224²×3 conv2d - 32 1 2
112²×32 bottleneck 1 16 1 1
112²×16 bottleneck 6 24 2 2
56²×24 bottleneck 6 32 3 2
28²×32 bottleneck 6 64 4 2
14²×64 bottleneck 6 96 3 1
14²×96 bottleneck 6 160 3 2
7²×160 bottleneck 6 320 1 1
7²×320 conv2d 1x1 - 1280 1 1
7²×1280 avgpool 7x7 - - 1 -
1²×1280 conv2d 1x1 - k 1 -
于表1中,t表征“扩张”倍数(输入通道的倍增系数),c表征输出通道数,n表征重复次数,s表征步长stride,k为总的图像类别数目。可选地,图像类别数目可以是子类别的数目。另外,在一些其他实施方式中,图像类别数目也可以是主类别的数目。
步骤S340:将初始第一特征提取网络确定为初始特征提取模型的第一特征提取网络。
将训练后的初始第一特征提取网络确定为初始特征提取模型的第一特征提取网络。其中,第一特征提取网络用于提取目标图像的第一特征信息,以及用于作为第二特征提取网络的输入,并与第二特征提取网络的输出进行融合。其中,目标图像表征待提取特征的图像,如输入初始特征提取模型的图像。
步骤S350:基于初始特征提取模型和样本图像,得到第一样本特征信息和第二样本特征信息。
在一种实施方式中,初始特征提取模型中,第二特征提取网络在第一特征提取网络之后,第一特征提取网络的输出为第二特征提取网络的输入。
基于初始特征提取模型和样本图像,样本图像先经过第一特征提取网络得到第一样本特征信息,然后第一样本特征信息经过第二特征提取网络得到第二样本特征信息。
本实施例中,第二特征提取网络包括至少两层全连接层,且维度和第一特征提取网络的输出维度保持一致。
步骤S360:将第一样本特征信息和第二样本特征信息融合,得到样本融合特征信息。
将第一样本特征信息和第二样本特征信息融合得到样本融合特征信息,在一种实施方式中,由于第一样本特征信息和第二样本特征信息的维度一致,二者的融合方式可以为对应元素相加。
步骤S370:根据样本融合特征信息和样本图像对应的样本标签,修正初始特征提取模型中第二特征提取网络的网络参数。
步骤S380:将包含修正后的网络参数的初始特征提取模型确定为训练好的特征提取模型。
需要说明的是,本实施例中未详细描述的部分可以参考前述实施例,在此不再赘述。
在一些实施例中,上述实施例仅将用于特征提取的算法作为一个模型即特征提取模型进行描述,实际上根据模型整合需要,还可将基于融合特征信息进行分类得到分类结果的算法加在特征提取模型之后,以得到图像识别模型。
下面以图9为例,对基于本实施例训练得到特征提取模型对待识别图像进行识别的方法进行说明。
首先,获取输入图像,然后检测出目标区域并进行缩放和归一化处理,具体地可基于训练好的物体检测模型,比如MobileNet-SSD,将目标物体从输入图像中检测并裁剪的目标物体区域缩放到224*224,然后将所有像素点的值归一化到[0,1],即将所有像素点的值除以255,由此通过对输入图像进行目标检测和预处理可得到待识别图像。接着,基于训 练好的特征提取模型,待识别图像先经第一特征提取网络(如MobileNetV2)得到第一特征信息即特征均值,再经第二特征提取网络(如两层FC)得到第二特征信息即特征标准差,然后将第一特征信息和第二特征信息进行融合得到融合特征信息,并根据融合特征信息进行分类,如可基于Softmax分类器进行分类得到分类结果,即待识别图像对应的标签,从而确定待识别图像的识别结果。
在一些实施方式中,如果经Softmax分类器分类后的类别概率大于给定阈值则输出分类结果,否则判定图像不在给定类别中。
在另一些实施例中,还可将用于目标检测的物体检测模型加在特征提取模型之前。可以理解的是,以本申请实施例所提供的特征提取模型进行特征提取的方法均应在本申请保护范围内。
本实施例提供的图像识别方法,特征提取模型的主体框架基于MobileNetV2网络,因此可实现移动端的实时预测,同时为了提高模型识别的精度,还提出了特征标准差的概念,并给出了具体的训练方式,对于图像识别任务而言,即使相同类别的物体,类内物体也可能存在较为显著的差异,而特征标准差可以用来表示这种类内的差异性,因此,根据特征均值和特征标准差融合得到的最终特征即融合特征信息,不仅能够反映不同类别物体之间特征的差异,也能够反映相同类别物体之间特征的差异,因此可以显著提高模型识别的精度,具有更广的应用范围。
请参阅图10,其示出了本申请实施例提供的一种图像识别装置1000的结构框图,该图像识别装置1000可应用于上述终端,该图像识别装置1000可以包括:图像获取模块1010、特征提取模块1020、特征融合模块1030、图像识别模块1040以及操作执行模块1050,具体地:
图像获取模块1010,用于获取待识别图像;
特征提取模块1020,用于基于训练好的特征提取模型,得到所述待识别图像的第一特征信息和第二特征信息,其中,所述第一特征信息用于表征所述待识别图像的目标子类别,所述第二特征信息用于表征所述目标子类别与其他子类别之间的差异,所述目标子类别和所述其他子类别属于同一个主类别;
特征融合模块1030,用于将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息;
图像识别模块1040,用于根据所述融合特征信息确定所述待识别图像的识别结果;
操作执行模块1050,用于根据所述识别结果,执行预定操作。
进一步地,所述图像识别装置1000还包括:样本集获取模块、样本特征提取模块、样本特征融合模块、第二网络修正模块以及模型更新模块,其中:
样本集获取模块,用于获取多个样本集,所述样本集包括多个样本图像及所述样本图像对应的样本标签,其中,同一个样本集中样本图像对应的样本标签属于同一个主类别;
样本特征提取模块,用于基于初始特征提取模型和所述样本图像,得到第一样本特征信息和第二样本特征信息,所述初始特征提取模型包括第一特征提取网络以及第二特征提取网络;
样本特征融合模块,用于将所述第一样本特征信息和所述第二样本特征信息融合,得到样本融合特征信息;
第二网络修正模块,用于根据所述样本融合特征信息和所述样本图像对应的样本标签,修正所述初始特征提取模型中所述第二特征提取网络的网络参数;
模型更新模块,用于将包含修正后的网络参数的初始特征提取模型确定为所述训练好的特征提取模型。
进一步地,所述样本特征融合模块包括:特征相加单元,其中:
特征相加单元,用于将所述第一样本特征信息和所述第二样本特征信息相加得到所述样本融合特征信息。
进一步地,所述图像识别装置1000还包括:第一特征提取模块、第一网络修正模块以及第一网络更新模块,其中:
第一特征提取模块,用于基于所述初始第一特征提取网络,得到所述样本图像的第一样本特征信息;
第一网络修正模块,用于根据所述第一样本特征信息和所述样本图像对应的样本标签,修正所述初始第一特征提取网络的网络参数;
第一网络更新模块,用于将所述初始第一特征提取网络确定为所述初始特征提取模型的第一特征提取网络,所述第一特征提取网络用于提取目标图像的第一特征信息,以及用于作为所述第二特征提取网络的输入,并与所述第二特征提取网络的输出进行融合。
进一步地,所述第一网络修正模块包括:第一损失获取单元以及第一网络修正单元,其中:
第一损失获取单元,用于根据所述第一样本特征信息和所述样本图像对应的样本标签,获取所述样本图像对应的第一损失函数值;
第一网络修正单元,用于基于所述第一损失函数值修正所述第一特征提取网络的网络参数。
进一步地,所述第二网络修正模块包括:第二损失获取单元以及第二网络修正单元,其中:
第二损失获取单元,用于根据所述样本融合特征信息和所述样本图像对应的样本标签,获取所述样本图像对应的第二损失函数值;
第二网络修正单元,用于基于所述第二损失函数值修正所述第二特征提取网络的网络参数。
进一步地,所述第二特征提取网络包括至少两层全连接层。
进一步地,所述第一特征提取网络为MobileNetV2。
本申请实施例提供的图像识别装置用于实现前述方法实施例中相应的图像识别方法,并具有相应的方法实施例的有益效果,在此不再赘述。
在本申请所提供的几个实施例中,模块相互之间的耦合可以是电性,机械或其它形式的耦合。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
请参考图11,其示出了本申请实施例提供的一种电子设备的结构框图。该电子设备1100可以是智能手机、平板电脑、电子书、笔记本电脑、个人计算机等能够运行应用程序的电子设备。本申请中的电子设备1100可以包括一个或多个如下部件:处理器1110、存储器1120以及一个或多个应用程序,其中一个或多个应用程序可以被存储在存储器1120中并被配置为由一个或多个处理器1110执行,一个或多个程序配置用于执行如前述方法实施例所描述的方法。
处理器1110可以包括一个或者多个处理核。处理器1110利用各种接口和线路连接整个电子设备1100内的各个部分,通过运行或执行存储在存储器1120内的指令、程序、代码集或指令集,以及调用存储在存储器1120内的数据,执行电子设备1100的各种功能和处理数据。可选地,处理器1110可以采用数字信号处理(Digital Signal Processing,DSP)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、可编程逻辑阵列(Programmable Logic Array,PLA)中的至少一种硬件形式来实现。处理器1110可集成中央处理器(Central Processing Unit,CPU)、图像处理器(Graphics Processing Unit,GPU)和调制解调器等中的一种或几种的组合。其中,CPU主要处理操作系统、用户界面和应用程序等;GPU用于负责显示内容的渲染和绘制;调制解调器用于处理无线通信。可以理解的是,上述调制解调器也可以不集成到处理器1110中,单独通过一块通信芯片进行实现。
存储器1120可以包括随机存储器(Random Access Memory,RAM),也可以包括只读存储器(Read-Only Memory)。存储器1120可用于存储指令、程序、代码、代码集或指令集。存储器1120可包括存储程序区和存储数据区,其中,存储程序区可存储用于实现操作系统的指令、用于实现至少一个功能的指令(比如触控功能、声音播放功能、图像播放功能等)、用于实现下述各个方法实施例的指令等。存储数据区还可以存储电子设备1100在使用中所创建的数据(比如电话本、音视频数据、聊天记录数据)等。
请参考图12,其示出了本申请实施例提供的一种计算机可读取存储介质的结构框图。该计算机可读取存储介质1200中存储有程序代码,所述程序代码可被处理器调用执行上述实施例中所描述的方法。
计算机可读取存储介质1200可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。可选地,计算机可读取存储介质1200包括非易失性计算机可读取存储介质(non-transitory computer-readable storage medium)。计算机可读取存储介质1200具有执行上述方法中的任何方法步骤的程序代码1210的存储空间。这些程序代码可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。程序代码1210可以例如以适当形式进行压缩。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不驱使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (20)

  1. 一种图像识别方法,其特征在于,所述方法包括:
    获取待识别图像;
    基于训练好的特征提取模型,得到所述待识别图像的第一特征信息和第二特征信息,其中,所述第一特征信息用于表征所述待识别图像的目标子类别,所述第二特征信息用于表征所述目标子类别与其他子类别之间的差异,所述目标子类别和所述其他子类别属于同一个主类别;
    将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息;
    根据所述融合特征信息确定所述待识别图像的识别结果;
    根据所述识别结果,执行预定操作。
  2. 根据权利要求1所述的方法,其特征在于,所述获取待识别图像之前,所述方法还包括:
    获取多个样本集,所述样本集包括多个样本图像及所述样本图像对应的样本标签,其中,同一个样本集中样本图像对应的样本标签属于同一个主类别;
    基于初始特征提取模型和所述样本图像,得到第一样本特征信息和第二样本特征信息,所述初始特征提取模型包括第一特征提取网络以及第二特征提取网络;
    将所述第一样本特征信息和所述第二样本特征信息融合,得到样本融合特征信息;
    根据所述样本融合特征信息和所述样本图像对应的样本标签,修正所述初始特征提取模型中所述第二特征提取网络的网络参数;
    将包含修正后的网络参数的初始特征提取模型确定为所述训练好的特征提取模型。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述第一样本特征信息和所述第二样本特征信息融合,得到样本融合特征信息,包括:
    将所述第一样本特征信息和所述第二样本特征信息相加得到所述样本融合特征信息。
  4. 根据权利要求2或3所述的方法,其特征在于,所述基于初始特征提取模型和所述样本图像,得到第一样本特征信息和第二样本特征信息之前,所述方法还包括:
    基于所述初始第一特征提取网络,得到所述样本图像的第一样本特征信息;
    根据所述第一样本特征信息和所述样本图像对应的样本标签,修正所述初始第一特征提取网络的网络参数;
    将所述初始第一特征提取网络确定为所述初始特征提取模型的第一特征提取网络,所述第一特征提取网络用于提取目标图像的第一特征信息,以及用于作为所述第二特征提取网络的输入,并与所述第二特征提取网络的输出进行融合。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述第一样本特征信息和所述样本图像对应的样本标签,修正所述初始第一特征提取网络的网络参数,包括:
    根据所述第一样本特征信息和所述样本图像对应的样本标签,获取所述样本图像对应的第一损失函数值;
    基于所述第一损失函数值修正所述初始第一特征提取网络的网络参数。
  6. 根据权利要求2至5任一项所述的方法,其特征在于,所述根据所述样本融合特征信息和所述样本图像对应的样本标签,修正所述初始特征提取模型中所述第二特征提取网络的网络参数,包括:
    根据所述样本融合特征信息和所述样本图像对应的样本标签,获取所述样本图像对应的第二损失函数值;
    基于所述第二损失函数值修正所述第二特征提取网络的网络参数。
  7. 根据权利要求2至6任一项所述的方法,其特征在于,所述第二特征提取网络包括至少两层全连接层。
  8. 根据权利要求2至7任一项所述的方法,其特征在于,所述第一特征提取网络为MobileNetV2。
  9. 根据权利要求2至8任一项所述的方法,其特征在于,在所述获取多个样本集之前,所述方法还包括:
    基于训练好的物体检测模型,从原始图像中检测并裁剪出包含目标物体的目标物体区域;
    缩放所述目标物体区域到指定尺寸并作归一化处理,得到样本图像;
    将所述原始图像对应的类别标签,确定为所述样本图像对应的样本标签。
  10. 根据权利要求9所述的方法,其特征在于,所述获取多个样本集,包括:
    确定所述样本标签所属的主类别;
    将所述样本图像及所述样本图像对应的所述样本标签存储至所述主类别对应的样本集中,得到多个样本集。
  11. 根据权利要求2至10任一项所述的方法,其特征在于,所述将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息,包括:
    确定所述第一特征信息对应的第一权重,以及所述第二特征信息对应的第二权重;
    对所述第一特征信息与所述第一权重以及所述第二特征信息与所述第二权重进行加权平均,得到融合特征信息。
  12. 根据权利要求11所述的方法,其特征在于,所述确定所述第一特征信息对应的第一权重,以及所述第二特征信息对应的第二权重,包括:
    确定训练好的所述第一特征提取网络和训练好的所述第二特征提取网络的评估参数,所述评估参数包括准确率和召回率中的至少一个;
    基于所述评估参数,确定所述第一特征信息对应的第一权重,以及所述第二特征信息对应的第二权重。
  13. 根据权利要求12所述的方法,其特征在于,所述评估参数为所述准确率时,所述基于所述评估参数,确定所述第一特征信息对应的第一权重,以及所述第二特征信息对应的第二权重,包括:
    确定训练好的所述第一特征提取网络和训练好的所述第二特征提取网络的准确率比值;
    基于预定数值以及所述准确率比值,确定所述第一特征信息对应的第一权重,以及所述第二特征信息对应的第二权重。
  14. 根据权利要求1至13任一项所述的方法,其特征在于,所述根据所述识别结果,执行预定操作,包括:
    确定所述识别结果对应的控制指令;
    发送所述控制指令至其他终端或服务器,以指示所述其他终端或服务器执行与所述控制指令对应的控制操作。
  15. 根据权利要求1至14任一项所述的方法,其特征在于,所述根据所述识别结果,执行预定操作,包括:
    确定与识别结果匹配的图像处理策略;
    根据所述图像处理策略,对所述待识别图像进行图像处理。
  16. 根据权利要求1至15任一项所述的方法,其特征在于,所述待识别图像包括相册中的照片,所述根据所述识别结果,执行预定操作,包括:
    根据所述相册中每张照片的所述识别结果,生成每个子类别或者每个主类别的图集。
  17. 根据权利要求16所述的方法,其特征在于,所述根据所述相册中每张照片的图像识别结果,生成每个子类别或者每个主类别的图集,包括:
    当不存在与所述识别结果对应的图集时,创建与识别结果对应的图集;
    存储所述识别结果对应的照片至创建的所述图集中,得到每个子类别或者每个主类别 的图集。
  18. 一种图像识别装置,其特征在于,所述装置包括:
    图像获取模块,用于获取待识别图像;
    特征提取模块,用于基于训练好的特征提取模型,得到所述待识别图像的第一特征信息和第二特征信息,其中,所述第一特征信息用于表征所述待识别图像的目标子类别,所述第二特征信息用于表征所述目标子类别与其他子类别之间的差异,所述目标子类别和所述其他子类别属于同一个主类别;
    特征融合模块,用于将所述第一特征信息和所述第二特征信息进行融合,得到融合特征信息;
    图像识别模块,用于根据所述融合特征信息确定所述待识别图像的识别结果;
    操作执行模块,用于根据所述识别结果,执行预定操作。
  19. 一种电子设备,其特征在于,包括:
    一个或多个处理器;
    存储器;
    一个或多个应用程序,其中所述一个或多个应用程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个应用程序配置用于执行如权利要求1-17任一项所述的方法。
  20. 一种计算机可读取存储介质,其特征在于,所述计算机可读取存储介质中存储有程序代码,所述程序代码可被处理器调用执行所述权利要求1-17任一项所述的方法。
PCT/CN2021/074191 2020-02-27 2021-01-28 图像识别方法、装置、电子设备及存储介质 WO2021169723A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010124982.0A CN111368893B (zh) 2020-02-27 2020-02-27 图像识别方法、装置、电子设备及存储介质
CN202010124982.0 2020-02-27

Publications (1)

Publication Number Publication Date
WO2021169723A1 true WO2021169723A1 (zh) 2021-09-02

Family

ID=71212208

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074191 WO2021169723A1 (zh) 2020-02-27 2021-01-28 图像识别方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN111368893B (zh)
WO (1) WO2021169723A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657414A (zh) * 2021-10-19 2021-11-16 广州微林软件有限公司 一种物体识别方法
CN113744721A (zh) * 2021-09-07 2021-12-03 腾讯音乐娱乐科技(深圳)有限公司 模型训练方法、音频处理方法、设备及可读存储介质
CN113922500A (zh) * 2021-09-17 2022-01-11 国网山西省电力公司输电检修分公司 一种输电线路状态多源监测数据接入方法及装置
CN114022960A (zh) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质
CN115170250A (zh) * 2022-09-02 2022-10-11 杭州洋驼网络科技有限公司 一种电商平台的物品信息管理方法以及装置
CN115495712A (zh) * 2022-09-28 2022-12-20 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115578584A (zh) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 图像处理方法、图像处理模型的构建和训练方法
CN115761239A (zh) * 2023-01-09 2023-03-07 深圳思谋信息科技有限公司 一种语义分割方法及相关装置
CN116612287A (zh) * 2023-07-17 2023-08-18 腾讯科技(深圳)有限公司 图像识别方法、装置、计算机设备和存储介质
CN117152027A (zh) * 2023-10-31 2023-12-01 广东中科凯泽信息科技有限公司 一种基于图像处理和人工智能识别的智能望远镜

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368893B (zh) * 2020-02-27 2023-07-25 Oppo广东移动通信有限公司 图像识别方法、装置、电子设备及存储介质
CN113536870A (zh) * 2020-07-09 2021-10-22 腾讯科技(深圳)有限公司 一种异常头像识别方法及装置
CN111753854B (zh) * 2020-07-28 2023-12-22 腾讯医疗健康(深圳)有限公司 图像处理方法、装置、电子设备及存储介质
CN112308090A (zh) * 2020-09-21 2021-02-02 北京沃东天骏信息技术有限公司 图像分类方法及装置
CN112101477A (zh) * 2020-09-23 2020-12-18 创新奇智(西安)科技有限公司 目标检测方法及装置、电子设备、存储介质
CN112101476A (zh) * 2020-09-23 2020-12-18 创新奇智(西安)科技有限公司 一种图片分类方法、装置、电子设备及存储介质
CN112163377A (zh) * 2020-10-13 2021-01-01 北京智芯微电子科技有限公司 变压器温度预警模型的获取方法和装置及温度预测方法
CN112364912B (zh) * 2020-11-09 2023-10-13 腾讯科技(深圳)有限公司 信息分类方法、装置、设备及存储介质
CN112418303A (zh) * 2020-11-20 2021-02-26 浙江大华技术股份有限公司 一种识别状态模型的训练方法、装置及计算机设备
CN112580581A (zh) * 2020-12-28 2021-03-30 英特灵达信息技术(深圳)有限公司 目标检测方法、装置及电子设备
CN112651445A (zh) * 2020-12-29 2021-04-13 广州中医药大学(广州中医药研究院) 基于深度网络多模态信息融合的生物信息识别方法和装置
CN113569894B (zh) * 2021-02-09 2023-11-21 腾讯科技(深圳)有限公司 图像分类模型的训练方法、图像分类方法、装置及设备
CN113052159A (zh) * 2021-04-14 2021-06-29 中国移动通信集团陕西有限公司 一种图像识别方法、装置、设备及计算机存储介质
CN113096140B (zh) * 2021-04-15 2022-11-22 北京市商汤科技开发有限公司 实例分割方法及装置、电子设备及存储介质
CN113591539B (zh) * 2021-06-01 2024-04-16 中国电子科技集团公司第三研究所 一种目标识别方法、装置及可读存储介质
CN113469265A (zh) * 2021-07-14 2021-10-01 浙江大华技术股份有限公司 数据类别属性的确定方法及装置、存储介质、电子装置
CN113673576A (zh) * 2021-07-26 2021-11-19 浙江大华技术股份有限公司 图像检测方法、终端及其计算机可读存储介质
CN113989569B (zh) * 2021-10-29 2023-07-04 北京百度网讯科技有限公司 图像处理方法、装置、电子设备和存储介质
CN115239968A (zh) * 2022-07-25 2022-10-25 首都师范大学 一种图像处理方法、装置、计算机设备及存储介质
CN115620496B (zh) * 2022-09-30 2024-04-12 北京国电通网络技术有限公司 应用于输电线路的故障报警方法、装置、设备和介质
CN116152246B (zh) * 2023-04-19 2023-07-25 之江实验室 一种图像识别方法、装置、设备及存储介质
CN116935363B (zh) * 2023-07-04 2024-02-23 东莞市微振科技有限公司 刀具识别方法、装置、电子设备及可读存储介质
CN117499596A (zh) * 2023-11-15 2024-02-02 岳阳华润燃气有限公司 一种基于智能ar眼镜的燃气场站巡检系统及方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845525A (zh) * 2016-12-28 2017-06-13 上海电机学院 一种基于底层融合特征的深度置信网络图像分类协议
CN108681746A (zh) * 2018-05-10 2018-10-19 北京迈格威科技有限公司 一种图像识别方法、装置、电子设备和计算机可读介质
AU2019100354A4 (en) * 2019-04-04 2019-05-16 Chen, Mingjie Mr An animal image search system based on convolutional neural network
CN109829459A (zh) * 2019-01-21 2019-05-31 重庆邮电大学 基于改进ransac的视觉定位方法
CN110414544A (zh) * 2018-04-28 2019-11-05 杭州海康威视数字技术股份有限公司 一种目标状态分类方法、装置及系统
WO2019221551A1 (ko) * 2018-05-18 2019-11-21 오드컨셉 주식회사 이미지 내 객체의 대표 특성을 추출하는 방법, 장치 및 컴퓨터 프로그램
CN111368893A (zh) * 2020-02-27 2020-07-03 Oppo广东移动通信有限公司 图像识别方法、装置、电子设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109074499B (zh) * 2016-04-12 2020-06-02 北京市商汤科技开发有限公司 用于对象重识别的方法和系统
CN107393554B (zh) * 2017-06-20 2020-07-10 武汉大学 一种声场景分类中融合类间标准差的特征提取方法
CN109685115B (zh) * 2018-11-30 2022-10-14 西北大学 一种双线性特征融合的细粒度概念模型及学习方法
CN110348387B (zh) * 2019-07-12 2023-06-27 腾讯科技(深圳)有限公司 一种图像数据处理方法、装置以及计算机可读存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845525A (zh) * 2016-12-28 2017-06-13 上海电机学院 一种基于底层融合特征的深度置信网络图像分类协议
CN110414544A (zh) * 2018-04-28 2019-11-05 杭州海康威视数字技术股份有限公司 一种目标状态分类方法、装置及系统
CN108681746A (zh) * 2018-05-10 2018-10-19 北京迈格威科技有限公司 一种图像识别方法、装置、电子设备和计算机可读介质
WO2019221551A1 (ko) * 2018-05-18 2019-11-21 오드컨셉 주식회사 이미지 내 객체의 대표 특성을 추출하는 방법, 장치 및 컴퓨터 프로그램
CN109829459A (zh) * 2019-01-21 2019-05-31 重庆邮电大学 基于改进ransac的视觉定位方法
AU2019100354A4 (en) * 2019-04-04 2019-05-16 Chen, Mingjie Mr An animal image search system based on convolutional neural network
CN111368893A (zh) * 2020-02-27 2020-07-03 Oppo广东移动通信有限公司 图像识别方法、装置、电子设备及存储介质

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744721A (zh) * 2021-09-07 2021-12-03 腾讯音乐娱乐科技(深圳)有限公司 模型训练方法、音频处理方法、设备及可读存储介质
CN113744721B (zh) * 2021-09-07 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 模型训练方法、音频处理方法、设备及可读存储介质
CN113922500A (zh) * 2021-09-17 2022-01-11 国网山西省电力公司输电检修分公司 一种输电线路状态多源监测数据接入方法及装置
CN113657414A (zh) * 2021-10-19 2021-11-16 广州微林软件有限公司 一种物体识别方法
CN114022960A (zh) * 2022-01-05 2022-02-08 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质
CN114022960B (zh) * 2022-01-05 2022-06-14 阿里巴巴达摩院(杭州)科技有限公司 模型训练和行为识别方法、装置、电子设备以及存储介质
CN115170250A (zh) * 2022-09-02 2022-10-11 杭州洋驼网络科技有限公司 一种电商平台的物品信息管理方法以及装置
CN115495712A (zh) * 2022-09-28 2022-12-20 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115495712B (zh) * 2022-09-28 2024-04-16 支付宝(杭州)信息技术有限公司 数字作品处理方法及装置
CN115578584B (zh) * 2022-09-30 2023-08-29 北京百度网讯科技有限公司 图像处理方法、图像处理模型的构建和训练方法
CN115578584A (zh) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 图像处理方法、图像处理模型的构建和训练方法
CN115761239A (zh) * 2023-01-09 2023-03-07 深圳思谋信息科技有限公司 一种语义分割方法及相关装置
CN115761239B (zh) * 2023-01-09 2023-04-28 深圳思谋信息科技有限公司 一种语义分割方法及相关装置
CN116612287B (zh) * 2023-07-17 2023-09-22 腾讯科技(深圳)有限公司 图像识别方法、装置、计算机设备和存储介质
CN116612287A (zh) * 2023-07-17 2023-08-18 腾讯科技(深圳)有限公司 图像识别方法、装置、计算机设备和存储介质
CN117152027A (zh) * 2023-10-31 2023-12-01 广东中科凯泽信息科技有限公司 一种基于图像处理和人工智能识别的智能望远镜
CN117152027B (zh) * 2023-10-31 2024-02-09 广东中科凯泽信息科技有限公司 一种基于图像处理和人工智能识别的智能望远镜

Also Published As

Publication number Publication date
CN111368893B (zh) 2023-07-25
CN111368893A (zh) 2020-07-03

Similar Documents

Publication Publication Date Title
WO2021169723A1 (zh) 图像识别方法、装置、电子设备及存储介质
WO2022033150A1 (zh) 图像识别方法、装置、电子设备及存储介质
CN109815770B (zh) 二维码检测方法、装置及系统
CN111209970B (zh) 视频分类方法、装置、存储介质及服务器
US11494886B2 (en) Hierarchical multiclass exposure defects classification in images
CN112651438A (zh) 多类别图像的分类方法、装置、终端设备和存储介质
WO2022121485A1 (zh) 图像的多标签分类方法、装置、计算机设备及存储介质
CN111368636B (zh) 目标分类方法、装置、计算机设备和存储介质
CN111126140A (zh) 文本识别方法、装置、电子设备以及存储介质
WO2022033264A1 (zh) 人体特征点的筛选方法、装置、电子设备以及存储介质
CN112418327A (zh) 图像分类模型的训练方法、装置、电子设备以及存储介质
WO2021129466A1 (zh) 检测水印的方法、装置、终端及存储介质
US20210217443A1 (en) Film-making using style transfer
WO2021238586A1 (zh) 一种训练方法、装置、设备以及计算机可读存储介质
CN113505797B (zh) 模型训练方法、装置、计算机设备和存储介质
WO2022028147A1 (zh) 图像分类模型训练方法、装置、计算机设备及存储介质
CN111126389A (zh) 文本检测方法、装置、电子设备以及存储介质
CN109963072B (zh) 对焦方法、装置、存储介质及电子设备
CN111814913A (zh) 图像分类模型的训练方法、装置、电子设备及存储介质
CN111340213B (zh) 神经网络的训练方法、电子设备、存储介质
CN113887447A (zh) 对象分类模型的训练方法、对象分类预测方法及装置
CN114612728A (zh) 模型训练方法、装置、计算机设备及存储介质
CN112069338A (zh) 图片处理方法、装置、电子设备及存储介质
WO2021081945A1 (zh) 一种文本分类方法、装置、电子设备及存储介质
CN114898266A (zh) 训练方法、图像处理方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21761894

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21761894

Country of ref document: EP

Kind code of ref document: A1