WO2022199500A1 - Model training method, scene recognition method, and related device - Google Patents

Model training method, scene recognition method, and related device (一种模型训练方法、场景识别方法及相关设备)

Info

Publication number
WO2022199500A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
scene
neural network
convolutional neural
scene recognition
Prior art date
Application number
PCT/CN2022/081883
Other languages
English (en)
French (fr)
Inventor
戚向涛
刘艳
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to US 18/551,258, published as US20240169687A1
Priority to EP 22774161.8, published as EP4287068A1
Publication of WO2022199500A1

Classifications

    • G06V 10/273 Segmentation of patterns in the image field; removing elements interfering with the pattern to be recognised
    • G06V 10/764 Image or video recognition using classification, e.g. of video objects
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Image or video recognition using neural networks
    • G06V 20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06N 3/094 Adversarial learning
    • H04R 1/1041 Earpieces/earphones: mechanical or electronic switches, or control elements
    • H04R 1/1083 Earpieces/earphones: reduction of ambient noise
    • H04R 2430/01 Aspects of volume control, not necessarily automatic, in sound systems
    • H04R 2460/01 Hearing devices using active noise cancellation

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a model training method, a scene recognition method, and related devices in the field of computer vision within artificial intelligence applications.
  • The application fields of artificial intelligence (AI) include the field of computer vision, and scene recognition is an important branch technology in the field of computer vision.
  • Scene recognition refers to identifying (or "classifying") the environment reflected in an image, or the environment in which the subject (a person or object) is located. It aims to obtain scene information by extracting and analyzing the features of the scene image, thereby identifying the scene to which the image belongs.
  • Embodiments of the present application provide a model training method, a scene recognition method, and related equipment, which are used to improve the accuracy of scene recognition.
  • In a first aspect, the present application provides a model training method, which is applied to a training device.
  • The method includes: the training device obtains a first training data set that includes a plurality of first images. Each first image is a scene image, for example, an image of an "office" scene, and may include images of objects unrelated to scene recognition. The training device uses an object detection model to identify a first area in the first image, where the first area is an image area unrelated to scene recognition; the training device then performs mask processing on the first area to obtain a third image, where the mask processing occludes the first area. Next, the training device obtains multiple sample object images generated by an image generation model, where the sample object images are images of objects unrelated to scene recognition, and places the multiple sample object images into the masked first area of the third image to obtain multiple target images. On the one hand, in terms of data volume, combining images in this way expands the number of images in the first training data set.
  • On the other hand, the third image retains the background image related to scene recognition, while the sample object images generated by the image generation model provide the differences between the newly synthesized target images.
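  • As an illustration of this synthesis step, the following is a minimal sketch that pastes generated object images into the masked first area; it assumes NumPy image arrays and a rectangular first area given as (x1, y1, x2, y2), and the function name is illustrative rather than taken from the application.

```python
import numpy as np

def synthesize_target_images(third_image, box, sample_objects):
    """Paste generated sample-object images into the masked first area of the
    third image, producing one target image per generated object."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    targets = []
    for obj in sample_objects:
        # Crude nearest-neighbour resize of the generated object to the masked region.
        ys = np.linspace(0, obj.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, obj.shape[1] - 1, w).astype(int)
        patch = obj[ys][:, xs]
        target = third_image.copy()
        target[y1:y2, x1:x2] = patch  # replace the masked first area with the object
        targets.append(target)
    return targets
```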
  • The training device trains the first convolutional neural network with the data set of target images and the second convolutional neural network with the data set of third images to obtain a scene recognition model. The scene recognition model includes the first convolutional neural network and the second convolutional neural network.
  • Because the training device trains the first convolutional neural network with a large number of newly synthesized target images, in which objects irrelevant to scene recognition have been introduced into scene images of the same category, the scene recognition model pays less attention to the differing features within those target images. This reduces the negative impact that intra-class differences between scene images of the same category have on the classification performance of the scene recognition model. The second convolutional neural network is trained with the images related to scene recognition (that is, the third images), so it can more easily learn the distinguishing features between different scene categories, which reduces the adverse effect of inter-class similarity on the classification performance of the scene recognition model. The scene recognition model obtained by the training device therefore reduces both the negative impact of intra-class differences within a scene category and the negative impact of inter-class similarity between different scene categories, and this improves the accuracy of scene recognition.
  • The method further includes: the training device inputs the first image into an image recognition model, where the image recognition model is a general-purpose image recognition model (used for both image recognition and scene recognition). The training device uses the image recognition model to obtain a first classification result of the first image and a heat map of the first image, where the heat map shows the area in which a target object is located, the image features of the target object are image features irrelevant to scene recognition, and the category indicated by the first classification result is a non-scene category or an incorrect scene category. The training device performs mask processing on a second area of the first image, that is, the area other than the first area where the target object is located, to obtain a second image (an image containing only the target object). The training device then uses a second training data set to train a first model to obtain the object detection model; the second training data set includes a plurality of sample data, each of which includes input data and output data, where the input data is the second image and the output data is position coordinates indicating the area where the target object is located.
  • Through the heat map of the first image, the training device can determine the area of the first image that has the greatest impact on the classification decision made by the image recognition model, which is the position of the target object irrelevant to scene recognition. The training device trains the first model (for example, a neural network) with the second images to obtain the object detection model. The object detection model is used to identify which part of a scene image is irrelevant to scene recognition, and from that which part of the scene image is relevant to scene recognition.
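  • A minimal sketch of how such a training sample (second image plus bounding-box label) could be derived from the heat map is shown below; it assumes the heat map is a 2-D array aligned with the first image and normalized to [0, 1], and the threshold value and function name are assumptions for illustration.

```python
import numpy as np

def second_image_and_box(first_image, heat_map, thresh=0.6):
    """Keep only the target object indicated by the heat map: mask the second
    area and return the resulting second image together with the bounding box
    of the hot region, which serves as the label for the object detection model."""
    hot = heat_map >= thresh                      # first area: scene-irrelevant target object
    ys, xs = np.where(hot)
    x1, y1, x2, y2 = xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
    second_image = np.zeros_like(first_image)     # occlude the second area
    second_image[y1:y2, x1:x2] = first_image[y1:y2, x1:x2]
    return second_image, (int(x1), int(y1), int(x2), int(y2))
```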
  • The method further includes: the training device trains a generative adversarial network (GAN) with the second images to obtain the image generation model. The image generation model is used to generate a large number of sample object images irrelevant to scene recognition, so that target images for training the scene recognition model can be synthesized. This expands the number of images in the training set, and the generated object images serve as the differing content between newly synthesized target images of the same category, which reduces the adverse impact of intra-class differences on the classification performance of the scene recognition model and improves its performance.
  • Both the target image and the third image correspond to a label of the first category. Training the first convolutional neural network with the target image and training the second convolutional neural network with the third image may specifically include: the training device extracts the image features of the target image through the first convolutional layers of the first convolutional neural network, extracts the image features of the third image through the second convolutional layers of the second convolutional neural network, and outputs the image features of the third image to the first convolutional layers to be fused with the image features of the target image. The fused image features are then passed to the output layer of the first convolutional neural network (a fully connected layer and a classifier), which outputs the label of the first category.
  • The image features of the third image extracted by the second convolutional neural network are the features of the first image that are related to scene recognition, so the second convolutional neural network acts as an attention model: the features it extracts are fused into the last convolutional layer of the first convolutional neural network, which makes the scene recognition model pay more attention to the image features related to scene recognition. Because the second convolutional neural network is trained with images related to scene recognition, it can more easily learn the distinguishing features between different scene categories, which reduces the adverse effect of inter-class similarity on the classification performance of the scene recognition model.
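  • A minimal PyTorch sketch of this two-branch structure and one training step is given below; the layer sizes, the use of concatenation for feature fusion, and the number of classes are assumptions, since the application does not fix these details.

```python
import torch
import torch.nn as nn

class SceneRecognitionModel(nn.Module):
    """Two parallel CNN branches; the second branch's features are fused into the
    last convolutional stage of the first branch before the output layer."""
    def __init__(self, num_classes=5):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.branch1 = nn.Sequential(block(3, 32), block(32, 64))  # first convolutional neural network
        self.branch2 = nn.Sequential(block(3, 32), block(32, 64))  # second convolutional neural network
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64 * 2, num_classes)                   # fully connected layer + classifier

    def forward(self, img_a, img_b):
        f1 = self.branch1(img_a)                 # features of the target image
        f2 = self.branch2(img_b)                 # features of the third image
        fused = torch.cat([f1, f2], dim=1)       # fusion at the last conv stage (concatenation assumed)
        return self.fc(self.pool(fused).flatten(1))

# One training step: both inputs carry the same first-category label.
model = SceneRecognitionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
target_imgs = torch.randn(8, 3, 224, 224)        # batch of target images
third_imgs = torch.randn(8, 3, 224, 224)         # corresponding third images
labels = torch.randint(0, 5, (8,))
loss = loss_fn(model(target_imgs, third_imgs), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```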
  • In a second aspect, an embodiment of the present application provides a scene recognition method, which is applied to an execution device.
  • The method includes: the execution device acquires a first scene image to be recognized; the execution device uses the object detection model to detect a first area in the first scene image where an object unrelated to scene recognition is located; the execution device performs mask processing on the first area to obtain a second scene image; the execution device then inputs the first scene image into the first convolutional neural network of the scene recognition model, inputs the second scene image into the second convolutional neural network of the scene recognition model, and outputs a classification result with the scene recognition model. The first convolutional neural network is trained with the data set of target images, and the second convolutional neural network is trained with the data set of third images. A target image is obtained by placing sample object images generated by the image generation model into the first area of a third image, and a third image is obtained by using the object detection model to identify the first area of a first image that is unrelated to scene recognition and then performing mask processing on that area; the first image is an image in the training data set.
  • The first convolutional neural network is obtained by learning from target images, which are synthesized from the same background image and different object images (images of objects irrelevant to scene recognition). This reduces the attention that the scene recognition model pays to the image features of the first scene image that are unrelated to scene recognition, and thereby reduces the negative impact of intra-class differences between scene images of the same category on the classification performance of the scene recognition model. The second convolutional neural network is obtained by learning from images related to scene recognition, so the scene recognition model extracts and attends to the image features of the part of the first scene image related to scene recognition, which reduces the negative impact of inter-class similarity between scene images of different categories on classification performance. Together, this greatly improves the accuracy of the classification result of the first scene image.
  • That the execution device inputs the first scene image into the first convolutional neural network of the scene recognition model, inputs the second scene image into the second convolutional neural network of the scene recognition model, and outputs the classification result with the scene recognition model may specifically include: the execution device extracts the image features of the first scene image through the first convolutional layers of the first convolutional neural network, extracts the image features of the second scene image through the second convolutional layers of the second convolutional neural network, and outputs the image features of the second scene image to the first convolutional layers to be fused with the image features of the first scene image, so that the scene recognition model attends to the global information. The first convolutional neural network then passes the fused image features to its output layer (a fully connected layer and a classifier), which outputs the classification result.
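  • The inference order described above can be sketched as follows, assuming PyTorch tensors in NCHW layout, a detector that returns a rectangular box, and the two-branch model from the earlier sketch; the helper names are illustrative.

```python
import torch

def recognize_scene(first_scene_image, object_detector, scene_model):
    """Detect the scene-irrelevant first area, mask it to obtain the second
    scene image, then feed both images into the two-branch scene recognition model."""
    x1, y1, x2, y2 = object_detector(first_scene_image)    # first area coordinates
    second_scene_image = first_scene_image.clone()
    second_scene_image[:, :, y1:y2, x1:x2] = 0              # mask the first area
    with torch.no_grad():
        logits = scene_model(first_scene_image, second_scene_image)
    return logits.argmax(dim=1)                              # classification result
```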
  • The execution device adjusts the noise reduction mode of the earphone to the first noise reduction mode according to the classification result. In this way, the execution device can recognize the scene image and automatically adjust the earphone's noise reduction mode according to the classification result obtained by scene recognition, without requiring the user to manually set the earphone's noise reduction mode.
  • The method further includes: the execution device sends the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust the noise reduction mode of the earphone to the first noise reduction mode. In this way, the execution device recognizes the scene image and sends the classification result to the user equipment, and the user equipment automatically adjusts the earphone's noise reduction mode according to the classification result obtained by scene recognition, without requiring the user to manually set the earphone's noise reduction mode.
  • The method further includes: the execution device adjusts its own system volume to the first volume value according to the classification result.
  • the execution device can adaptively adjust the system volume value according to the classification result of the scene image, without requiring the user to frequently adjust the system volume value of the mobile phone according to different environments.
  • The method further includes: the execution device sends the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust its system volume to the first volume value. The user equipment can thus adjust the system volume of the mobile phone automatically according to the classification result obtained by scene recognition, without requiring the user to adjust it manually, which improves the user experience.
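  • A hypothetical mapping from classification results to earphone noise reduction modes and system volume values is sketched below; the scene names, modes, volume values, and device APIs are assumptions for illustration, since in the application these correspondences are configured on the setting interfaces described later.

```python
# Illustrative correspondence tables; real values come from the user's settings.
NOISE_REDUCTION_MODE = {"airport": "deep", "office": "light", "cafe": "balanced"}
SYSTEM_VOLUME = {"airport": 80, "office": 40, "cafe": 60}

def apply_classification_result(scene, earphone, device):
    """Adjust the earphone noise reduction mode and the device system volume
    according to the scene classification result (hypothetical APIs)."""
    if scene in NOISE_REDUCTION_MODE:
        earphone.set_noise_reduction(NOISE_REDUCTION_MODE[scene])  # first noise reduction mode
    if scene in SYSTEM_VOLUME:
        device.set_volume(SYSTEM_VOLUME[scene])                    # first volume value
```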
  • Acquiring the first scene image to be recognized may include: the execution device receives the first scene image to be recognized sent by the user equipment; or the execution device collects the first scene image to be recognized through a camera or an image sensor.
  • an embodiment of the present application provides a model training device, including:
  • an acquisition module configured to acquire a first training data set, where the first training data set includes a plurality of first images
  • a processing module, configured to: identify a first area in the first image by using an object detection model, where the first area is an image area irrelevant to scene recognition; perform mask processing on the first area to obtain a third image; obtain multiple sample object images generated by an image generation model, where the sample object images are images of objects unrelated to scene recognition; place the multiple sample object images into the masked first area of the third image to obtain multiple target images; and train the first convolutional neural network with the data set of target images and the second convolutional neural network with the data set of third images to obtain a scene recognition model, where the scene recognition model includes the first convolutional neural network and the second convolutional neural network.
  • The processing module is further configured to: input the first image into an image recognition model, and use the image recognition model to obtain the first classification result of the first image and the heat map of the first image, where the heat map is used to show the area where the target object is located, the image features of the target object are image features irrelevant to scene recognition, and the category indicated by the first classification result is a non-scene category or an incorrect scene category; perform mask processing on the second area of the first image other than the first area where the target object is located, to obtain a second image; and use the second training data set to train the first model to obtain the object detection model, where the second training data set includes a plurality of sample data, each sample data includes input data and output data, the input data is the second image, the output data is position coordinates, and the position coordinates indicate the area where the target object is located.
  • The processing module is further configured to train a generative adversarial network (GAN) with the second images to obtain the image generation model.
  • Both the target image and the third image correspond to a label of the first category. The processing module is further configured to: extract the image features of the target image through the first convolutional layers of the first convolutional neural network, extract the image features of the third image through the second convolutional layers of the second convolutional neural network, and output the image features of the third image to the first convolutional layers to be fused with the image features of the target image; the output layer of the first convolutional neural network outputs the label of the first category according to the fused image features.
  • an embodiment of the present application provides a scene recognition device, including:
  • an acquisition module for acquiring the first scene image to be identified
  • a processing module, configured to: detect, by using the object detection model, a first area in the first scene image where an object unrelated to scene recognition is located; perform mask processing on the first area to obtain a second scene image; input the first scene image into the first convolutional neural network of the scene recognition model and the second scene image into the second convolutional neural network of the scene recognition model, and output a classification result with the scene recognition model. The first convolutional neural network is trained with the data set of target images, and the second convolutional neural network is trained with the data set of third images; a target image is obtained by placing sample object images generated by the image generation model into the first area of a third image, and a third image is obtained by using the object detection model to identify the first area of a first image that is unrelated to scene recognition and then performing mask processing on that area; the first image is an image in the training data set.
  • The processing module is further configured to: extract the image features of the first scene image through the first convolutional layers of the first convolutional neural network, extract the image features of the second scene image through the second convolutional layers of the second convolutional neural network, and output the image features of the second scene image to the first convolutional layers to be fused with the image features of the first scene image; the output layer of the first convolutional neural network outputs the classification result according to the fused image features.
  • The device further includes a sending module. If the classification result indicates a first scene, the first scene has a corresponding relationship with a first noise reduction mode of the earphone. The processing module is further configured to adjust the noise reduction mode of the earphone to the first noise reduction mode; or, the sending module is configured to send the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust the noise reduction mode of the earphone to the first noise reduction mode.
  • The processing module is further configured to adjust the system volume of the execution device to the first volume value according to the classification result; or, the sending module is further configured to send the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust its system volume to the first volume value.
  • the acquiring module is further specifically configured to: receive the first scene image to be identified sent by the user equipment; or collect the first scene image to be identified through a camera or an image sensor.
  • An embodiment of the present application provides an electronic device, including a processor coupled to a memory. The memory is used to store a program or instructions; when the program or instructions are executed by the processor, the electronic device is caused to perform the method described in any one of the first aspect above, or the method described in any one of the second aspect above.
  • An embodiment of the present application provides a computer program product. The computer program product includes computer program code which, when executed by a computer, causes the computer to implement the method described in any one of the first aspect above, or the method described in any one of the second aspect above.
  • An embodiment of the present application provides a computer-readable storage medium for storing a computer program or instructions which, when executed, cause a computer to perform the method described in any one of the first aspect above, or the method described in any one of the second aspect above.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body architecture in an embodiment of the application
  • FIGS. 2A and 2B are schematic diagrams of a system architecture in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the original image and the heat map of the original image
  • FIG. 4 is a schematic flowchart of steps for training an object detection model and an image generation model in an embodiment of the present application
  • FIG. 5 is a schematic diagram of obtaining a second image after processing the first image mask in an embodiment of the present application
  • FIG. 6 is a schematic diagram of the architecture of a scene recognition model in an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of steps for training a scene recognition model in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of obtaining a third image after masking the first image in an embodiment of the present application.
  • FIG. 9 is an architectural diagram of an object detection model and a scene recognition model in an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of steps of an embodiment of a scene recognition method in an embodiment of the present application.
  • FIG. 11A, FIG. 11B and FIG. 11C are schematic diagrams of the setting interface of the corresponding relationship between the noise reduction mode of the earphone and the scene in the embodiment of the application;
  • FIG. 12 is a schematic diagram of a scene for modifying the correspondence between a scene and a noise reduction mode in an embodiment of the present application
  • FIG. 13 is a schematic diagram of a setting interface of a corresponding relationship between a scene and a system volume value in an embodiment of the application;
  • FIG. 14 is a schematic structural diagram of an embodiment of a model training apparatus in an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an embodiment of a neural network processor in an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of an electronic device in an embodiment of the application.
  • FIG. 17 is a schematic structural diagram of an embodiment of a scene recognition apparatus in an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of another electronic device in an embodiment of the present application.
  • the present application relates to the field of computer vision in the application field of artificial intelligence, in particular to scene recognition in the field of computer vision. Firstly, the main frame of artificial intelligence is explained.
  • Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing technology implementation) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside world is achieved through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks and network-related platform assurance and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and this data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, and the productization of intelligent information decision-making and implementation of applications. Its application areas mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, etc.
  • Scene recognition is an important branch technology in the field of computer vision.
  • Scene recognition refers to the identification (or “classification") of the environment that can be represented in the image or the environment where the subject (person or object) is located.
  • Unlike subject- or "object"-centric image recognition, scene recognition focuses on the global information of an image. However, a recognition device tends to use objects that are unrelated to the environment as the key features for recognizing a scene, which leads to two difficulties in scene recognition.
  • The first difficulty is the difference between scene images of the same scene category, that is, intra-class differences.
  • For example, image A is a photo of a person wearing a mask at the airport, and image B is a photo of the same person at the airport without a mask. Both image A and image B are "airport" scenes, but the recognition device may take the masked "face" as the key feature of image recognition and obtain a wrong classification result (such as "hospital").
  • The second difficulty is the similarity between different scene categories, that is, inter-class similarity. For example, image C is an image of seats inside a high-speed train, and image D is an image of seats inside an airport. The recognition device tends to use the seats as the key feature for recognizing the scene; when performing scene recognition on image D, it takes the seats in image D as the key feature and obtains an incorrect classification result (such as "high-speed rail").
  • Intra-class differences and inter-class similarity lead to a decrease in the accuracy of scene recognition.
  • an embodiment of the present application provides a scene image recognition method, which is used to reduce intra-class differences and inter-class similarities of scene images, thereby improving the accuracy of scene recognition.
  • FIGS. 2A and 2B show a system architecture. The data collection device 210 is used to collect images, and the collected images are stored in the database 230 as training data.
  • The training device 220 generates an object detection model and a scene recognition model based on the image data maintained in the database 230. The object detection model is used to detect the area "unrelated to scene (environment) recognition" in the image to be recognized, and the scene recognition model is used to recognize the scene image to be recognized.
  • the training device 220 is implemented by one or more servers, and optionally, the training device 220 is implemented by one or more terminal devices.
  • the execution device 240 acquires the object detection model and the scene recognition model from the training device 220 , and loads the object detection model and the scene recognition model into the execution device 240 . After the execution device 240 acquires the scene image to be recognized, the object detection model and the scene recognition model can be used to recognize the scene image to be recognized to obtain a classification result.
  • the execution device 240 is a terminal device.
  • the execution device 240 includes but is not limited to a mobile phone, a personal computer, a tablet computer, a wearable device (such as a watch, a wristband, a VR/AR device), a vehicle terminal, and the like.
  • the system architecture further includes user equipment 250, and the user equipment 250 includes but is not limited to mobile phones, personal computers, tablet computers, wearable devices (such as watches, wristbands, VR/AR devices) and Vehicle terminal, etc.
  • the execution device 240 is implemented by one or more servers.
  • the user equipment 250 may interact with the execution device 240 through any communication mechanism or communication standard communication network, and the communication network may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • The user equipment 250 is used to collect the scene image to be recognized and send it to the execution device 240. The execution device 240 receives the scene image to be recognized from the user equipment 250, recognizes it using the object detection model and the scene recognition model to obtain a classification result, and sends the classification result to the user equipment 250.
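  • The exchange between the user equipment 250 and the execution device 240 could, for example, run over a simple REST interface as sketched below; the endpoint, payload format, and use of the requests library are assumptions, not part of the application.

```python
import requests

def classify_remotely(image_path, server_url="http://execution-device.example/recognize"):
    """User equipment side: send the scene image to the execution device and
    receive the classification result (hypothetical endpoint and response format)."""
    with open(image_path, "rb") as f:
        resp = requests.post(server_url, files={"image": f}, timeout=5)
    resp.raise_for_status()
    return resp.json()["classification"]
```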
  • the training device 220 and the execution device 240 may be the same device, for example, a server (or server cluster) is used to implement both the functions of the training device 220 and the execution device 240 .
  • the embodiment of the present application provides a model training method, and the method is applied to the training device in the above-mentioned system architecture.
  • The training device acquires a first training data set that includes a plurality of first images, uses an object detection model to identify the target objects in a first image that are unrelated to scene recognition, and performs mask processing on the area where those target objects are located to obtain a third image (that is, an image that contains only content related to scene recognition).
  • the training device uses the image generation model to generate a large number of sample object images irrelevant to scene recognition, and combines the sample object images and the third image to obtain a combined target image.
  • the training device inputs the combined target image into the first convolutional neural network for training, and inputs the third image into the second convolutional neural network for training to obtain a scene recognition model, where the scene recognition model includes the first convolutional neural network and the second convolutional neural network.
  • The first convolutional neural network is trained with a large number of newly synthesized target images, in which object images irrelevant to scene recognition are introduced into scene images of the same category, so the scene recognition model pays less attention to the differing features in the target images; this reduces the negative impact of the intra-class differences of scene images of the same category on the classification performance of the scene recognition model. The second convolutional neural network is trained with the images related to scene recognition, so it can more easily learn the distinguishing features between different scene categories; this reduces the negative impact of the inter-class similarity of different scene categories on the classification performance of the scene recognition model and improves the accuracy of scene recognition.
  • an embodiment of the present application provides a scene recognition method, and the method is applied to the execution device in the above-mentioned system architecture.
  • the execution device collects an image of the first scene to be identified through a camera and/or an image sensor. Then, the execution device uses the object detection model obtained by the above-mentioned training device to detect the first region in the first scene image where the objects irrelevant to scene recognition are located.
  • the execution device performs mask processing on the first area to obtain a second scene image.
  • The execution device inputs the first scene image and the second scene image into the scene recognition model obtained by the training device, and uses the scene recognition model to output a classification result.
  • Scene recognition refers to classifying the environment reflected in an image, or the environment in which the subject (a person or object) in the image is located.
  • The categories of scene images can include, but are not limited to, the "airport" category, the "high-speed rail" category, the "hospital" category, the "office" category, the "cafe" category, and so on.
  • the category of the scene image may also be, for example, an "indoor scene” category, an “outdoor scene” category, or a "noisy scene” category, a "quiet scene” category, a “monitoring scene” category, and the like.
  • The categories of scene images are configured according to the specific application scenario and are not specifically limited here.
  • the intra-class difference of scene pictures refers to the difference between scene pictures of the same category, which causes pictures with large intra-class differences to be easily misclassified into other categories.
  • For example, an image of an office scene includes a "face" image; because of the introduced differing information (the image of the human face), the office picture containing the "face" is misclassified into another category, that is, into a "non-office" category.
  • the inter-class similarity of scene images refers to the fact that different categories of scene images have similar object images, resulting in the misclassification of different categories of scene images into one category.
  • For example, pictures of the inside of a high-speed train and the inside of an airport both include "chairs". Because of the similarity of the "chairs", these pictures are easily classified into the same category, for example both into the "high-speed rail" category or both into the "airport" category.
  • A heat map (gradient-weighted class activation map, CAM) is a tool that helps visualize a convolutional neural network (CNN); it describes which local positions in an image lead the CNN to its final classification decision. The CAM is a two-dimensional feature grid related to the output category, and each grid position indicates its importance to that output category. FIG. 3 is a schematic diagram of an original image and the heat map of the original image; the relevance of each grid position in the image to the classification result is presented in the form of a heat map. The image in FIG. 3 includes a cat and a dog, and the CNN classifies the image into the "dog" category. It can be seen from the heat map that the CNN takes the features at the position of the dog's face as the key features of classification and therefore classifies the image into the "dog" category.
  • The basic principle of the heat map is briefly described below. An image is input into the convolutional neural network, which extracts image features; global average pooling (GAP) is performed on the last feature map of the convolutional neural network to compute the average value of each channel; the gradient of the output of the largest category with respect to the last feature map is then calculated, and this gradient is visualized on the original image.
  • the heatmap can show which part of the high-level features extracted by the convolutional neural network has the greatest impact on the final classification decision.
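  • A minimal sketch of this computation is given below in PyTorch, assuming a CNN classifier and a handle to the module that produces the last feature map; it follows the standard Grad-CAM formulation (channel weights obtained by globally averaging the gradients), which is one way to realize the principle described above.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, image):
    """Compute a heat map: gradient of the top-class score w.r.t. the last
    feature map, averaged per channel and used to weight that feature map."""
    feats, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image)                              # image: 1 x 3 x H x W
    cls = logits.argmax(dim=1).item()                  # largest category
    logits[0, cls].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # GAP over the gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted feature map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()        # heat map aligned with the original image
```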
  • A generative adversarial network (GAN) is used to generate sample data. In this application, the GAN is used to generate images of the objects in an image that are not relevant to scene recognition.
  • GAN includes a generative model (G) and a discriminative model (D).
  • the generative model is used to generate a sample similar to the real training data, and the goal is to be as similar to the real sample as possible.
  • The discriminative model is a binary classifier used to estimate the probability that a sample comes from the real training data. If the discriminative model judges that the sample comes from the real training data, it outputs a high probability; if it judges that the sample was generated by the generative model, it outputs a low probability.
  • the goal of the generative model is to find ways to generate samples that are the same as the real samples, so that the discriminant model cannot distinguish them.
  • The goal of the discriminative model is to find ways to detect the samples generated by the generative model. Through the adversarial game between G and D, the samples generated by the GAN become close to the real samples, so a large amount of sample data can be obtained.
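  • A minimal PyTorch sketch of the two GAN components is given below; the latent size, layer widths, and 64x64 output resolution are assumptions for illustration only.

```python
import torch.nn as nn

latent_dim = 100

# Generative model G: latent noise -> fake object image (e.g. a "face" or "chair").
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 3 * 64 * 64), nn.Tanh(),
    nn.Unflatten(1, (3, 64, 64)),
)

# Discriminative model D: image -> probability that it comes from the real training samples.
discriminator = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)
```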
  • This application consists of two parts: the first part is the model training process, and the second part is the execution (inference) process.
  • the following first describes the process of model training.
  • The model training process is performed by the training device.
  • the process of model training mainly involves three models: object detection model, image generation model and scene recognition model.
  • the training device acquires a first training data set, where the first training data set includes a plurality of first images (or referred to as "original images").
  • the data collection device collects images and stores the collected images in a database.
  • the training device obtains the first training data set from the database.
  • the data acquisition device is a device with an image sensor, such as a camera, a video camera, or a mobile phone.
  • The first training data set includes a large number of images of different categories, for example, A1 - the "airport" category, A2 - the "high-speed rail" category, A3 - the "subway" category, A4 - the "office" category, A5 - the "hospital" category, and so on; the specifics are not limited. It should be understood that the images in the first training data set can be classified in various ways according to different requirements, and the specific classification depends on the specific application scenario.
  • In order to distinguish an original image in the first training data set from the images obtained by processing it, the original image is referred to as a "first image".
  • the image obtained by masking the "image related to scene recognition” in the first image is called a “second image” (only the image of objects not related to scene recognition is retained).
  • the image obtained by masking the "object image not related to scene recognition” in the first image is called a “third image” (only the image related to scene recognition is retained).
  • the training device inputs the first image into the image recognition model, and uses the image recognition model to obtain the first classification result of the first image and the heat map of the first image.
  • the heat map is used to display the area where the target object is located, the image features of the target object are image features irrelevant to scene recognition, and the category indicated by the first classification result is a non-scene category or an incorrect scene category.
  • The image recognition model is a general-purpose object recognition model used to recognize target objects in an image.
  • For example, the first image is a scene image of "a person working in an office". When the first image is input into the general image recognition model, the model outputs "person" as the first classification result of the first image, and the heat map of the image gives the area that has the greatest impact on the classification decision of the image recognition model (that is, the area where the face is located). The general image recognition model pays more attention to the image features of the subject in the image, so the category indicated by the output classification result (such as "person") is a non-scene category or a wrong scene category.
  • The purpose of the above step S11 is to obtain the heat map of the first image, through which the position of a target object (such as a "face") irrelevant to scene recognition can be determined, so that an image that includes only the target object can be obtained (step S12 below), and the image of the area remaining after the target object is occluded in the first image can also be obtained (step S22 below).
  • the training device performs mask processing on the second area in the first image except the first area where the target object is located, to obtain a second image (ie, an image containing only the target object).
  • the first image is any image in the first training data set.
  • Each image in the first training data set is processed through the above steps S11 and S12, that is, a second training data set is obtained, and the second training data set includes a plurality of second images.
  • The "area where objects not related to scene recognition are located" is referred to as the "first area", and the "area related to scene recognition" is referred to as the "second area".
  • the first image A is used as an example for illustration.
  • The first image A is an image of an office scene that includes a "human face". The face is a target object irrelevant to scene recognition, so the first area 502 where the "face" is located is an area irrelevant to scene recognition.
  • the area other than the first area 502 in the first image A is the second area 503, and mask processing is performed on the second area 503 (for example, the pixel value of the second area is set to 0), and the obtained image is the second image A.
  • the training device uses the second training data set to train the first model to obtain an object detection model.
  • the object detection model is used to detect the first region where objects irrelevant to scene recognition are located in the input first image.
  • the first model may be a neural network model.
  • the second training data set includes a plurality of sample data, each sample data includes input data and output data, wherein the input data is the second image, and the output data is position coordinates, and the position coordinates are used to indicate the rectangular area where the target object is located.
  • the training device trains the GAN network through the second image to obtain an image generation model.
  • the training device uses the image generation model to generate multiple sample object images of the same category as the target object.
  • the process of optimizing (or training) the GAN network through the second image is as follows.
  • First, the discriminative model (D) is optimized: when the input comes from the real data, D adjusts its network so that it outputs 1; when the input comes from the data generated by G, D adjusts its network so that it outputs 0. Then, G optimizes its own network so that it outputs samples as similar to the real data as possible, and so that D outputs a high probability value after discriminating the generated samples.
  • the training process of G and D is carried out alternately. This confrontation process makes the images generated by G become more and more realistic, and D's ability to "fake counterfeiting" is also getting stronger and stronger.
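  • The alternating optimization described above can be sketched as follows, assuming generic generator and discriminator networks G and D and a binary cross-entropy objective; the hyperparameters and network interfaces are placeholders for illustration, not the patent's concrete choices.

```python
import torch
import torch.nn as nn

def train_gan_step(G, D, real_batch, opt_G, opt_D, z_dim=100):
    bce = nn.BCELoss()
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Optimize D: output 1 for real data, 0 for data generated by G.
    opt_D.zero_grad()
    fake = G(torch.randn(b, z_dim))
    loss_D = bce(D(real_batch), ones) + bce(D(fake.detach()), zeros)
    loss_D.backward()
    opt_D.step()

    # 2) Optimize G: make D assign a high probability to the generated samples.
    opt_G.zero_grad()
    loss_G = bce(D(fake), ones)
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```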
  • the second image A is an image of a "face”
  • the image generation model will generate a large number of "face” images.
  • the "face” image generated by the image generation model is not the “face” of someone in reality, but It is produced by the image generation model based on the learning of the second image A, and has all the characteristics of the real "face”.
  • the second image B is an image of a "chair”
  • the image generation model will generate a large number of "chair” images and so on.
  • the image generation model is obtained.
  • the above steps S13 and S14 are not limited in timing, and S13 and S14 can be performed synchronously, that is, the image generation model and the object detection model are obtained synchronously.
  • S13 is performed before step S14, that is, the object detection model is obtained first, and then the image generation model is obtained.
  • S13 is performed after step S14, that is, the image generation model is obtained first, and then the object detection model is obtained.
  • the scene recognition model includes two branch structures (or referred to as a backbone structure and a branch structure), and the two branch structures are two parallel sub-networks.
  • the two sub-networks are called the first convolutional neural network and the second convolutional neural network, respectively.
  • the first convolutional neural network includes a plurality of first convolutional layers, a first fully connected layer and a classifier. Among them, the first convolutional layer, the first fully connected layer and the classifier are connected in sequence.
  • the second convolutional neural network includes a plurality of second convolutional layers and a second fully connected layer.
  • the second fully connected layer is connected to the last one of the first convolutional layers.
  • for ease of description, the convolutional layers in the first convolutional neural network are called the "first convolutional layers", the convolutional layers in the second convolutional neural network are called the "second convolutional layers", the fully connected layer in the first convolutional neural network is referred to as the "first fully connected layer", and the fully connected layer in the second convolutional neural network is called the "second fully connected layer".
  • the training process of the scene recognition model is shown in steps S20 to S25 below.
  • the training device obtains a first training data set.
  • the first training dataset includes a plurality of first images (or "original images").
  • for this step, please refer to the description of step S10 in the above example corresponding to FIG. 5, which is not repeated here.
  • the training device inputs the first image into the object detection model, and uses the object detection model to identify the first region in the first image.
  • the first area is an image area that is not relevant for scene recognition.
  • the object detection model is the object detection model obtained in steps S11 to S13 in the example corresponding to the above-mentioned FIG. 4 .
  • the first image C has a foreground image of a human face and a background image of an office scene.
  • the first image C is input to the object detection model, and the object detection model outputs 4 coordinate points, the 4 coordinate points indicate the first area including the face, and the first area is an area irrelevant to scene recognition.
  • the training device performs mask processing on the first region to obtain a third image.
  • the area of the first image C including the face 501 is the first area 502
  • the area of the first image C other than the first area 502 is the second area 503 .
  • the first area 502 is masked to obtain a third image.
  • the function of mask processing is to block the first area 502.
  • the pixel value of the first area is set to "0", so that the third image only contains the image of the second area 503, that is, the third image mainly contains images related to scene recognition.
  • the training device acquires multiple sample object images generated by the image generation model.
  • a sample object image is an image of an object that is not relevant for scene recognition.
  • the image generation model generates a large number of sample object images according to the objects "unrelated to scene recognition" in each first image in the first training set. For this step, please refer to the description of S14 in the above example corresponding to FIG. 4 , which is not repeated here.
  • the training device replaces the multiple sample object images into the regions covered by the mask in the third image respectively, to obtain multiple target images.
  • the third image includes only the background image related to the scene recognition after the object image (also referred to as the "interference image") unrelated to the scene recognition has been occluded.
  • the first category is any one of multiple scene categories.
  • the first category takes the "office" category as an example.
  • the training device masks the area corresponding to the interference image "face" in the first image A to obtain the third image A. Then, a large number of different "face" images generated by the image generation model are placed into the area covered by the mask in the third image A and combined to obtain multiple target images (new images after combination).
  • the labels corresponding to the combined target images are still of the "office" class.
  • similarly, a third image B is obtained, which contains an area covered by the mask; a large number of "chair" images generated by the image generation model are then placed into the area covered by the mask in the third image B and combined to obtain multiple target images, and the labels corresponding to these target images are still "office".
  • the training device can also replace the "chair” image generated by the image generation model into the area covered by the mask in the third image A, and combine to obtain multiple target images.
  • the training device replaces the "face” image generated by the image generation model into the area covered by the mask in the third image B, and combines to obtain multiple target images.
  • the third image and the sample object image generated by the image generation model are combined, so that a large number of new target images can be obtained.
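  • Conceptually, each target image is obtained by pasting a generated sample object image into the mask-covered region of a third image while keeping the background and the scene label unchanged. A small sketch under that assumption follows; the box format and the use of cv2.resize are illustrative.

```python
import numpy as np
import cv2  # used only to resize the generated object image (an assumed helper)

def compose_target_image(third_image: np.ndarray, sample_object: np.ndarray, box: tuple) -> np.ndarray:
    """Replace the mask-covered rectangle of the third image with a generated object image."""
    x1, y1, x2, y2 = box
    target = third_image.copy()
    patch = cv2.resize(sample_object, (x2 - x1, y2 - y1))
    target[y1:y2, x1:x2] = patch   # background stays the same, foreground object varies
    return target

# Combining one third image with many generated objects expands the training set;
# every composed target image keeps the original scene label (e.g. "office").
```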
  • Each first image in the first training data set is processed in steps S21 and S22, and then the multiple sample object images generated by the image generation model are combined with the third image respectively.
  • on the one hand, in terms of data volume, combining to obtain multiple target images expands the number of images in the first training data set.
  • on the other hand, the third image retains the background image related to scene recognition, and the sample object images generated by the image generation model are used as the foreground of the newly synthesized scene images.
  • the label corresponding to the new target image obtained by combining is still the first category (such as the office category), and the target image is used as the training data of the scene recognition model.
  • the scene recognition model is trained with multiple target images that have the same (or similar) background images, thereby reducing the attention (or sensitivity) of the scene recognition model to the intra-class differences of scene images of the same category, so that the scene recognition model pays less attention to the intra-class differences of scene images of the same category (such as different foreground images) and pays more attention to the intra-class similarity of scene images of the same category (such as the same background image), thereby improving the classification accuracy of the scene recognition model.
  • the training device inputs the target image into the first convolutional neural network and uses the data set of the target image to train the first convolutional neural network, and inputs the third image into the second convolutional neural network and uses the data set of the third image to train the second convolutional neural network, so as to obtain a scene recognition model, where the scene recognition model includes the first convolutional neural network and the second convolutional neural network.
  • the training data for the first convolutional neural network and the training data for the second convolutional neural network are different. That is, the training data for the first convolutional neural network is a large number of target images (that is, new scene images obtained by combining), while the training data for the second convolutional neural network is the third images (that is, the images obtained from the original images after masking out objects irrelevant to scene recognition).
  • an original image A of an office scene includes a foreground image (human face) and a background image.
  • the "face” in this original image A is an object unrelated to scene recognition, then the area where the "face” is located is blocked to obtain image B (third image), and image B will be used as the second convolution input to the neural network.
  • the same point between the target image and the third image is: the background images of the target image (image C, image D and image F) are all the same, and all come from the original image A; the image information of the third image (image B) also comes from the original image image A.
  • the difference between the target image and the third image is: the target image (image C, image D, and image F) includes not only images related to scene recognition, but also object images not related to scene recognition; the third image (image B) contains only images relevant to scene recognition. That is, in the process of training the scene recognition model, the two branch structures of the scene recognition model receive two training data at the same time.
  • a convolutional layer (also referred to as a "first convolutional layer") of the first convolutional neural network is used to extract image features of the target image.
  • the first convolutional neural network can be divided into multiple stages of convolutional feature extraction operations. For example, the convolutional feature extraction operations of the multiple stages can be recorded as "block_1", "block_2", ..., "block_n" in order from left to right (from shallow layers to high layers). The size of the image features corresponding to each stage is different, and the size of the image features becomes smaller from "block_1" to "block_n".
  • the size of block_1 is 224 ⁇ 224 ⁇ 64; the size of block_2 is 112 ⁇ 112 ⁇ 128; the size of block_3 is 56 ⁇ 56 ⁇ 256; the size of block_4 is 28 ⁇ 28 ⁇ 512; the size of block_5 is 14 ⁇ 14 ⁇ 512.
  • the feature maps of the two convolutional stages preceding the last convolutional stage block_n, namely block_n-2 and block_n-1, are pooled (for example, average pooling) and resized, and the features of block_n-2 and block_n-1 are then fused into the image features of the last block_n, so that multi-scale features are fused, that is, high-level features and shallow features are fused.
  • this enables the scene recognition model to pay more attention to global features.
  • the first convolutional neural network is trained with a large number of newly synthesized target images, in which object images irrelevant to scene recognition are introduced into scene images of the same category, so that the scene recognition model pays less attention to the features of the differing object images within scene images of the same category.
  • the "feature fusion" described in the embodiments of the present application may be implemented by concatenating image features (or feature maps) (concat for short), summing them, or taking a weighted average.
  • the convolutional layer (also referred to as the second convolutional layer) of the second convolutional neural network is used to extract image features of the third image.
  • the image features of the third image pass through the fully connected layer (the second fully connected layer), the image features output by the second fully connected layer are fused into the last convolutional layer block_n of the first convolutional neural network, and the fused image features pass through the fully connected layer of the first convolutional neural network (the first fully connected layer) and the classifier to output the classification result (label).
  • the image features of the third image extracted by the second convolutional neural network are image features related to scene recognition in the original image.
  • the second convolutional neural network is equivalent to an attention model: it fuses the image features it extracts into the last convolutional layer of the first convolutional neural network, which makes the scene recognition model pay more attention to the image features related to scene recognition. In addition, by occluding object images that are not related to scene recognition, the second convolutional neural network is trained only on images related to scene recognition, which reduces the detrimental effect of the inter-class similarity between scene images of different categories on the classification performance of the scene recognition model.
  • the shallow features extracted by the first convolutional neural network and the second convolutional neural network are similar.
  • therefore, the first convolutional neural network and the second convolutional neural network can reuse part of the convolutional layers.
  • both the first convolutional neural network and the second convolutional neural network include 20 convolutional layers
  • the first convolutional neural network and the second convolutional neural network can reuse the first 8 convolutional layers
  • the 9th to 20th convolutional layers in the first convolutional neural network (for example, denoted as "convolutional layer 9a" to "convolutional layer 20a") and the 9th to 20th convolutional layers in the second convolutional neural network (for example, denoted as "convolutional layer 9b" to "convolutional layer 20b") are deployed separately.
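  • The two-branch structure, the partially shared convolutional layers and the fusion of the second branch's fully connected output into the last convolutional stage of the first branch can be sketched as below. The layer counts (20 convolutional layers with the first 8 shared) follow the example above; the channel widths and the broadcast-add fusion are illustrative assumptions, not the only possible configuration.

```python
import torch
import torch.nn as nn

def conv_stack(n_layers, c_in=3, c_out=64):
    layers, c = [], c_in
    for _ in range(n_layers):
        layers += [nn.Conv2d(c, c_out, 3, padding=1), nn.ReLU(inplace=True)]
        c = c_out
    return nn.Sequential(*layers)

class SceneRecognitionModel(nn.Module):
    def __init__(self, num_classes, c=64):
        super().__init__()
        self.shared = conv_stack(8)                 # convolutional layers 1-8, reused by both branches
        self.branch1 = conv_stack(12, c, c)         # convolutional layers 9a-20a (first CNN)
        self.branch2 = conv_stack(12, c, c)         # convolutional layers 9b-20b (second CNN)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc2 = nn.Linear(c, c)                  # second fully connected layer
        self.fc1 = nn.Linear(c, num_classes)        # first fully connected layer + classifier

    def forward(self, target_image, third_image):
        f1 = self.branch1(self.shared(target_image))       # features of the target/first scene image
        f2 = self.branch2(self.shared(third_image))         # features of the masked, scene-relevant image
        attn = self.fc2(self.pool(f2).flatten(1))            # attention-like, scene-relevant features
        fused = f1 + attn[:, :, None, None]                  # fuse into the last conv stage of branch 1
        return self.fc1(self.pool(fused).flatten(1))         # classification result (label)
```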
  • the execution process of scene recognition is performed by the execution device.
  • the execution device may be a mobile phone.
  • Figure 9 is the architecture diagram of the object detection model and the scene recognition model.
  • the terminal device is loaded with a scene recognition model and an object detection model.
  • the object detection model is used to detect the region in the input image where objects that are not related to scene recognition are located, and the scene recognition model is used to classify the image to be recognized.
  • the architecture of the scene recognition model please refer to the architecture description corresponding to FIG. 6 above, which is not repeated here.
  • FIG. 10 is a schematic flowchart of steps of a scene recognition method.
  • Step S30 the execution device collects the image of the first scene to be recognized through the camera.
  • the camera may be actively turned on by the user. For example, the user clicks the camera icon, and the execution device receives the operation of the user clicking the camera, controls the camera to be turned on, and the camera captures the image of the first scene.
  • the camera may be enabled by an application (application, APP) call. For example, during a video call of an instant messaging APP, the camera is enabled, and the camera collects an image of the first scene.
  • the camera may be self-started after a scene recognition requirement is generated. For example, if the execution device detects a change in the position of the device through a sensor, the current scene of the execution device may also have changed, and the scene needs to be re-identified; therefore, the camera starts automatically and collects a first scene image.
  • the execution device takes a mobile phone as an example, and the camera may be a front-facing camera or a rear-facing camera, which is not specifically limited.
  • Step S31 the executing device detects the first area in the first scene image where the object irrelevant to scene recognition is located by using the object detection model.
  • the object detection model in this step is the object detection model obtained by training in steps S11 to S13 in the example corresponding to FIG. 4 above.
  • the execution device inputs the to-be-recognized first scene image into the object detection model, and the object detection model outputs position coordinates, where the position coordinates are used to indicate the first area.
  • the position coordinates are 4 pixel points that indicate a rectangular area, and the object image in the rectangular area (i.e., the first area) is an image irrelevant to scene recognition.
  • the first scene image is an image of an office scene
  • the middle area in the first scene image is an image of a "face”
  • the first area where the "face" is located is detected by the object detection model.
  • Step S32 The execution device performs mask processing on the first area to obtain a second scene image.
  • the function of mask processing is to block the first area, so that the second scene image does not contain images unrelated to scene recognition, but only contains images related to scene recognition.
  • the pixel value of the rectangular area where the "face” is located is set to "0", and the area where the "face” is located is blocked to obtain the second scene image.
  • Step S33 The execution device inputs the first scene image and the second scene image into the scene recognition model, and uses the scene recognition model to output the classification result.
  • the scene recognition model includes a first convolutional neural network and a second convolutional neural network, wherein the first convolutional neural network is used for receiving a first scene image and extracting a first image feature of the first scene image.
  • the second convolutional neural network is used for receiving the second scene image, extracting the second image feature of the second scene image, and outputting the second image feature to the last convolutional layer of the first convolutional neural network.
  • the second image features are fused to the first image features, and then the first convolutional neural network outputs the fused image features to the output layer (including the first fully connected layer and the classifier), and outputs the classification result through the output layer.
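  • Putting steps S30 to S33 together, a simplified inference pipeline might look like the sketch below; the detector and model objects, the returned box format and the mask_image helper reuse the illustrative assumptions made in the earlier sketches and are not the patent's concrete APIs.

```python
import torch

def recognize_scene(first_scene_image, object_detector, scene_model, preprocess):
    """first_scene_image: H x W x C array captured by the camera."""
    # S31: detect the first area (object irrelevant to scene recognition), assumed (x1, y1, x2, y2)
    box = object_detector(first_scene_image)
    # S32: mask the first area to obtain the second scene image (mask_image from the earlier sketch)
    second_scene_image = mask_image(first_scene_image, box, keep_inside=False)
    # S33: feed both images into the two-branch scene recognition model
    x1 = preprocess(first_scene_image).unsqueeze(0)
    x2 = preprocess(second_scene_image).unsqueeze(0)
    with torch.no_grad():
        logits = scene_model(x1, x2)
    return logits.argmax(dim=1).item()   # index of the predicted scene category
```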
  • the first convolutional neural network is obtained by learning the target images, and the target images are obtained by synthesizing the same background image with different object images (object images irrelevant to scene recognition).
  • the attention of the scene recognition model to the image features unrelated to scene recognition in the first scene image is reduced, thereby reducing the negative impact of intra-class differences between scene images of the same category on the classification performance of the scene recognition model.
  • the second convolutional neural network is obtained by learning images related to scene recognition, so that the scene recognition model extracts the image features of the part of the image related to scene recognition and pays more attention to the features of the first scene image that are relevant to scene recognition. This reduces the negative impact of the inter-class similarity between scene images of different categories on the classification performance of the scene recognition model, thereby greatly improving the accuracy of the classification result of the first scene image to be recognized.
  • the scene recognition method provided by the embodiment of the present application can be applied to many specific application scenarios.
  • the mobile phone can adjust the noise reduction mode of the headset according to the classification result of the scene image, see the description of S34A below.
  • the mobile phone can adjust the volume according to the classification result of the scene image, see the description of S34B below.
  • the following describes application scenarios to which the classification result of the first scene image can be applied.
  • the first mode, also called the "deep noise reduction mode"
  • the second mode, also called the "life noise reduction mode"
  • the third mode, also called the "transparent mode" or "monitoring mode"
  • the general principle of earphone noise reduction is that the earphone picks up the ambient noise through the microphone set on the earphone, and the earphone generates anti-noise waves to cancel the external sound, so that the external sound can be fully or partially reduced before it enters the user's ear.
  • the first mode is used to control the headset to turn on deep noise reduction, so that the headset shields most of the noise in the surrounding environment.
  • the second mode is used to control the earphone to activate normal noise reduction, so that the earphone can block a small part of the noise in the surrounding environment. When the headset activates the second mode, the user can hear some sounds in the external environment. This mode is suitable for restaurants, streets, shopping malls and other living places.
  • the third mode refers to preserving human voices and speech while reducing ambient noise, so as to avoid missing important work information.
  • the above three noise reduction modes are only illustrative and not limiting.
  • the method for switching the noise reduction mode of the earphone in the current technology will be described.
  • the user needs to set the noise reduction mode of the current headset through the setting interface in the mobile phone, such as "Settings” - "General” - “Noise Reduction Mode” - "Deep Noise Cancellation”.
  • if the user wants to adjust the noise reduction mode of the headset, the user needs to open the setting interface of the mobile phone and set the "deep noise reduction" mode to block all noise from the outside world.
  • Step S34A the execution device adjusts the noise reduction mode of the earphone according to the classification result of the first scene image.
  • the mobile phone can recognize the scene image, and automatically adjust the noise reduction mode of the earphone according to the classification result obtained by the scene recognition, and the user does not need to manually set the noise reduction mode.
  • different scenes have a corresponding relationship with the noise reduction modes, and the mobile phone can adjust the noise reduction mode according to the scene and the corresponding relationship between the scene and the noise reduction mode.
  • the correspondence between different scenes and noise reduction modes is shown in Table 1 below.
  • the correspondence between various noise reduction modes and scenes is only for illustration and does not constitute a limitation.
  • the correspondence in Table 1 above may be pre-configured by default.
  • the user can set the corresponding relationship between each noise reduction mode and the scene according to actual needs.
  • the mobile phone displays the setting interface, the mobile phone receives the user's selection operation (eg, click operation), and the mobile phone determines the corresponding relationship between each noise reduction mode and the scene according to the user's selection operation.
  • the user selects “subway”, “airport” and “high-speed rail”, and the mobile phone establishes the corresponding relationship between “subway”, “airport”, “high-speed rail” and deep noise reduction mode.
  • the user selects "cafe” and “supermarket”, and the mobile phone establishes the corresponding relationship between the life noise reduction mode and "cafe” and “supermarket”.
  • the user selects "Office”, and the mobile phone establishes the corresponding relationship between the monitoring noise reduction mode and the "Office”.
  • the mobile phone can automatically establish the correspondence between various noise reduction modes and scenes based on preset rules, statistical analysis and/or statistical learning, according to the user's historical setting data for noise reduction modes in different scenes.
  • the mobile phone collects a scene image of the user's current environment, uses the scene recognition model to identify the scene image, and obtains a recognition result, which is used to indicate the first scene (or environment) where the user is located.
  • the mobile phone queries the historical setting data.
  • the historical setting data includes the historical data of the corresponding relationships set by the user between the first scene and each earphone noise reduction mode. If the setting frequency of the corresponding relationship between the first scene and the first noise reduction mode is greater than the first threshold, the mobile phone automatically establishes the correspondence between the first scene and the first noise reduction mode.
  • the first scenario takes "subway" as an example, and the historical setting data is shown in Table 2 below.
  • the frequency (80%) with which the user set the "deep noise reduction mode" is greater than the first threshold (for example, the first threshold is 70%), and the frequency (20%) with which the user set the "life noise reduction mode" is less than the first threshold.
  • the mobile phone establishes a corresponding relationship between "subway” and "deep noise reduction mode".
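  • The frequency-based rule above can be expressed as a short sketch: count how often the user has paired a scene with each noise reduction mode in the historical setting data, and establish the correspondence automatically when the most frequent pairing exceeds the first threshold; the data layout and the 70% threshold are illustrative.

```python
from collections import Counter

def auto_establish_mapping(history, scene, first_threshold=0.7):
    """history: list of (scene, noise_reduction_mode) pairs from the user's past settings."""
    modes = [mode for s, mode in history if s == scene]
    if not modes:
        return None
    mode, count = Counter(modes).most_common(1)[0]
    if count / len(modes) > first_threshold:      # e.g. "deep noise reduction mode" set 80% of the time for "subway"
        return {scene: mode}                      # correspondence established automatically
    return None
```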
  • the individualized adjustment of the noise reduction mode can be realized without manual setting by the user.
  • the user can also manually modify the corresponding relationship automatically established by the mobile phone to perform personalized configuration.
  • the corresponding relationship between the first scene and the first noise reduction mode is modified to the corresponding relationship between the first scene and the second noise reduction mode.
  • the mobile phone displays the setting interface
  • the setting interface shows that "subway” and “deep noise reduction mode” have a corresponding relationship
  • "deep noise reduction mode” is associated with a selection key
  • in response to the user's operation of pressing the selection key, the mobile phone modifies the corresponding relationship between "subway" and "deep noise reduction mode" to a corresponding relationship between "subway" and "life noise reduction mode".
  • the mobile phone can receive the user's selection operation, modify the corresponding relationship between the scene and the noise reduction mode automatically established by the mobile phone, and perform personalized configuration, so that the user can configure the scene and the noise reduction mode according to their own environment and actual needs. Correspondence to improve user experience.
  • the user is currently in a subway environment and listens to music through headphones.
  • the user can turn on the camera of the mobile phone, or the camera of the mobile phone starts automatically.
  • the mobile phone collects a scene photo in the subway through the camera.
  • the image can be collected through the front camera of the mobile phone, or the image can be collected through the rear camera of the mobile phone, which is not limited in detail.
  • the mobile phone collects the scene image through the front camera, although the scene image contains the user's "face" image, through the scene recognition method in this embodiment, it can be accurately recognized that the classification result of the scene image is the first scene (like the "subway" scene).
  • the mobile phone switches the noise reduction mode of the earphone to the first noise reduction mode (e.g., the deep noise reduction mode) according to the first scene and the correspondence between the first scene and the first noise reduction mode.
  • the mobile phone can perform scene recognition on the collected scene images, and automatically adjust the noise reduction mode of the earphone according to the classification result of the scene recognition, and the user does not need to adjust the noise reduction mode according to the operation steps, which is convenient to implement.
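  • A minimal sketch of the automatic switching in step S34A follows: the recognized scene is looked up in the scene-to-mode correspondence (pre-configured or user-configured) and the resulting noise reduction mode is applied to the headset; the mapping contents and the set_noise_reduction_mode call are hypothetical placeholders, not a real headset API.

```python
# Example correspondence in the spirit of Table 1 (illustrative values only)
SCENE_TO_NR_MODE = {
    "subway": "deep noise reduction mode",
    "airport": "deep noise reduction mode",
    "cafe": "life noise reduction mode",
    "office": "monitoring noise reduction mode",
}

def adjust_noise_reduction(classification_result, headset):
    mode = SCENE_TO_NR_MODE.get(classification_result)
    if mode is not None:
        headset.set_noise_reduction_mode(mode)    # hypothetical headset interface
    return mode
```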
  • the mobile phone can acquire a frame of scene image at every interval, and then perform scene recognition on the scene image.
  • the duration of the time period can be 10 minutes, 15 minutes, or 20 minutes, etc.
  • the duration of the time period is set based on the approximate duration generally required for the user to move from one environment to another (for example, from the "subway" to the "office"); under normal circumstances, users do not frequently change their environment within a short period of time.
  • the time period is 10 minutes.
  • the camera of the user's mobile phone collects a frame of scene image every 10 minutes.
  • scene image A is collected at 2021.3.7 10:20:01
  • the classification result of scene image A recognized by the mobile phone is "subway”
  • the mobile phone adjusts the noise reduction mode of the earphone to the "deep noise reduction mode" according to the classification result, and the earphone enters the deep noise reduction mode; in this mode, the user can hardly hear external noise and can only hear the voice content of the other party in the video call.
  • the user leaves the subway at 2021.3.7 10:25:00, and the mobile phone collects scene image B at 2021.3.7 10:30:01.
  • the mobile phone recognizes the classification result of scene image B as "office", and adjusts the noise reduction mode to the "monitoring noise reduction mode" according to the classification result.
  • the headset is switched to the monitoring noise reduction mode.
  • the headset blocks the noise in the environment, and the user cannot hear the noise in the environment, but the user can still hear the sound of colleagues greeting in the office environment, and the voice of colleagues talking about communication issues. At the same time, the user can hear the voice content of the other party in the video call.
  • the mobile phone automatically adjusts the noise reduction mode of the earphone according to the classification result of the scene recognition, and the user does not need to manually adjust the noise reduction mode of the earphone step by step, thereby improving the user experience.
  • system sounds include earphone sounds, ringtones, call sounds and media sounds.
  • the user's environment differs, and so do the system volume requirements of the mobile phone. For example, in a noisy environment (such as a subway or supermarket), the user needs to turn up the system volume to hear the other party's voice clearly.
  • in a relatively quiet environment, such as an office or a library, a lower system volume is sufficient.
  • the user may need to repeatedly adjust the volume of the system sound in different environments.
  • the user may also directly set the ringtone and prompt tone of the mobile phone to mute in a quiet environment; however, muting may prevent the user from receiving and replying to calls and messages in time.
  • Step S34B The execution device adjusts the system volume of the execution device according to the classification result of the first scene image.
  • the mobile phone collects the scene image, and the mobile phone can adaptively adjust the system volume value according to the classification result of the scene image, and there is no need for the user to frequently adjust the system volume value of the mobile phone according to different environments.
  • the mobile phone displays the setting interface of the system volume value, and the setting interface displays the progress bar corresponding to each scene for adjusting the volume value.
  • the user can set the volume value corresponding to each scene by sliding the progress bar.
  • alternatively, there is no need for the user to set volume values corresponding to different scenarios, and the mobile phone configures the correspondence between different scenarios and system volume values by default according to empirical values.
  • the correspondence between different scenarios and the system volume values is shown in Table 3 below.
  • the specific scenarios shown in Table 3 below and the volume value corresponding to each scenario are only illustrative, and not limiting.
  • the user is in a cafe environment
  • the user turns on the camera of the mobile phone, or the camera of the mobile phone starts automatically
  • the mobile phone obtains a scene image C collected by the camera
  • the mobile phone performs scene recognition on the scene image C
  • the mobile phone adjusts the volume value of the system volume according to the classification result C. For example, the phone adjusts the system sound to 50 based on the "cafe" scene.
  • the volume value of the ringtone is 50; this lower volume will not disturb other people, yet still enables the user to hear the ringtone (or prompt tone), so as to prevent the user from missing a call.
  • when the user enters the subway from the coffee shop and is in the subway environment, the user turns on the mobile phone camera, or the mobile phone camera starts automatically; the mobile phone collects a scene image D through the camera, identifies the user's environment according to the scene image D to obtain the classification result D (subway scene), and adjusts the system volume value to 90 according to the classification result D, so that the user can still hear the system sound of the mobile phone in the subway.
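  • Step S34B follows the same pattern with a scene-to-volume correspondence in the spirit of Table 3; the volume values below match the examples above (50 for "cafe", 90 for "subway") while the set_system_volume call is a hypothetical placeholder.

```python
# Illustrative scene-to-volume values consistent with the examples above
SCENE_TO_VOLUME = {
    "cafe": 50,
    "office": 50,
    "subway": 90,
}

def adjust_system_volume(classification_result, device, default=60):
    volume = SCENE_TO_VOLUME.get(classification_result, default)
    device.set_system_volume(volume)              # hypothetical device interface
    return volume
```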
  • the instant messaging APP if the user is currently using the video call function of the instant messaging APP, the instant messaging APP has already called the camera, and the camera captures the scene image of the user in real time.
  • the mobile phone can acquire a frame of scene image at every interval, and then perform scene recognition on the scene image.
  • the duration of the time period can be 10 minutes, 15 minutes, or 20 minutes, etc.
  • the duration of the time period is set based on the approximate duration generally required for the user to move from one environment to another.
  • the camera of the user's mobile phone collects a frame of scene image every 10 minutes.
  • the scene image C is collected at 2021.3.8 10:20:01
  • the mobile phone recognizes the classification result of scene image C as "subway”
  • the mobile phone adjusts the volume value of the headset to 90 according to the classification result, and the volume in the headset increases, so that the user can clearly hear the sound in the headset.
  • the user leaves the subway at 2021.3.8 10:25:00, and the mobile phone collects scene image D at 2021.3.8 10:30:01.
  • the mobile phone recognizes the classification result of scene image D as "office", and adjusts the volume value of the headset to 50 according to the classification result.
  • the volume of the earphone is reduced to a moderate level, so that the user can still hear the voice content of the other party without discomfort to the ears, and the sound in the earphone will not leak out.
  • the mobile phone collects a scene image of the environment where the user is located through the camera, identifies the scene image, and adaptively adjusts the system volume value according to the classification result of the scene image (that is, the environment where the user is located), without the need for the user to repeatedly adjust the system volume manually according to the environment, thereby improving the user experience.
  • the user equipment receives the classification result of the first scene image to be recognized from the execution device, and the classification result is used to trigger the user equipment to adjust the noise reduction mode of the headset to the above first noise reduction mode.
  • the scene indicated by the classification result has a corresponding relationship with the first noise reduction mode. That is, for a specific description of how the user equipment adjusts the noise reduction mode of the earphone according to the classification result of the first scene image, please refer to the specific description of the above step S34A, which is not repeated here.
  • the user equipment receives a classification result of the first scene image to be recognized from the execution device, and the classification result is used to trigger the user equipment to adjust the system volume of the user equipment to the first volume value.
  • the scene indicated by the classification result has a corresponding relationship with the first volume value. That is, for the description of the user equipment adjusting the system volume value of the user equipment according to the classification result of the first scene image, please refer to the specific description of the above step S34B, which is not repeated here.
  • the present application also provides an apparatus to which the model training method is applied.
  • the model training method is applied to a model training apparatus, and the model training apparatus may be the training device described in the above method embodiments, or the model training apparatus may be a processor in the training device, or the model training apparatus may be a chip system in the training device.
  • the present application provides an embodiment of a model training apparatus 1400 .
  • the model training apparatus includes an acquisition module 1401 and a processing module 1402 .
  • an acquisition module 1401 configured to acquire a first training data set, where the first training data set includes a plurality of first images
  • the processing module 1402 is used to: identify a first area in the first image by using an object detection model, where the first area is an image area irrelevant to scene recognition; perform mask processing on the first area to obtain a third image; obtain multiple sample object images generated by the image generation model, where the sample object images are images of objects irrelevant to scene recognition; replace the multiple sample object images into the area covered by the mask in the third image respectively to obtain a plurality of target images; and train the first convolutional neural network using the data set of the target images and train the second convolutional neural network using the data set of the third images to obtain a scene recognition model.
  • the scene recognition model includes the first convolutional neural network and the second convolutional neural network.
  • the acquisition module 1401 is replaced by a transceiver module.
  • the transceiver module is a transceiver.
  • the transceiver has the function of sending and/or receiving.
  • the transceiver is replaced by a receiver and/or a transmitter.
  • the transceiver module is a communication interface.
  • the communication interface is an input-output interface or a transceiver circuit.
  • the input and output interface includes an input interface and an output interface.
  • the transceiver circuit includes an input interface circuit and an output interface circuit.
  • the processing module 1402 is a processor, and the processor is a general-purpose processor or a special-purpose processor or the like.
  • the processor includes a transceiver unit for implementing receiving and transmitting functions.
  • the transceiver unit is a transceiver circuit, or an interface, or an interface circuit.
  • Transceiver circuits, interfaces, or interface circuits for implementing receiving and transmitting functions are deployed separately, or optionally, integrated together.
  • the above-mentioned transceiver circuit, interface or interface circuit is used for reading and writing code or data, or the above-mentioned transceiver circuit, interface or interface circuit is used for signal transmission or transmission.
  • the obtaining module 1401 is configured to execute step S10 in the example corresponding to FIG. 4 and step S20 in the example corresponding to FIG. 7 .
  • the processing module 1402 is configured to execute steps S11 to S14 in the example corresponding to FIG. 4 , and steps S20 to S25 in the example corresponding to FIG. 7 .
  • processing module 1402 is further specifically configured to:
  • input the first image into an image recognition model, and use the image recognition model to obtain a first classification result of the first image and a heat map of the first image, where the heat map is used to display the region where the target object is located,
  • the image feature of the target object is an image feature unrelated to scene recognition, and the category indicated by the first classification result is a non-scene category or an incorrect scene category;
  • the first model is trained by using a second training data set to obtain the object detection model.
  • the second training data set includes a plurality of sample data, and each sample data includes input data and output data, where the input data is the second image and the output data is position coordinates, and the position coordinates are used to indicate the area where the target object is located.
  • the processing module 1402 is further configured to use the second image to train a generative adversarial network GAN to obtain the image generation model.
  • processing module 1402 is further specifically configured to:
  • the image features of the target image are extracted through the first convolutional layer of the first convolutional neural network, and the image features of the third image are extracted through the second convolutional layer of the second convolutional neural network, and outputting the image features of the third image to the first convolution layer for fusion with the image features of the target image;
  • the label of the first category is output according to the fused image feature through the output layer of the first convolutional neural network.
  • the functions of the processing module 1402 are implemented by a processing device, and part or all of the functions of the processing device are implemented by software, hardware, or a combination thereof. Therefore, it can be understood that each of the above modules can be implemented by software, hardware or a combination of the two.
  • the processing device includes a memory and a processor, wherein the memory is used to store a computer program, and the processor reads and executes the computer program stored in the memory to perform corresponding processing and/or steps in the above method embodiments.
  • Processors include, but are not limited to, one or more of CPUs, DSPs, image signal processors, neural network processing units (NPUs), and microcontrollers.
  • the processing device includes only a processor.
  • the memory for storing the computer program is located outside the processing device, and the processor is connected to the memory through a circuit/wire to read and execute the computer program stored in the memory.
  • part or all of the functions of the processing device are implemented by hardware.
  • the processing device includes an input interface circuit, a logic circuit and an output interface circuit.
  • the processing means may be one or more chips, or one or more integrated circuits.
  • the object detection model, the image generation model, and the scene recognition model can be neural network models, which can be embedded, integrated in or run on a neural network processor (NPU).
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1508 .
  • Unified memory 1506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 1502 through a direct memory access controller (DMAC) 1505 .
  • Input data is also moved into unified memory 1506 via the DMAC.
  • a bus interface unit (BIU) 1510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 1509.
  • the bus interface unit 1510 is used for the instruction fetch memory 1509 to obtain instructions from the external memory, and is also used for the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506 or the weight data to the weight memory 1502 or the input data to the input memory 1501 .
  • the vector calculation unit 1507 has multiple operation processing units, and if necessary, further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computation unit 1507 can store the processed output vectors to the unified buffer 1506.
  • the vector calculation unit 1507 may apply a nonlinear function to the output of the arithmetic circuit 1503, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 1507 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1503, such as for use in subsequent layers in a neural network.
  • the instruction fetch buffer 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504; the unified memory 1506, the input memory 1501, the weight memory 1502 and the instruction fetch memory 1509 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the present application provides an electronic device 1600 , which is the training device in the above method embodiment, and is used to perform the functions of the training device in the above method embodiment.
  • the electronic device 1600 is described by taking a server as an example.
  • the server includes one or more central processing units (CPUs) 1622 (e.g., one or more processors), memory 1632, and one or more storage media 1630 (e.g., one or more mass storage devices) that store applications 1642 or data 1644.
  • the program stored in the storage medium 1630 includes one or more modules (not shown in the figure), and each module includes a series of instructions to operate on the device.
  • the central processing unit 1622 is configured to communicate with the storage medium 1630 to execute a series of instruction operations in the storage medium 1630 on the server.
  • the server also includes one or more power supplies 1626 , one or more wired or wireless network interfaces 1650 , one or more input and output interfaces 1658 , and/or, one or more operating systems 1641 .
  • the central processing unit 1622 includes the NPU shown in FIG. 15 above.
  • the functions of the acquisition module 1401 in FIG. 14 are performed by the network interface 1650 in FIG. 16 .
  • the functions of the processing module 1402 in FIG. 14 are performed by the central processing unit 1622 in FIG. 16 .
  • the present application also provides a scene recognition device to which the scene recognition method is applied.
  • the scene recognition apparatus is configured to execute the function executed by the execution device in the foregoing method embodiment.
  • the scene identification apparatus may be the execution device in the above method embodiments, or the scene identification apparatus may also be a processor in the execution apparatus, or the scene identification apparatus may be a chip system in the execution apparatus.
  • the present application provides an embodiment of a scene identification apparatus 1700 .
  • the scene identification apparatus 1700 includes an acquisition module 1701 and a processing module 1702 , and optionally, the scene identification apparatus further includes a sending module 1703 .
  • an acquisition module 1701 configured to acquire a first scene image to be identified
  • a processing module 1702 configured to use an object detection model to detect a first area in the first scene image where objects unrelated to scene recognition are located;
  • the first convolutional neural network is obtained by training using the data set of the target image
  • the second convolutional neural network is obtained by training using the data set of the third image
  • the target image is obtained by respectively replacing the multiple sample object images generated by the image generation model into the first area in the third image.
  • the third image is obtained by using the object detection model to identify the first area irrelevant to scene recognition in the first image and performing mask processing on the first area, where the first image is an image in the training data set.
  • the object detection model, the image generation model, and the scene recognition model can be neural network models, which can be embedded, integrated in or run on the above-mentioned neural network processor (NPU) shown in FIG. 15 .
  • the acquisition module 1701 is replaced by a transceiver module.
  • the transceiver module is a transceiver.
  • the transceiver has the function of sending and/or receiving.
  • the transceiver is replaced by a receiver and/or a transmitter.
  • the transceiver module is a communication interface.
  • the communication interface is an input-output interface or a transceiver circuit.
  • the input and output interface includes an input interface and an output interface.
  • the transceiver circuit includes an input interface circuit and an output interface circuit.
  • the processing module 1702 is a processor, and the processor is a general-purpose processor or a special-purpose processor or the like.
  • the processor includes a transceiver unit for implementing receiving and transmitting functions.
  • the transceiver unit is a transceiver circuit, or an interface, or an interface circuit.
  • Transceiver circuits, interfaces, or interface circuits for implementing receiving and transmitting functions are deployed separately, or optionally, integrated together.
  • the above-mentioned transceiver circuit, interface or interface circuit is used for reading and writing code or data, or the above-mentioned transceiver circuit, interface or interface circuit is used for signal transmission or transmission.
  • the functions of the processing module 1702 are implemented by a processing device, and part or all of the functions of the processing device are implemented by software, hardware, or a combination thereof. Therefore, it can be understood that each of the above modules can be implemented by software, hardware or a combination of the two.
  • the processing device includes a memory and a processor, wherein the memory is used to store a computer program, and the processor reads and executes the computer program stored in the memory to perform corresponding processing and/or steps in the above method embodiments.
  • Processors include, but are not limited to, one or more of CPUs, DSPs, image signal processors, neural network processing units (NPUs), and microcontrollers.
  • the processing device includes only a processor.
  • the memory for storing the computer program is located outside the processing device, and the processor is connected to the memory through a circuit/wire to read and execute the computer program stored in the memory.
  • part or all of the functions of the processing device are implemented by hardware.
  • the processing device includes an input interface circuit, a logic circuit and an output interface circuit.
  • the processing means may be one or more chips, or one or more integrated circuits.
  • the obtaining module 1701 is configured to perform step S30 in the example corresponding to FIG. 10 in the above method embodiment.
  • the processing module 1702 is configured to execute steps S31 to S33 in the example corresponding to FIG. 10 in the above method embodiment.
  • the processing module 1702 is further configured to execute step S34A and step S34B.
  • the processing module 1702 is further configured to: extract the image features of the first scene image through the first convolutional layer of the first convolutional neural network, extract the image features of the second scene image through the second convolutional layer of the second convolutional neural network, and output the image features of the second scene image to the first convolutional layer to be fused with the image features of the first scene image; and output the classification result according to the fused image features through the output layer of the first convolutional neural network.
  • the classification result indicates a first scene
  • the first scene has a corresponding relationship with the first noise reduction mode of the earphone
  • the processing module 1702 is further configured to adjust the noise reduction mode of the earphone to the first noise reduction mode according to the classification result;
  • the sending module 1703 is configured to send the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust the noise reduction mode of the earphone to the first noise reduction mode.
  • the classification result indicates a first scene
  • the first scene and the first volume value have a corresponding relationship
  • the processing module 1702 is further configured to adjust the system volume of the execution device to the first volume value according to the classification result;
  • the sending module 1703 is configured to send the classification result to the user equipment, where the classification result is used to trigger the user equipment to adjust the system volume of the user equipment to the first volume value.
  • the sending module 1703 is replaced by a transceiver module.
  • the transceiver module is a transceiver.
  • the transceiver has the function of sending and/or receiving.
  • the transceiver is replaced by a receiver and/or a transmitter.
  • the transceiver module is a communication interface.
  • the communication interface is an input-output interface or a transceiver circuit.
  • the input and output interface includes an input interface and an output interface.
  • the transceiver circuit includes an input interface circuit and an output interface circuit.
  • the obtaining module 1701 is further configured to receive the first scene image to be identified sent by the user equipment; or, collect the first scene image to be identified through a camera or an image sensor.
  • the embodiment of the present application further provides another electronic device.
  • the electronic device 1800 is configured to execute the functions executed by the execution device in the foregoing method embodiments.
  • the electronic device is described by taking a mobile phone as an example.
  • the electronic device 1800 includes components such as a processor 1801 , a memory 1802 , an input unit 1803 , a display unit 1804 , a camera 1805 , a communication unit 1806 , and an audio circuit 1807 .
  • the memory 1802 can be used to store software programs and modules, and the processor 1801 executes various functional applications and data processing of the device by running the software programs and modules stored in the memory 1802 .
  • Memory 1802 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the processor 1801 may be the processing device mentioned in the embodiment corresponding to FIG. 17.
  • the processor 1801 includes, but is not limited to, various types of processors, such as one or more of the aforementioned CPU, DSP, image signal processor, the neural network processor shown in FIG. 15, and a microcontroller.
  • the input unit 1803 may be used to receive input numerical or character information, and generate key signal input related to user settings and function control of the device.
  • the input unit 1803 may include a touch panel 1831 .
  • the touch panel 1831, also known as a touch screen, collects the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 1831 with a finger, a stylus, or any other suitable object or accessory).
  • the display unit 1804 may be used to display various image information.
  • the display unit 1804 may include a display panel 1841.
  • the display panel 1841 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
  • the touch panel 1831 can be integrated with the display panel 1841 to realize the input and output functions of the device.
  • the camera 1805 is used for collecting the scene image to be recognized, or for collecting scene images and sending the collected scene images to the database.
  • the communication unit 1806 is configured to establish a communication channel, so that the electronic device is connected to a remote server through the communication channel, and obtains the object detection model and the scene recognition model from the remote server.
  • the communication unit 1806 may include communication modules such as a wireless local area network module, a Bluetooth module, and a baseband module, as well as a radio frequency (RF) circuit corresponding to the communication module, for performing wireless local area network communication, Bluetooth communication, infrared communication and/or cellular communication system communication.
  • the communication module is used to control the communication of various components in the electronic device, and can support direct memory access.
  • various communication modules in the communication unit 1806 generally appear in the form of integrated circuit chips, and can be selectively combined, without necessarily including all communication modules and corresponding antenna groups.
  • the communication unit 1806 may only include a baseband chip, a radio frequency chip, and a corresponding antenna to provide communication functions in a cellular communication system.
  • via the wireless communication connection established by the communication unit 1806, the electronic device may be connected to a cellular network or the Internet.
  • Audio circuit 1807, speaker 1808, and microphone 1809 may provide an audio interface between the user and the cell phone.
  • the audio circuit 1807 can transmit the electrical signal converted from the received audio data to the speaker 1808, and the speaker 1808 converts it into a sound signal and outputs it.
  • the microphone 1809 converts the collected sound signals into electrical signals, which are received by the audio circuit 1807 and converted into audio data; the audio data is then output to the processor 1801 for processing and subsequently sent, for example, to another mobile phone through the communication unit 1806, or output to the memory 1802 for further processing.
  • the electronic device is wired or wirelessly connected to an external headset (e.g., connected through a Bluetooth module); the communication unit 1806 is configured to send the scene image to be recognized to the training device and receive the classification result of the scene image from the server, and the processor 1801 is further configured to adjust the noise reduction mode of the earphone according to the classification result.
  • the processor 1801 is further configured to adjust the volume value of the system volume according to the classification result.
  • the processor 1801 is configured to perform scene recognition on the scene image to be recognized to obtain a classification result.
  • the processor 1801 adjusts the noise reduction mode of the earphone according to the classification result.
  • the processor 1801 is further configured to adjust the volume value of the system volume according to the classification result.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program; when the computer program runs on a computer, the computer is caused to execute the method executed by the training device in the above method embodiments, or the method executed by the execution device in the above method embodiments.
  • An embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, and the communication interface is, for example, an input/output interface, a pin, or a circuit.
  • the processor is configured to read the instructions to execute the method executed by the training device in the above method embodiments; or, the processor is configured to read the instructions to execute the method executed by the execution device in the above method embodiments.
  • An embodiment of the present application provides a computer program product that, when executed by a computer, implements the method executed by the training device in the foregoing method embodiments, or implements the method executed by the execution device in the foregoing method embodiments.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit (CPU), a microprocessor, or an application-specific integrated circuit (ASIC).

Abstract

A model training method, a scene recognition method, and a related device, used to improve the accuracy of scene recognition. The method in the embodiments of the present application comprises: obtaining a first image, identifying, by using an object detection model, an image of a target object in the first image that is irrelevant to scene recognition, and performing mask processing on the region of the first image where the target object is located to obtain a third image; then generating, by using an image generation model, multiple sample object images that are irrelevant to scene recognition, and combining the sample object images with the third image to obtain target images; and inputting the target images into a first convolutional neural network for training and inputting the third image into a second convolutional neural network for training, to obtain a scene recognition model. The scene recognition model pays less attention to the image features that differ among the target images and more easily learns the features that distinguish different scene categories, so that the scene recognition model can improve the accuracy of scene recognition.

Description

一种模型训练方法、场景识别方法及相关设备
本申请要求于2021年03月22日提交中国专利局、申请号为202110301843.5、发明名称为“一种模型训练方法、场景识别方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及人工智能应用领域中的计算机视觉领域中的一种模型训练方法、场景识别方法及相关设备。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。
人工智能的应用领域包括计算机视觉领域,场景识别是计算机视觉领域的重要分支技术。场景识别是指对图像中能够体现的环境或主体(人或物)所处的环境进行识别(或称为“分类”),旨在通过提取和分析场景图像中的特征,获取场景的信息,从而对图像所属的场景进行识别。
当前技术中的场景识别装置大多采用通用的图像识别模型(既用于识别对象,又用于识别场景)来对场景图像进行识别,通用的图像识别模型对场景识别的准确率有限,且场景识别的应用场景受限。
发明内容
本申请实施例提供了一种模型训练方法、场景识别方法及相关设备,用于提高场景识别的准确率。
第一方面,本申请提供了一种模型训练方法,该方法应用于训练设备,所述方法包括:训练设备获取第一训练数据集,第一训练数据集中包括多张第一图像,该第一图像为场景图像,例如,一张第一图像为“办公室”场景的图像,第一图像中可能包括与场景识别无关的物体的图像;训练设备利用物体检测模型识别第一图像中的第一区域,第一区域是与场景识别无关的图像区域;然后,训练设备对第一区域进行掩膜处理,得到第三图像;掩膜处理的作用是对第一区域进行遮挡;再后,训练设备获取图像生成模型生成的多张样本物体图像,样本物体图像是与场景识别无关的物体的图像;训练设备将多张样本物体图像分别替换到第三图像中掩膜覆盖的第一区域,得到多张目标图像;组合得到多张目标图像,一方面从数据量上来说,对第一训练数据集中的图像的数量进行了扩充。另一方面从图像之间的差异上来说,针对同一个类别的图片,第三图像中保留了与场景识别相关的背景的图像,而图像生成模型生成的样本物体图像用于作为新合成的目标图像之间的差异图像。最后,训练设备利用目标图像的数据集训练第一卷积神经网络,并利用第三图像的数据集训练第二卷积神经网络,得到场景识别模型,场景识别模型包括第一卷积神经网络和第二卷积神经网络。本申请实施例中,训练设备通过大量的新合成的目标图像对第一卷积神经网络进行训练,同一个类别的场景图像中引入与场景识别无关的物体的图像,使得场景识 别模型降低对场景图像中差异图像的特征的关注度,从而减弱类内差异性对场景识别模型分类性能造成的不利影响。另外,训练设备遮挡掉与场景识别无关的图像区域后,通过与场景识别有关的图像(即第三图像)对第二卷积神经网络进行训练,第二卷积神经网络更容易学习到不同场景类别之间差异特征,从而减弱类间相似性对场景识别模型分类性能造成的不利影响。训练设备得到的场景识别模型能够降低同类别的场景图像的类内差异性对场景识别模型的分类性能带来的负面影响,和不同场景类别的类间相似性对场景识别模型的分类性能带来的负面影响,进而能够提升场景识别的准确率。
在一个可选的实现方式中,所述方法还包括:训练设备将第一图像输入到图像识别模型,该图像识别模型为通用的图像识别模型(既用于图像识别,也用于场景识别),训练设备利用图像识别模型得到第一图像的第一分类结果及第一图像的热力图,其中,热力图用于展示目标物体所在的区域,目标物体的图像特征是与场景识别无关的图像特征,第一分类结果指示的类别为非场景类别或错误的场景类别;训练设备对第一图像中除了目标物体所在第一区域之外的第二区域进行掩膜处理,即对第二区域进行遮挡,得到第二图像(即仅包含目标物体的图像);然后,训练设备利用第二训练数据集对第一模型进行训练,得到物体检测模型,第二训练数据集包括多个样本数据,样本数据包括输入数据和输出数据,其中,输入数据为第二图像,输出数据为位置坐标,位置坐标用于指示目标物体所在的区域。本实施例中,训练设备通过第一图像的热力图可以确定第一图像中图像识别模型做出该分类决策影响最大的区域,通过热力图能够确定与场景识别无关的目标物体的位置,通过第二图像对第一模型(如神经网络)进行训练,得到物体检测模型,物体检测模型用于识别一个场景图像中哪部分区域与场景识别无关,进而也就可以确定出一个场景图像中哪部分区域与场景识别有关。
在一个可选的实现方式中,所述方法还包括:训练设备利用第二图像对生成式对抗网络GAN进行训练,得到图像生成模型。图像生成模型用于生成大量的与场景识别无关的多张样本物体图像,从而可以得到用于训练场景识别模型的目标图像,通过多张样本物体图像得到的多张目标图像既对第一训练数据集中的图像的数量进行了扩充,又可以针对同一个类别的图片,用于作为新合成的目标图像之间的差异图像,从而减弱类内差异性对场景识别模型分类性能造成的不利影响,以提高场景识别模型的性能。
在一个可选的实现方式中,目标图像和第三图像均对应第一类别的标签,所述利用目标图像训练第一卷积神经网络,并利用第三图像训练第二卷积神经网络可以具体包括:训练设备通过第一卷积神经网络的第一卷积层提取目标图像的图像特征,并通过第二卷积神经网络的第二卷积层提取第三图像的图像特征,并将第三图像的图像特征输出至第一卷积层,以与目标图像的图像特征进行融合;然后,融合后的图像特征输出至第一卷积神经网络的输出层,通过第一卷积神经网络的输出层(如全连接层和分类器)输出第一类别的标签。第二卷积神经网络提取的第三图像的图像特征是第一图像中与场景识别相关的图像特征,第二卷积神经网络等效于注意力模型,第二卷积神经网络将提取的图像特征融合到第一卷积神经网络的最后一层卷积层,使得场景识别模型更关注与场景识别相关的图像特征。并且通过遮挡掉与场景识别无关的物体图像(目标物体)后,通过与场景识别有关的图像 对第二卷积神经网络进行训练,第二卷积神经网络更容易学习到不同场景类别之间差异特征,从而减弱类间相似性对场景识别模型分类性能造成的不利影响。
第二方面,本申请实施例提供了一种场景识别方法,应用于执行设备,该方法包括:执行设备获取待识别的第一场景图像,然后,执行设备利用物体检测模型检测第一场景图像中与场景识别无关的物体所在的第一区域;执行设备对第一区域进行掩膜处理,得到第二场景图像;再后,执行设备将第一场景图像输入到场景识别模型中的第一卷积神经网络,将第二场景图像输入到场景识别模型中的第二卷积神经网络,利用场景识别模型输出分类结果,第一卷积神经网络是利用目标图像的数据集进行训练得到的,第二卷积神经网络是利用第三图像的数据集训练得到的,并且目标图像是由图像生成模型生成的多张样本物体图像分别替换到第三图像中的第一区域后得到的,第三图像是利用物体检测模型识别第一图像中与场景识别无关的第一区域后,对第一区域进行掩膜处理后得到的,第一图像是训练数据集中的图像。本申请实施例中,第一卷积神经网络是通过对目标图像学习后得到的,而目标图像是由相同的背景图像与不同的差异物体图像(与场景识别无关的物体的图像)进行合成后得到的。使得场景识别模型对第一场景图像中与场景识别无关的图像特征关注度降低,从而减少相同类别的场景图像之间的类内差异性对场景识别模型的分类性能带来的负面影响。第二卷积神经网络是通过对与场景识别有关的图像学习后得到的,使得场景识别图像提取与场景识别有关的那部分图像的图像特征,且更关注与第一场景图像中与场景识别有关的图像特征,能够降低不同类别的场景图像的类间相似性对场景识别模型的分类性能带来的负面影响。以使第一场景图像的分类结果的准确性大大提高。
在一个可选的实现方式中,执行设备将第一场景图像输入到场景识别模型中的第一卷积神经网络,将第二场景图像输入到场景识别模型中的第二卷积神经网络,利用场景识别模型输出分类结果可以具体包括:执行设备通过第一卷积神经网络的第一卷积层提取第一场景图像的图像特征,并通过第二卷积神经网络的第二卷积层提取第二场景图像的图像特征,并将第二场景图像的图像特征输出至第一卷积层,以与第一场景图像的图像特征进行融合,从而使场景识别模型关注全局信息;第一卷积神经网络将融合后的图像特征输出至输出层,通过第一卷积神经网络的输出层(全连接层和分类器)输出分类结果。
在一个可选的实现方式中,若分类结果指示第一场景,第一场景与耳机的第一降噪模式具有对应关系;当执行设备是终端设备时,执行设备与耳机连接,所述方法还包括:执行设备根据分类结果将耳机的降噪模式调整为第一降噪模式,执行设备可以对场景图像进行识别,根据场景识别得到的分类结果自动调整耳机的降噪模式,无需用户手动来设置耳机的降噪模式。或者,当执行设备是服务器时,用户设备与耳机连接,所述方法还包括:执行设备向用户设备发送分类结果,分类结果用于触发用户设备将耳机的降噪模式调整为第一降噪模式。本申请实施例中,执行设备可以对场景图像进行识别,并将分类结果发送给用户设备,从而使得用户设备根据场景识别得到的分类结果自动调整耳机的降噪模式,无需用户手动来设置耳机的降噪模式。
在一个可选的实现方式中,若分类结果指示第一场景,第一场景与第一音量值具有对应关系;当执行设备是终端设备时,所述方法还包括:执行设备根据分类结果将执行设备 的系统音量调整为第一音量值。本申请实施例中,执行设备能够根据场景图像的分类结果自适应调整系统音量值,无需用户根据不同的环境频繁调整手机的系统音量值。或者,当执行设备是服务器时,所述方法还包括:执行设备向用户设备发送分类结果,分类结果用于触发用户设备将用户设备的系统音量调整为第一音量值,从而使得用户设备能够根据场景识别得到的分类结果自动调整调整手机的系统音量值,无需用户手动来调整手机的系统音量值,提升用于体验。
在一个可选的实现方式中,所述获取待识别的第一场景图像可以包括:执行设备接收用户设备发送的待识别的第一场景图像;或者,执行设备通过摄像头或图像传感器采集待识别的第一场景图像。
第三方面,本申请实施例提供了一种模型训练装置,包括:
获取模块,用于获取第一训练数据集,第一训练数据集中包括多张第一图像;
处理模块,用于利用物体检测模型识别第一图像中的第一区域,第一区域是与场景识别无关的图像区域;对第一区域进行掩膜处理,得到第三图像;获取图像生成模型生成的多张样本物体图像,样本物体图像是与场景识别无关的物体的图像;将多张样本物体图像分别替换到第三图像中掩膜覆盖的第一区域,得到多张目标图像;利用目标图像的数据集训练第一卷积神经网络,并利用第三图像的数据集训练第二卷积神经网络,得到场景识别模型,场景识别模型包括第一卷积神经网络和第二卷积神经网络。
在一个可选的实现方式中,处理模块,还用于将第一图像输入到图像识别模型,利用图像识别模型得到第一图像的第一分类结果及第一图像的热力图,热力图用于展示目标物体所在的区域,目标物体的图像特征是与场景识别无关的图像特征,第一分类结果指示的类别为非场景类别或错误的场景类别;对第一图像中除了目标物体所在第一区域之外的第二区域进行掩膜处理,得到第二图像;利用第二训练数据集对第一模型进行训练,得到物体检测模型,第二训练数据集包括多个样本数据,样本数据包括输入数据和输出数据,其中,输入数据为第二图像,输出数据为位置坐标,位置坐标用于指示目标物体所在的区域。
在一个可选的实现方式中,处理模块,还用于利用第二图像对生成式对抗网络GAN进行训练,得到图像生成模型。
在一个可选的实现方式中,目标图像和第三图像均对应第一类别的标签;处理模块,还用于通过第一卷积神经网络的第一卷积层提取目标图像的图像特征,并通过第二卷积神经网络的第二卷积层提取第三图像的图像特征,并将第三图像的图像特征输出至第一卷积层,以与目标图像的图像特征进行融合;通过第一卷积神经网络的输出层根据融合后的图像特征输出第一类别的标签。
第四方面,本申请实施例提供了一种场景识别装置,包括:
获取模块,用于获取待识别的第一场景图像;
处理模块,用于利用物体检测模型检测第一场景图像中与场景识别无关的物体所在的第一区域;对第一区域进行掩膜处理,得到第二场景图像;将第一场景图像输入到场景识别模型中的第一卷积神经网络,将第二场景图像输入到场景识别模型中的第二卷积神经网络,利用场景识别模型输出分类结果,其中,第一卷积神经网络是利用目标图像的数据集 进行训练得到的,第二卷积神经网络是利用第三图像的数据集训练得到的,目标图像是由图像生成模型生成的多张样本物体图像分别替换到第三图像中的第一区域后得到的,第三图像是利用物体检测模型识别第一图像中与场景识别无关的第一区域后,对第一区域进行掩膜处理后得到的,第一图像是训练数据集中的图像。
在一个可选的实现方式中,处理模块,还用于通过第一卷积神经网络的第一卷积层提取第一场景图像的图像特征,并通过第二卷积神经网络的第二卷积层提取第二场景图像的图像特征,并将第二场景图像的图像特征输出至第一卷积层,以与第一场景图像的图像特征进行融合;通过第一卷积神经网络的输出层根据融合后的图像特征输出分类结果。
在一个可选的实现方式中,所述装置还包括发送模块;若分类结果指示第一场景,第一场景与耳机的第一降噪模式具有对应关系;处理模块,还用于根据分类结果将耳机的降噪模式调整为第一降噪模式;或者,发送模块,用于向用户设备发送分类结果,分类结果用于触发用户设备将耳机的降噪模式调整为第一降噪模式。
在一个可选的实现方式中,若分类结果指示第一场景,第一场景与第一音量值具有对应关系;处理模块,还用于根据分类结果将执行设备的系统音量调整为第一音量值;或者,发送模块,还用于向用户设备发送分类结果,分类结果用于触发用户设备将用户设备的系统音量调整为第一音量值。
在一个可选的实现方式中,获取模块还具体用于:接收用户设备发送的待识别的第一场景图像;或者,通过摄像头或图像传感器采集待识别的第一场景图像。
第五方面,本申请实施例提供了一种电子设备,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述电子设备执行如上述第一方面中任一项所述的方法;或者,当所述程序或指令被所述处理器执行时,使得所述电子设备执行如上述第二方面中任一项所述的方法。
第六方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品中包括计算机程序代码,当所述计算机程序代码被计算机执行时,使得计算机实现上述如上述第一方面中任一项所述的方法;或者,当所述计算机程序代码被计算机执行时,使得计算机实现上述如上述第二方面中任一项所述的方法。
第七方面,本申请实施例提供了一种计算机可读存储介质,用于储存计算机程序或指令,所述计算机程序或指令被执行时使得计算机执行如上述第一方面中任一项所述的方法;或者,所述计算机程序或指令被执行时使得计算机执行如上述第二方面中任一项所述的方法。
附图说明
图1为本申请实施例中的人工智能主体架构示意图;
图2A和图2B为本申请实施例中的系统架构的示意图;
图3为原图及原图的热力图的示意图;
图4为本申请实施例中对物体检测模型和图像生成模型的进行训练的步骤流程示意图;
图5为本申请实施例中对第一图像掩膜进行处理后,得到第二图像的示意图;
图6为本申请实施例中场景识别模型的架构的示意图;
图7为本申请实施例中对场景识别模型进行训练的步骤流程示意图;
图8为本申请实施例中对第一图像进行掩膜处理后,得到第三图像的示意图;
图9为本申请实施例中物体检测模型和场景识别模型的架构图;
图10为本申请实施例中一种场景识别方法的一个实施例的步骤流程示意图;
图11A、图11B和图11C为本申请实施例中耳机降噪模式与场景的对应关系的设置界面示意图;
图12为本申请实施例中修改场景与降噪模式的对应关系的场景示意图;
图13为本申请实施例中场景与系统音量值的对应关系的设置界面示意图;
图14为本申请实施例中一种模型训练装置的一个实施例的结构示意图;
图15为本申请实施例中神经网络处理器的一个实施例的结构示意图;
图16为本申请实施例中一种电子设备的结构示意图;
图17为本申请实施例中一种场景识别装置的一个实施例的结构示意图;
图18为本申请实施例中另一种电子设备的结构示意图。
具体实施方式
本申请涉及人工智能的应用领域中的计算机视觉领域,尤其涉及计算机视觉领域中的场景识别。首先对人工智能主体框架进行说明。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。
“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。
场景识别是计算机视觉领域的一个重要分支技术。场景识别是指对图像中能够体现的环境或主体(人或物)所处的环境进行识别(或称为“分类”)。相对于以主体(或称为“对象”)为中心的图像识别,场景识别关注图像的全局信息。由此识别装置容易将与环境不相关的物体作为识别场景的关键特征,导致场景识别的两个难点。其一,相同场景类别的场景图像之间具有差异性(即类内差异性),这种差异性可能是由与环境不相关的物体之间的差别带来的。例如,图像A是小明戴口罩在机场的照片,图像B是小明没有戴口罩在机场的照片,图像A和图像B同样是“机场”场景,识别装置更容易将图像A中与环境无关的“人脸”作为图像识别的关键特征,得到错误的分类结果(如“医院”)。其二,不同场景类别的场景图像之间具有相似性(即类间相似性),这种类间相似性可能是由与环境不相关的物体的相似性带来的。例如,图像C为高铁内部包括座椅的图像,图像D为机场内部包括座椅的图像。识别装置更容易将座椅作为识别场景的关键特征,对图像D进行场景识别,将图像D中的座椅作为识别的关键特征,得到错误的分类结果(如“高铁”)。类内差异性和类间相似性导致场景识别的准确率降低。
基于上述问题,本申请实施例提供了一种场景图像识别方法,该方法用于降低场景图像的类内差异性及类间相似性,从而提高场景识别的准确率。请参阅图2A所示,本申请实施例提供了一种系统架构,数据采集设备210用于采集图像,将采集的图像作为训练数据存入数据库230,训练设备220基于数据库230中维护的图像数据生成物体检测模型和场景识别模型。其中,物体检测模型用于检测待识别的图像中“与场景(环境)识别无关”的区域。场景识别模型用于对待识别的场景图像进行识别。训练设备220由一个或多个服务器实现,可选地,训练设备220由一个或多个终端设备实现。执行设备240获取到来自训练设备220的物体检测模型和场景识别模型,将物体检测模型和场景识别模型装载于执行设备240内。执行设备240获取待识别的场景图像后,能够利用物体检测模型和场景识 别模型对待识别的场景图像进行识别,得到分类结果。执行设备240为终端设备,例如,执行设备240包括但不限于手机、个人计算机、平板电脑、可穿戴设备(例如手表、手环、VR/AR设备)和车载终端等。可选地,请参阅图2B所示,系统架构还包括用户设备250,用户设备250包括但不限于手机、个人计算机、平板电脑、可穿戴设备(例如手表、手环、VR/AR设备)和车载终端等。执行设备240由一个或多个服务器实现。用户设备250可以通过任何通信机制或通信标准的通信网络与执行设备240进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。用户设备250用于采集待识别的场景图像,并将待识别的场景图像发送至执行设备240,执行设备240接收来自用户设备250的待识别的场景图像,利用物体检测模型和场景识别模型对待识别的场景图像进行识别,得到分类结果。执行设备240将该分类结果发送至用户设备250。可选地,训练设备220和执行设备240可以是相同的设备,例如,一个服务器(或服务器集群)既用于实现训练设备220的功能,又用于实现执行设备240的功能。
本申请实施例提供一种模型训练方法,该方法应用于上述系统架构中的训练设备。训练设备获取第一训练数据集,第一训练数据集中包括多张第一图像,利用物体检测模型识别第一图像中与场景识别无关的目标物体的图像,训练设备对第一图像中目标物体所在的区域进行掩膜处理,得到第三图像(即仅包含与场景识别有关的图像)。训练设备利用图像生成模型生成大量的与场景识别无关的样本物体图像,组合样本物体图像和第三图像,得到组合后的目标图像。训练设备将组合后的目标图像输入到第一卷积神经网络进行训练,并将第三图像输入到第二卷积神经网络进行训练,得到场景识别模型,场景识别模型包括第一卷积神经网络和第二卷积神经网络。通过大量的新合成的目标图像对第一卷积神经网络进行训练,从而使同一个类别的场景图像中引入与场景识别无关的物体图像,进而使场景识别模型降低对目标图像中产生差异的图像特征的关注度,从而降低同类别的场景图像的类内差异性对场景识别模型的分类性能带来的负面影响。并且通过与场景识别有关的图像对第二卷积神经网络进行训练,从而使第二卷积神经网络更容易学习到不同场景类别之间差异特征,进而降低不同场景类别的类间相似性对场景识别模型的分类性能带来的负面影响,提升了场景识别的准确率。
并且,本申请实施例提供了一种场景识别方法,该方法应用于上述系统架构中的执行设备。执行设备通过摄像头和/或图像传感器采集待识别的第一场景图像。然后,执行设备利用上述训练设备得到的物体检测模型检测第一场景图像中与场景识别无关的物体所在的第一区域。执行设备对第一区域进行掩膜处理,得到第二场景图像。执行设备将所述第一场景图像和第二场景图像输入到上述训练设备得到场景识别模型,利用所述场景识别模型输出分类结果。
为了更好理解本申请,首先对本申请中涉及的词语进行说明。
场景识别,是指对图像中能够体现的环境或对象(人或物)所处的环境进行分类,场景图像的类别可以包括但不限定于“机场”类,“高铁”类,“医院”类,“办公室”类,“咖啡厅”等等。可选地,场景图像的类别还可以是例如“室内场景”类、“室外场景”类,或者“嘈杂场景”类、“安静场景”类、“监听场景”类等。场景图像的类别根据具 体的应用场景进行配置,具体的并不限定。
场景图片的类内差异性,是指同一类别的场景图片具有差异,导致类内差异性大的图片容易被误分到其他类别中。例如,一张办公室场景的图像中包括“人脸”图像,该包含“人脸”的办公室图片由于引入了差异信息(人脸的图像)被误分到其他类别,即被误分到非“办公室”类别。
场景图片的类间相似性,是指不同类别的场景图像中具有相似的物体图像,导致不同类别的场景图像被误分到一类中。例如,高铁内部的图片和机场内部的图片中都包括“椅子”,由于“椅子”具有相似性,高铁内部的图片和机场内部的图片容易被分到同一类中,例如都被分到“高铁”类别,或者都被分到“机场”类别。
热力图(gradient-weighted class activation map,CAM),是帮助可视化卷积神经网络(convolutional neural networks,CNN)的工具,用于描述一张图像中的哪个局部位置让CNN做出了最终的分类决策。CAM中包括与输出类别相关的二维特征网格,每个网络的位置表示输出类别的重要程度。请参阅图3所示,图3为原图及原图的热力图的示意图,以热力图的形式呈现图像中每个网格位置与分类结果的相似程度。图3中包括一只猫和一只狗,CNN将该图像分类到“狗”的类别,从热力图上能够看出,CNN是识别到了“狗的脸部位置”的特征,即将狗的脸部的特征作为了分类的关键特征,将该图像分类到“狗”的类别中。
下面对热力图的基本原理进行简要说明。将一张图像输入到卷积神经网络中,通过卷积神经网络提取图像特征,对卷积神经网络模型的最后一个特征图(feature map)做全局平均池化(global average pooling,GAP),计算各通道均值,然后计算最大的那个类别的输出相对于最后一个特征图的梯度,再把这个梯度可视化到原图上。直观来说,热力图能够展现卷积神经网络抽取到的高层特征的哪部分对最终的分类决策影响最大。
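As a concrete illustration of the heat-map principle described above, the following is a minimal Grad-CAM sketch. It assumes Python with PyTorch and torchvision and uses a VGG-16 backbone only as an example; the patent does not name any framework, so the layer index, hook usage and function names here are assumptions rather than the actual implementation.

```python
# Minimal Grad-CAM sketch (PyTorch/torchvision assumed; VGG-16 and layer index 28 are illustrative).
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
feature_maps, gradients = [], []

# Hook the last convolutional layer to capture its feature map and its gradient.
last_conv = model.features[28]
last_conv.register_forward_hook(lambda m, i, o: feature_maps.append(o))
last_conv.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

def grad_cam(image):                                   # image: (1, 3, H, W), normalized
    logits = model(image)
    score = logits[0, logits.argmax()]                 # score of the predicted class
    model.zero_grad()
    score.backward()                                   # gradient of that class w.r.t. the last feature map
    fmap, grad = feature_maps[-1][0], gradients[-1][0]
    weights = grad.mean(dim=(1, 2))                    # global average pooling of the gradients
    cam = F.relu((weights[:, None, None] * fmap).sum(0))
    cam = cam / (cam.max() + 1e-8)                     # normalize before overlaying on the original image
    return F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]
```

The resulting map highlights which regions of the input drove the final classification decision, which is how the heat map is used here to locate objects that are irrelevant to scene recognition.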
生成式对抗网络(generative adversarial networks,GAN),用于生成样本数据。本申请中GAN用于生成一个图像中与场景识别无关的物体的图像。GAN包括生成模型(generative model,G)和判别模型(discriminative model,D)。其中,生成模型用于生成一个类似真实训练数据的样本,目标是越像真实样本越好。判别模型是一个二分类器,用于估计一个样本来自于真实训练样本的概率,若判别模型估计样本来自于真实的训练样本,判别模型输出大概率。若判别模型估计样本来自于生成模型生成的样本,则判别模型输出小概率。可以理解为,生成模型的目标是想方设法生成和真实样本一样的样本,使得判别模型判别不出来。而判别模型的目标是想方设法检测出来生成模型生成的样本。通过G和D的对抗与博弈,使得GAN生成的样本接近真实的样本,从而可以得到大量的样本数据。
本申请包括两个部分,第一个部分:模型训练过程。第二个部分:执行(推理)过程。下面首先对模型训练的过程进行说明。
一、模型训练过程。训练过程的执行主体是训练设备。模型训练的过程主要涉及3个模型:物体检测模型,图像生成模型和场景识别模型。
请参阅图4所示,首先对物体检测模型和图像生成模型的训练过程进行说明。
S10、训练设备获取第一训练数据集,第一训练数据集中包括多张第一图像(或者称为“原始图像”)。
数据采集设备采集图像,并将采集的图像存入数据库。训练设备从数据库获取第一训练数据集。例如,数据采集设备为带有图像传感器的设备,如照相机,摄像机或手机等。第一训练数据集中包括大量的不同类别的图像。例如,A1-“机场”类,A2-“高铁”类,A3-“地铁”类,A4-“办公室”类,A5-“医院”等等,具体的并不限定。应理解,对于第一训练数据集中图像的分类按照不同的需求分类方式有多种,具体的分类依据具体的应用场景不同而具有不同的分类。需要说明的是,本申请中为了区别第一训练数据集中的原始图像和对原始图像进行处理后的图像,将原始图像称为“第一图像”。对第一图像中“与场景识别有关的图像”进行掩膜处理后的图像称为“第二图像”(仅保留与场景识别无关的物体图像)。对第一图像中“与场景识别无关的物体图像”进行掩膜处理后的图像称为“第三图像”(仅保留与场景识别有关的图像)。
S11、训练设备将第一图像输入到图像识别模型,利用图像识别模型得到第一图像的第一分类结果及第一图像的热力图。
其中,热力图用于展示目标物体所在的区域,目标物体的图像特征是与场景识别无关的图像特征,第一分类结果指示的类别为非场景类别或错误的场景类别。图像识别模型是通用的对象识别模型,用于识别图像中的目标物体(或称为“目标对象”)。例如,第一图像是“一个人在办公室工作”的场景图像,将该第一图像输入到通用的图像识别模型,图像识别模型输出第一图像的第一分类结果是“人”,通过第一图像的热力图得到图像识别模型做出该分类决策影响最大的区域(即人脸所在的区域)。通用的图像识别模型更关注图像中主体的图像特征,故而输出的分类结果所指示的类别(如“人”)是非场景类别或错误的场景类别。
上述步骤S11的目的是得到第一图像的热力图,通过热力图能够确定与场景识别无关的目标物体(如“人脸”)的位置,从而可以得到仅包括目标物体的图像(下述步骤S12),也可以得到第一图像中遮挡目标物体后剩余区域的图像(下述步骤S22)。
S12、训练设备对第一图像中除了目标物体所在第一区域之外的第二区域进行掩膜处理,得到第二图像(即仅包含目标物体的图像)。
第一图像为第一训练数据集中的任意一张图像。第一训练数据集中的每张图像均会经过上述步骤S11和步骤S12的处理,即得到第二训练数据集,第二训练数据集中包含多张第二图像。需要说明的是,本申请中为了区别图像中“与场景识别无关的物体所在的区域”和“与场景识别有关的区域”,将“与场景识别无关的物体所在的区域”称为“第一区域”,将“与场景识别有关的区域”称为第二区域。示例性的,请参阅图5所示,图5中以第一图像A为例进行说明,第一图像A为包括“人脸”的办公室场景的图像,“人脸”501为与“办公室”场景识别无关的目标物体,“人脸”所在的第一区域502是与场景识别无关的区域。第一图像A中除了第一区域502之外的区域为第二区域503,对第二区域503进行掩膜处理(如将第二区域的像素值设置为0),得到的图像为第二图像A。
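The two masking operations used throughout this section (keeping only the target object for the second image in step S12, and later masking the target object out for the third image in step S22) can be pictured with a small sketch. The (x1, y1, x2, y2) box format, the zero-fill, the NumPy layout and the function name are assumptions for illustration only.

```python
# Illustrative masking, given the bounding box of the scene-irrelevant object.
import numpy as np

def split_by_box(first_image: np.ndarray, box):
    """first_image: H x W x 3 array; box: (x1, y1, x2, y2) of the irrelevant object (first region)."""
    x1, y1, x2, y2 = box
    second_image = np.zeros_like(first_image)              # mask the second region: keep only the object
    second_image[y1:y2, x1:x2] = first_image[y1:y2, x1:x2]
    third_image = first_image.copy()                       # mask the first region: keep only the background
    third_image[y1:y2, x1:x2] = 0
    return second_image, third_image
```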
S13、训练设备利用第二训练数据集对第一模型进行训练,得到物体检测模型。物体检 测模型用于检测输入的第一图像中与场景识别无关的物体所在的第一区域。其中,第一模型可以是神经网络模型。
第二训练数据集包括多个样本数据,每个样本数据包括输入数据和输出数据,其中输入数据为第二图像,输出数据为位置坐标,位置坐标用于指示目标物体所在的矩形区域。
根据上述步骤S10-步骤S13,得到物体检测模型。
S14、训练设备通过第二图像对GAN网络进行训练,得到图像生成模型。训练设备利用图像生成模型生成与目标物体同类别的多个样本物体图像。
通过第二图像对GAN网络进行优化(或者说训练)的过程如下。当固定生成模型(G)的时候,对于判别模型(D)进行优化。当输入第二图像(即真实数据)时,D优化网络结构使自己输出1。当输入来自G生成的数据时,D优化网络结构使自己输出0。当固定D的时候,G优化自己的网络使自己输出尽可能和真实数据一样的样本,并且使得生成的样本经过D的判别之后,D能够输出高概率值。G和D的训练过程交替进行,这个对抗的过程使得G生成的图像越来越逼真,D“打假”的能力也越来越强。
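The alternating optimization of D and G described above corresponds to a standard GAN training step, sketched below under the assumption that PyTorch is used, that the latent dimension is 100, and that the discriminator ends in a sigmoid so that binary cross-entropy applies; none of these choices come from the patent.

```python
# Schematic GAN training step: fix G and optimize D, then fix D and optimize G.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real_images, opt_g, opt_d, z_dim=100):
    b = real_images.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Optimize D: push D(real) toward 1 and D(G(z)) toward 0.
    opt_d.zero_grad()
    d_loss = bce(D(real_images), ones) + bce(D(G(torch.randn(b, z_dim)).detach()), zeros)
    d_loss.backward()
    opt_d.step()

    # 2) Optimize G: push D(G(z)) toward 1 so that generated samples fool D.
    opt_g.zero_grad()
    g_loss = bce(D(G(torch.randn(b, z_dim))), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```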
例如,第二图像A为“人脸”的图像,图像生成模型会生成大量的“人脸”图像,图像生成模型生成的“人脸”图像并不是现实中某人的“人脸”,而是图像生成模型根据对第二图像A的学习后制造出来的,具有真实“人脸”的全部特征。再如,如果第二图像B是“椅子”的图像,图像生成模型会生成大量的“椅子”图像等等。
通过上述步骤S10-步骤S12,和步骤S14,得到图像生成模型。上述步骤S13和步骤S14没有时序上的限定,S13和S14可以同步执行,即同步得到图像生成模型和物体检测模型。或者,S13在步骤S14之前执行,即先得到物体检测模型,后得到图像生成模型。或者,S13在步骤S14之后执行,即先得到图像生成模型,后得到物体检测模型。
下面对场景识别模型的训练过程进行说明。首先对场景识别模型的架构进行说明。请参阅图6所示,场景识别模型包括两个分支结构(或者称为一个主干结构和一个分支结构),两个分支结构为两个并联的子网络。为了区分两个子网络,两个子网络分别称为第一卷积神经网络和第二卷积神经网络。其中,第一卷积神经网络包括多个第一卷积层、第一全连接层和分类器。其中,第一卷积层、第一全连接层和分类器依次连接。第二卷积神经网络包括多个第二卷积层和第二全连接层。第二全连接层连接到最后一层第一卷积层。本申请实施例中,为了区分第一卷积神经网络和第二卷积神经网络中的卷积层和全连接层,将第一卷积神经网络中的卷积层称为“第一卷积层”,将第二卷积神经网络中的卷积层称为“第二卷积层”,将第一卷积神经网络中的全连接层称为“第一全连接层”,第二卷积神经网络中的全连接层称为“第二全连接层”。
请参阅图7所示,场景识别模型的训练过程如下述步骤S20-步骤S25所示。
S20、训练设备获取第一数据训练集。第一训练数据集中包括多个第一图像(或称为“原始图像”)。
本步骤请参阅上述图5对应的示例中的步骤S10的说明,此处不赘述。
S21、训练设备将第一图像输入到物体检测模型,利用物体检测模型识别第一图像中的第一区域。第一区域是与场景识别无关的图像区域。物体检测模型是上述图4对应的示例 中的步骤S11-步骤S13得到的物体检测模型。
示例性地,第一图像C是一张前景图像是人脸,背景图像是办公室的一张场景图像。将第一图像C输入到物体检测模型,物体检测模型输出4个坐标点,4个坐标点指示出包括人脸的第一区域,第一区域是与场景识别无关的区域。
S22、训练设备对第一区域进行掩膜处理,得到第三图像。
示例性的,请参阅图8所示,第一图像C包括人脸501的区域是第一区域502,第一图像C中除了第一区域502之外的区域为第二区域503,对第一区域502进行掩膜处理,得到第三图像。掩膜处理的作用是对第一区域502进行遮挡,例如,将第一区域的像素值设置为“0”,使得第三图像中仅包含第二区域503的图像,即第三图像中主要包含与场景识别有关的图像。
S23、训练设备获取图像生成模型生成的多张样本物体图像。样本物体图像是与场景识别无关的物体的图像。
图像生成模型根据第一训练集中每张第一图像中“与场景识别无关”的物体生成大量的样本物体图像。本步骤请参阅上述图4对应的示例中的S14的说明,此处不赘述。
S24、训练设备将多张样本物体图像分别替换到第三图像中掩膜覆盖的区域,得到多张目标图像。
第三图像可以理解是已经遮挡掉与场景识别无关的物体图像(也称为“干扰图像”)后,仅包括与场景识别有关的背景的图像。例如,针对第一类别的场景图像中的一张图像,其中,第一类别为多个场景类别中的任一种类别。第一类别以“办公室”类别为例,如针对“办公室”类中的一张场景图像(第一图像A),训练设备将第一图像A中的干扰图像“人脸”对应的区域屏蔽后,得到第三图像A。然后,将图像生成模型生成的大量的不同的“人脸”图像替换到第三图像A中掩膜覆盖的区域,组合得到多张目标图像(组合后的新图像)。组合后的多张目标图像对应的标签仍然是“办公室”类。再如,针对“办公室”类别的另一张场景图像(第一图像B),将第一图像B中的干扰图像“椅子”对应的区域屏蔽后,得到第三图像B,即第三图像B中包含掩膜覆盖的区域,然后,将图像生成模型生成的大量的“椅子”图像分别替换到第三图像B中掩膜覆盖的区域,组合得到多张目标图像,多张目标图像对应的标签仍然是“办公室”。可选地,训练设备也可以将图像生成模型生成的“椅子”图像替换到第三图像A中掩膜覆盖的区域,组合得到多张目标图像。或者,训练设备将图像生成模型生成的“人脸”图像替换到第三图像B中掩膜覆盖的区域,组合得到多张目标图像。
本步骤中,组合第三图像和图像生成模型生成的样本物体图像,从而能够得到大量的新的目标图像。第一训练数据集中的每张第一图像都经过步骤S21和步骤S22处理,然后将图像生成模型生成的多张样本物体图像分别与第三图像组合。组合得到多张目标图像,一方面从数据量上来说,对第一训练数据集中的图像的数量进行了扩充。另一方面从图像之间的差异上来说,针对同一个类别的图片,第三图像中保留了与场景识别相关的背景的图像,而图像生成模型生成的样本物体图像用于作为新合成的场景图片之间的差异图像,组合得到的新的目标图像所对应的标签还是第一类别(如办公室类别),目标图像用于作为 场景识别模型的训练数据。用多张目标图像训练场景识别模型,多张目标图像有相同(或相似的)的背景的图像,从而降低场景识别模型对同一类别的场景图像的类内差异性的关注度(或敏感度),使得场景识别模型更少地关注同一类别的场景图像的类内差异性(如不同的前景图像),从而更多地关注同一场景图像的类内相似性(如相同的背景图像),进而提高场景识别模型的分类准确率。
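A sketch of step S24 follows: each sample object image produced by the image generation model is pasted into the mask-covered first region of the third image, and every synthesized target image keeps the original scene label. OpenCV-style resizing and all variable names are assumptions.

```python
# Synthesizing target images by filling the masked region with generated object patches.
import cv2

def compose_targets(third_image, box, sample_objects, label):
    x1, y1, x2, y2 = box
    targets = []
    for patch in sample_objects:                          # patches from the image generation model
        resized = cv2.resize(patch, (x2 - x1, y2 - y1))   # fit the patch to the masked first region
        img = third_image.copy()
        img[y1:y2, x1:x2] = resized
        targets.append((img, label))                      # the label stays the original scene class
    return targets
```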
S25、训练设备将目标图像输入到第一卷积神经网络,利用目标图像的数据集训练第一卷积神经网络,并将第三图像输入到第二卷积神经网络,利用第三图像的数据集训练第二卷积神经网络,得到场景识别模型,场景识别模型包括第一卷积神经网络和第二卷积神经网络。
在对场景识别模型进行训练时,对于第一卷积神经网络的训练数据和对第二卷积神经网络的训练数据是不同的。即对第一卷积神经网络的训练数据是大量的目标图像(即组合得到的新的场景图像),而对第二卷积神经网络的训练数据是第三图像(即将原始图像中与场景识别无关的物体掩膜掉后的图像)。
举例来说,一张办公室场景的原始图像A包括前景图像(人脸)和背景图像。其中“人脸”在这张原始图像A中是与场景识别无关的物体,那么将“人脸”所在的区域遮挡掉,得到图像B(第三图像),图像B就会作为第二卷积神经网络的输入。同时,将图像B中被遮挡的区域替换成其他的与场景识别不相关的物体图像(如物体生成模型生成的人脸,或椅子等),就会得到多张目标图像(如图像C,图像D和图像F等等)。多张目标图像就会作为第一卷积神经网络的输入。目标图像和第三图像的相同点是:目标图像(图像C,图像D和图像F)的背景图像都相同,都来自于原始图像A;第三图像(图像B)的图像信息也来自于原始图像A。目标图像和第三图像的不同点是:目标图像(图像C,图像D和图像F)中既包括与场景识别相关的图像,也包含与场景识别无关的物体图像;第三图像(图像B)中只包含与场景识别相关的图像。即对场景识别模型训练的过程中,场景识别模型的两个分支结构同时接收到两路训练数据。
下面对两个分支结构分别进行说明。针对第一分支结构,第一卷积神经网络的卷积层(也称为“第一卷积层”)用于提取目标图像的图像特征。第一卷积神经网络中可分为多个阶段的卷积特征提取操作,例如,多个阶段的卷积特征提取操作按照从左到右的顺序(从浅层到高层)可以记为“block_1”,“block_2”…“block_n”。每个阶段对应的图像的尺寸不同,从“block_1”到“block_n”的图像特征(feature)的尺寸变小。n以5为例,block_1的尺度为224×224×64;block_2的尺寸为112×112×128;block_3的尺寸为56×56×256;block_4的尺寸为28×28×512;block_5的尺寸为14×14×512。将最后一个卷积层(block_n)的前两个卷积层(block_n-2和block_n-1)的特征图通过池化(如平均池化),改变这两个block的尺寸后,将block_n-2和block_n-1的特征融合到最后一个block_n的图像特征中,从而使得多尺度特征进行融合,即高层次的特征和浅层次的特征进行融合。使得场景识别模型能够更关注全局特征。并且,通过大量的新合成的目标图像对第一卷积神经网络进行训练,同一个类别的场景图像中引入与场景识别无关的物体图像,使得场景识别模型对场景图像中差异图像的特征降低关注度,从而减弱类内差异性对 场景识别模型分类性能造成的不利影响。需要说明的是,本申请实施例中所述的“特征融合”可以通过对图像特征(或称为特征图)进行拼接(concatenate,简称concat)、求和或加权平均等方式实现。
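The multi-scale fusion just described, pooling the feature maps of block_n-2 and block_n-1 down to the spatial size of block_n before fusing, might look roughly like the sketch below (PyTorch assumed; the channel counts follow the VGG-style sizes quoted above). Concatenation is shown, but as noted, summation or a weighted average would serve equally.

```python
# Fusing shallower feature maps into the last convolutional block (n = 5 in the example above).
import torch
import torch.nn.functional as F

def fuse_multiscale(f_block3, f_block4, f_block5):
    """f_block3: (B, 256, 56, 56); f_block4: (B, 512, 28, 28); f_block5: (B, 512, 14, 14)."""
    target = f_block5.shape[-2:]                          # 14 x 14
    p3 = F.adaptive_avg_pool2d(f_block3, target)          # pool the shallower features down
    p4 = F.adaptive_avg_pool2d(f_block4, target)
    return torch.cat([p3, p4, f_block5], dim=1)           # concat; sum or weighted average also possible
```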
同时针对第二分支结构,第二卷积神经网络的卷积层(也称为第二卷积层)用于提取第三图像的图像特征。第三图像的图像特征经过全连接层(第二全连接层),第二全连接层输出的图像特征融合到第一卷积神经网络的最后一层卷积层block_n,融合之后的图像特征通过第一卷积神经网络的全连接层(第一全连接层)和分类器输出分类结果(标签)。第二卷积神经网络提取的第三图像的图像特征是原始图像中与场景识别相关的图像特征,第二卷积神经网络等效于注意力模型,第二卷积神经网络将提取的图像特征融合到第一卷积神经网络的最后一层卷积层,使得场景识别模型更关注与场景识别相关的图像特征。并且通过遮挡掉与场景识别无关的物体图像后,通过与场景识别有关的图像对第二卷积神经网络进行训练,第二卷积神经网络更容易学习到不同场景类别之间差异特征,从而减弱类间相似性对场景识别模型分类性能造成的不利影响。
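Putting the two branches together, a deliberately small layout of the scene recognition model could look like the following sketch (PyTorch assumed). Every layer size is illustrative; only the topology follows the text: two parallel convolutional branches, the second branch's fully connected output fused into the last convolutional features of the first branch, then the first branch's fully connected layer and classifier.

```python
# Rough two-branch scene recognition model; all sizes are placeholders.
import torch
import torch.nn as nn

class SceneRecognizer(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        def branch():                                      # tiny stand-in for a deep convolutional stack
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(7))                   # (B, 128, 7, 7)
        self.branch1 = branch()                            # first CNN: full (target / first scene) image
        self.branch2 = branch()                            # second CNN: masked, scene-relevant image
        self.branch2_fc = nn.Linear(128 * 7 * 7, 128 * 7 * 7)   # "second fully connected layer"
        self.head = nn.Sequential(                         # first fully connected layer + classifier
            nn.Flatten(), nn.Linear(2 * 128 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, full_image, masked_image):
        f1 = self.branch1(full_image)
        f2 = self.branch2_fc(self.branch2(masked_image).flatten(1)).view_as(f1)
        fused = torch.cat([f1, f2], dim=1)                 # fuse into the last convolutional features
        return self.head(fused)
```

If, as discussed next, the two branches share their first few convolutional layers to reduce the parameter count, branch1 and branch2 would simply be built on one common stem.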
可选地,第一卷积神经网络和第二卷积神经网络提取到的浅层特征具有相似性,为了减少模型参数量,及减小模型体积,第一卷积神经网络和第二卷积神经网络可以复用部分卷积层。例如,第一卷积神经网络和第二卷积神经网络均包括20个卷积层,第一卷积神经网络和第二卷积神经网络可以复用前8个卷积层,而第一卷积神经网络中的第9个卷积层-第20个卷积层(如记做“卷积层9a-卷积层20a”)与第二卷积神经网络中的第9个卷积层-第20个卷积层(如记做“卷积层9b-卷积层20b”)分别部署。
二、场景识别的执行过程。场景识别的执行过程的执行主体是执行设备。例如执行设备可以是手机。
请参阅图9所示,图9是物体检测模型和场景识别模型的架构图。终端设备中装载有场景识别模型和物体检测模型。物体检测模型用于检测输入的图像中与场景识别无关的物体所在的区域,场景识别模型用于对待识别的图像进行场景分类。场景识别模型的架构请参阅上述图6对应的架构说明,此处不赘述。请参阅图10所示,图10为一种场景识别方法的步骤流程示意图。
步骤S30、执行设备通过摄像头采集待识别的第一场景图像。
摄像头可以是用户主动开启的,例如,用户点击摄像头图标,执行设备接收到用户点击摄像头的操作,控制开启摄像头,摄像头采集第一场景图像。或者,摄像头可以是应用(application,APP)调用开启的,例如,在即时通信APP的视频通话过程中,摄像头开启,摄像头采集第一场景图像。又或者,摄像头可以是产生场景识别需求后自启动的,例如,执行设备通过传感器检测到设备的位置变化,执行设备当前所处场景可能也发生变化,需要重新识别场景,因此摄像头自启动,摄像头采集第一场景图像。执行设备以手机为例,摄像头可以是前置摄像头,也可以是后置摄像头,具体的并不限定。
步骤S31、执行设备利用物体检测模型检测第一场景图像中与场景识别无关的物体所在的第一区域。
本步骤中的物体检测模型是上述图4对应的示例中步骤S11-步骤S13中训练得到的物 体检测模型。执行设备将待识别的第一场景图像输入到物体检测模型,物体检测模型输出位置坐标,位置坐标用于指示第一区域。例如,位置坐标为4个像素点,4个像素点指示一个矩形区域,矩形区域(即第一区域)内的物体图像是与场景识别无关的图像。例如,第一场景图像是一张办公室场景的图像,第一场景图像中的中间区域是一个“人脸”的图像,通过物体检测模型检测到“人脸”所在第一区域。
步骤S32、执行设备对所述第一区域进行掩膜处理,得到第二场景图像。
掩膜处理的作用是对第一区域进行遮挡,使得第二场景图像中不包含与场景识别无关的图像,仅包含与场景识别有关的图像。例如,将“人脸”所在的矩形区域像素值设置为“0”,遮挡“人脸”所在的区域,得到第二场景图像。
步骤S33、执行设备将所述第一场景图像和第二场景图像输入到场景识别模型,利用场景识别模型输出分类结果。
场景识别模型包括第一卷积神经网络和第二卷积神经网络,其中,所述第一卷积神经网络用于接收第一场景图像,并提取第一场景图像的第一图像特征。第二卷积神经网络用于接收第二场景图像,并提取所述第二场景图像的第二图像特征,将第二图像特征输出至第一卷积神经网络的最后一层卷积层,将第二图像特征融合到第一图像特征,然后,第一卷积神经网络将融合后的图像特征输出至输出层(包括第一全连接层和分类器),通过输出层输出分类结果。
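End to end, steps S30 to S33 amount to the inference sketch below. It assumes the image is an H x W x 3 NumPy array, and `detector`, `scene_model` and `to_tensor` are assumed helpers standing in for the object detection model, the trained scene recognition model and the usual preprocessing; the patent does not define these names.

```python
# Inference pipeline: detect the irrelevant region, mask it, classify with both images.
import torch

def classify_scene(first_scene_image, detector, scene_model, to_tensor):
    x1, y1, x2, y2 = detector(first_scene_image)          # first region of the scene-irrelevant object
    second_scene_image = first_scene_image.copy()
    second_scene_image[y1:y2, x1:x2] = 0                  # mask the first region
    with torch.no_grad():
        logits = scene_model(to_tensor(first_scene_image).unsqueeze(0),
                             to_tensor(second_scene_image).unsqueeze(0))
    return logits.argmax(dim=1).item()                    # index of the predicted scene class
```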
本申请实施例中,第一卷积神经网络是通过对目标图像学习后得到的,而目标图像是由相同背景图像与不同的差异物体图像(与场景识别无关的物体图像)进行合成后得到的。使得场景识别模型对第一场景图像中与场景识别无关的图像特征关注度降低,从而减少相同类别的场景图像之间的类内差异性对场景识别模型的分类性能带来的负面影响。第二卷积神经网络是通过对与场景识别有关的图像学习后得到的,使得场景识别图像提取与场景识别有关的那部分图像的图像特征,且更关注与第一场景图像中与场景识别有关的图像特征,能够降低不同类别的场景图像的类间相似性对场景识别模型的分类性能带来的负面影响。以使待识别的第一场景图像的分类结果的准确性大大提高。
本申请实施例提供的场景识别方法可以应用到很多具体的应用场景。在第一个应用场景中,手机能够根据场景图像的分类结果调整耳机的降噪模式,参见下述S34A的说明。在第二个应用场景中,手机能够根据场景图像的分类结果调整音量,参见下述S34B的说明。下面对第一场景图像的分类结果能够应用的应用场景进行说明。
第一个应用场景的相关说明。首先对耳机的降噪模式进行简要说明。手里中预先配置两个及两个以上的降噪模式。如第一模式(或称为“深度降噪模式”),第二模式(或称为“生活降噪模式”),第三模式(或称为“通透模式”或称为“监听模式”)等。耳机降噪的一般原理是,耳机通过耳机上设置的微麦拾取环境噪声,耳机产生抗噪声波将外部声音抵消,使得外部声音进入用户的耳朵之前实现全部降噪或部分降噪。其中,第一模式用于控制耳机开启深度降噪,使得耳机屏蔽周围环境中的大部分噪音。公共交通诸如机场,火车,地铁等的轰鸣声以及闹市区环境的嘈杂声容易带给人纷扰焦躁的感觉,如果将耳机切换到深度降噪模式,则可以有效隔绝环境的嘈杂声。第二模式用于控制耳机启动普通降噪,使 得耳机屏蔽周围环境周的少部分噪音。当耳机启动第二模式时,用户能够听到外界环境中部分声音,这种模式适用于餐厅,街道,商场等生活场所,在日常生活中能够过滤掉部分嘈杂的噪音,但同时也能够感知到周围环境的声音。第三模式是指在降低环境噪音的情况下,同时兼顾人声和语音,避免错过重要的工作信息。上述三种降噪模式仅是示例性说明,并非限定。
对当前技术中,耳机切换降噪模式的方法进行说明。当耳机连接手机时,用户需要通过手机中的设置界面来设置当前耳机的降噪模式,如“设置”-“通用”-“降噪模式”-“深度降噪”。例如,当前用户在地铁上,用户想要调整耳机的降噪模式,需要打开手机的设置界面,设置“深度降噪”模式,以屏蔽掉外界全部的噪音。当用户到超市时,用户需要重新打开设置界面,如“设置”-“耳机”-“降噪模式”-“生活降噪”等,操作步骤繁琐。或者,另一种切换降噪模式的方法中,打开降噪开关后,用户同时按压“音量+”、“音量-”按键进行循环切换三种降噪模式。按压“音量+”、“-按键”一次,进入生活降噪。第二次按压“音量+”、“-按键”,进入兼听模式。第三次按压“音量+”、“-按键”,切换到“深度降噪”模式,这种通过物理按键的方式切换耳机降噪模式,也是需要用户多次按压物理键来切换降噪模式,用户操作不便。
步骤S34A、执行设备根据第一场景图像的分类结果调整耳机的降噪模式。
本申请实施例中,手机可以对场景图像进行识别,根据场景识别得到的分类结果自动调整耳机的降噪模式,无需用户手动来设置降噪模式。示例性的,不同的场景与降噪模式具有对应关系,手机可以根据场景及场景与降噪模型的对应关系来调整降噪模式。不同的场景和降噪模式如下表1所示。
表1
降噪模式 场景
深度降噪模式 地铁、机场、高铁
生活降噪模式 咖啡厅、超市
监听降噪模式 办公室
上表1中,各种降噪模式和场景的对应关系仅是举例说明,并不造成限定。上表1中的对应关系可以是预先默认配置的。或者,用户可以根据实际需要,自行设置各降噪模式和场景的对应关系。例如,请参阅图11A-图11C所示,手机显示设置界面,手机接收用户的选择操作(例如点击操作),手机根据用户的选择操作确定各降噪模式与场景的对应关系。如在深度降噪模式的设置界面,用户勾选“地铁”、“机场”和“高铁”,手机建立“地铁”、“机场”、“高铁”和深度降噪模式的对应关系。同理,在生活降噪模式的设置界面,用户勾选“咖啡厅”和“超市”,手机建立生活降噪模式与“咖啡厅”和“超市”的对应关系。在监听降噪模式的设置界面,用户勾选“办公室”,手机建立监听降噪模式与“办公室”的 对应关系。
在另一种可能的实现方式中,手机可根据用户在不同场景下对降噪模式的历史设置数据,基于预设规则、统计分析和/或统计学习,自动建立各种降噪模式和场景的对应关系。手机采集用户当前所处环境的场景图像,利用场景识别模型对场景图像进行识别,得到识别结果,识别结果用于指示用户所处的第一场景(或环境)。手机查询历史设置数据,历史设置数据包括用户设置的第一场景与每种耳机降噪模式的对应关系的历史数据,若第一场景与第一降噪模式的对应关系的设置频次大于第一阈值,则手机自动建立第一场景与第一降噪模式的对应关系。例如,第一场景以“地铁”为例,历史设置数据如下表2所示。
表2
场景 降噪模式 设置频次
地铁 深度降噪模式 80%
地铁 生活降噪模式 20%
从上述表2可以看出在历史设置数据中,地铁场景下,用户设置“深度降噪模式”的频次(80%)大于第一阈值(如第一阈值为70%),用户设置“生活降噪模式”的频次(20%)小于第一阈值。则手机建立“地铁”与“深度降噪模式”的对应关系。本实现方式中,不需要用户手动设置,即可实现降噪模式的个性化调整。可选地,用户也可以手动修改手机自动建立的对应关系,进行个性化配置。将第一场景与第一降噪模式的对应关系修改为第一场景与第二降噪模式的对应关系。例如,请参阅图12所示,手机显示设置界面,设置界面显示“地铁”与“深度降噪模式”具有对应关系,“深度降噪模式”关联有选择键,手机响应于用户对选择键的操作,将“地铁”与“深度降噪模式”的对应关系修改为“地铁”与“生活降噪模式”。本实现方式中,手机可以接收用户的选择操作,修改手机自动建立的场景与降噪模式的对应关系,进行个性化配置,使得用户可以根据自身所处的环境及实际需求配置场景与降噪模式的对应关系,提升用户体验。
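The frequency rule behind Table 2 reduces to a short sketch: a scene is bound to a noise-cancellation mode once the user has chosen that mode in that scene more often than the threshold. The 70% threshold and the subway example mirror the text; the data structure and names are assumptions.

```python
# Building the scene-to-mode mapping from the user's historical settings.
from collections import Counter

def build_scene_mode_map(history, threshold=0.7):
    """history: list of (scene, mode) pairs recorded from past manual settings."""
    mapping = {}
    for scene in {s for s, _ in history}:
        modes = Counter(m for s, m in history if s == scene)
        mode, count = modes.most_common(1)[0]
        if count / sum(modes.values()) > threshold:        # e.g. 80% "deep" in the subway > 70%
            mapping[scene] = mode                          # the user can still override this manually
    return mapping

history = [("subway", "deep")] * 8 + [("subway", "life")] * 2
print(build_scene_mode_map(history))                       # {'subway': 'deep'}
```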
本申请实施例的一个应用场景,用户当前身处地铁环境中,通过耳机听音乐,用户可以开启手机的摄像头,或者手机的摄像头自启动,手机通过摄像头采集一张地铁内部中的场景照片,手机可以通过手机的前摄像头采集图像,也可以通过手机的后摄像头采集图像,具体的不限定。例如,手机通过前置摄像头采集到场景图像,虽然场景图像中包含用户的“人脸”图像,通过本实施例中场景识别的方法,可以准确的识别到该场景图像的分类结果为第一场景(如“地铁”场景)。手机根据第一场景及第一场景与第一降噪模式(如深度降噪模式)的对应关系,将耳机的降噪模型切换到第一降噪模式(如深度降噪模式)。本申请实施例中,手机可以对采集的场景图像进行场景识别,根据场景识别的分类结果自动调整耳机的降噪模式,不需要用户按照操作步骤调整降噪模式,实现方便。
再如,在另一个应用场景中,如果用户当前正在使用即时通信APP的视频通话功能,即时通信APP已经调用摄像头,摄像头实时采集用户所处的场景图像。为了降低手机的计算量,手机可以每间隔一个时间段获取一帧场景图像,然后对该场景图像进行场景识别。 例如,时间段的时长可以为10分钟,15分钟或20分钟等。时间段的时长的设置依据是,通常情况下,用户从一个环境到另一个不同的环境大概需要的时长。一般情况下,用户不会在短时间内频繁变更所处的环境。例如,用户从“地铁”到“办公室”。或者,从“办公室”到“超市”等,需要有时间间隔。例如,时间段以10分钟为例。当用户使用手机与对方进行视频通话时,用户的手机的摄像头每间隔10分钟采集一帧场景图像。例如,在2021.3.7 10:20:01采集场景图像A,手机识别场景图像A的分类结果是“地铁”,然后手机根据该分类结果调整手机降噪模式为“深度降噪模式”,耳机执行深度降噪模式,用户几乎听不到外界的噪声,仅能听到视频通话中对方的语音内容。用户在2021.3.7 10:25:00出地铁,手机在2021.3.7 10:30:01采集场景图像B,手机识别场景图像B的分类结果是“办公室”,手机根据该分类结果调整手机降噪模式为“监听降噪模式”。耳机切换到监听降噪模式,耳机屏蔽了环境中的噪声声音,用户听不到环境中的噪声声音,但是用户依然能听到办公室环境中同事打招呼的声音,以及同事间谈论交流问题的声音,同时用户能够听到视频通话中对方的语音内容。本申请实施例中,手机通过场景识别的分类结果自动调整耳机降噪模式,不需要用户手动按步骤调整耳机降噪模式,提升用户体验。
第二个应用场景的相关说明。系统声音包括耳机,铃声,通话及媒体的声音等。用户所处的环境不同,对于手机的系统音量需求不同。例如,在嘈杂的环境(如地铁,超市)中时,用户需要将系统音量调大,如需要将铃声和提示音的音量调大,才不会错过电话或信息,需要将通话音量调大,才能听清楚对方的声音。而当用户在比较安静的环境(如办公室,图书馆等)中时,又不希望手机的系统声音的音量过大。如手机的铃声和提示音等音量过大会影响到其他人,并且如果通话音量过大,可能泄露隐私。在这种相对安静的环境中,用户又会将系统声音的音量降低。如此,用户可能需要在不同的环境中反复调整系统声音的音量大小。通常情况下,用户为了方便调整系统声音的音量,在安静的环境中,直接将手机的铃声和提示音调整为静音,虽然用户的这种操作可以避免打扰他人,但是直接将手机的铃声调整为静音,也会使用户不能及时接到并回复用户的电话及信息。
步骤S34B、执行设备根据第一场景图像的分类结果调整执行设备的系统音量。
本申请实施例中,手机采集场景图像,手机能够根据场景图像的分类结果自适应调整系统音量值,无需用户根据不同的环境频繁调整手机的系统音量值。示例性的,请参阅图13所示,手机显示系统音量值的设置界面,设置界面显示每个场景对应的用于调整音量值的进度条,用户可以通过滑动进度条来设置每个场景对应的音量值。或者,在另一种实现方式中,无需用户设置不同场景对应的音量值,手机根据经验值,默认配置不同的场景与系统音量值的对应关系。不同的场景与系统音量值的对应关系如下表3所示。下表3中示出的具体场景及每个场景对应的音量值仅是示例性说明,并非限定。
表3
场景 系统音量值
地铁 90
机场 90
高铁 80
咖啡厅 50
超市 50
办公室 20
在一个应用场景中,用户身处于咖啡厅环境中,用户开启手机摄像头,或者手机摄像头自启动,手机获取摄像头采集到的一张场景图像C,手机对场景图像C进行场景识别,得到分类结果C(咖啡厅场景)。手机根据该分类结果C调整系统音量的音量值。例如,手机根据“咖啡厅”场景将系统声音调整至50。当手机来电,铃声的音量值为50,较小的音量既不会打扰到其他人,而且又能使用户听见铃声(或提示音),避免用户错失来电。当用户从咖啡厅进入到地铁后,用户身处地铁环境中,用户开启手机摄像头,或者手机摄像头自启动,手机通过摄像头采集到场景图像D,手机根据场景图像D对用户所处的环境进行识别,得到分类结果D(地铁场景),手机根据该分类结果D(地铁场景)将系统音量值调整至90,从而使得用户可以在地铁中仍然可以听到手机的系统声音。
再如,在另一个应用场景中,如果用户当前正在使用即时通信APP的视频通话功能,即时通信APP已经调用摄像头,摄像头实时采集用户所处的场景图像。为了降低手机的计算量,手机可以每间隔一个时间段获取一帧场景图像,然后对该场景图像进行场景识别。例如,时间段的时长可以为10分钟,15分钟或20分钟等。时间段的时长的设置依据是,通常情况下,用户从一个环境到另一个不同的环境大概需要的时长。当用户使用手机与对方进行视频通话时,用户的手机的摄像头每间隔10分钟采集一帧场景图像。例如,在2021.3.8 10:20:01采集场景图像C,手机识别场景图像C的分类结果是“地铁”,然后手机根据该分类结果调整耳机的音量值是90,耳机中的音量增大,用户可以清楚地听到耳机中的声音。用户在2021.3.8 10:25:00出地铁,手机在2021.3.8 10:30:01采集场景图像D,手机识别场景图像D的分类结果是“办公室”,手机根据该分类结果调整耳机的音量值是50。耳机的音量减小,用户既能够听到对方的语音内容,而且耳机的音量适中,不会引起用户耳朵不适,也不会泄露耳机中的语音信息。
本申请实施例中,手机通过摄像头采集用户所处环境的场景图像,然后对场景图像进行识别,根据场景图像的分类结果,即用户所处的环境来自适应调整系统音量值,无需用户根据所处的不同环境反复手动调节系统音量,提升用户体验。
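The volume adaptation described in this application scenario reduces to a periodic lookup, sketched below. The scene-to-volume table mirrors Table 3, the ten-minute interval follows the example above, and `capture_frame`, `classify_scene` and `set_system_volume` are assumed device-side helpers that the patent does not name.

```python
# Periodically classify the current scene and adjust the system volume accordingly.
import time

SCENE_VOLUME = {"subway": 90, "airport": 90, "high_speed_rail": 80,
                "cafe": 50, "supermarket": 50, "office": 20}

def volume_loop(capture_frame, classify_scene, set_system_volume, interval_s=600):
    while True:                                            # one frame roughly every ten minutes
        scene = classify_scene(capture_frame())
        if scene in SCENE_VOLUME:
            set_system_volume(SCENE_VOLUME[scene])
        time.sleep(interval_s)
```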
应理解,在上述图2B对应的架构中,用户设备(如手机)从执行设备接收待识别的第一场景图像的分类结果,分类结果用于触发用户设备将耳机的降噪模式调整为所述第一降噪模式。该分类结果所指示的场景与第一降噪模式具有对应关系。即用户设备根据第一场景图像的分类结果调整耳机的降噪模式的具体说明请参阅上述步骤S34A的具体说明,此处不赘述。
可选地,用户设备(如手机)从执行设备接收待识别的第一场景图像的分类结果,分类结果用于触发用户设备将用户设备的系统音量调整为所述第一音量值。该分类结果所指示的场景与第一音量值具有对应关系。即用户设备根据第一场景图像的分类结果调整用户设备的系统音量值的说明请参阅上述步骤S34B的具体说明,此处不赘述。
相对于上述方法实施例,本申请还提供了模型训练方法所应用的装置。模型训练方法应用于一种模型训练装置,该模型训练装置可以是上述方法实施例中所述的训练设备,或者,模型训练装置也可以是训练设备中的处理器,或者,模型训练装置可以是训练设备中的芯片系统。请参阅图14所示,本申请提供了一种模型训练装置1400的一个实施例,模型训练装置包括获取模块1401和处理模块1402。
获取模块1401,用于获取第一训练数据集,所述第一训练数据集中包括多张第一图像;
处理模块1402,用于利用物体检测模型识别所述第一图像中的第一区域,所述第一区域是与场景识别无关的图像区域;对所述第一区域进行掩膜处理,得到第三图像;获取图像生成模型生成的多张样本物体图像,所述样本物体图像是与场景识别无关的物体的图像;将所述多张样本物体图像分别替换到所述第三图像中掩膜覆盖的第一区域,得到多张目标图像;利用所述目标图像的数据集训练第一卷积神经网络,并利用所述第三图像的数据集训练第二卷积神经网络,得到场景识别模型,所述场景识别模型包括所述第一卷积神经网络和所述第二卷积神经网络。
可选地,获取模块1401,由收发模块代替。可选地,收发模块为收发器。其中,收发器具有发送和/或接收的功能。可选地,收发器由接收器和/或发射器代替。
可选地,收发模块为通信接口。可选地,通信接口是输入输出接口或者收发电路。输入输出接口包括输入接口和输出接口。收发电路包括输入接口电路和输出接口电路。
可选地,处理模块1402为处理器,处理器是通用处理器或者专用处理器等。可选地,处理器包括用于实现接收和发送功能的收发单元。例如该收发单元是收发电路,或者是接口,或者是接口电路。用于实现接收和发送功能的收发电路、接口或接口电路是分开的部署的,可选地,是集成在一起部署的。上述收发电路、接口或接口电路用于代码或数据的读写,或者,上述收发电路、接口或接口电路用于信号的传输或传递。
进一步的,获取模块1401用于执行上述图4对应的示例中的步骤S10,图7对应的示例中的步骤S20。处理模块1402用于执行上述图4对应的示例中的步骤S11-步骤S14,图7对应的示例中的步骤S20-步骤S25。
具体的,在一个可能的实现方式中,处理模块1402还具体用于:
将所述第一图像输入到图像识别模型,利用所述图像识别模型得到第一图像的第一分类结果及所述第一图像的热力图,所述热力图用于展示目标物体所在的区域,所述目标物体的图像特征是与场景识别无关的图像特征,所述第一分类结果指示的类别为非场景类别或错误的场景类别;
对所述第一图像中除了所述目标物体所在第一区域之外的第二区域进行掩膜处理,得到第二图像;
利用第二训练数据集对第一模型进行训练,得到所述物体检测模型,所述第二训练数据集包括多个样本数据,所述样本数据包括输入数据和输出数据,其中,输入数据为所述第二图像,输出数据为位置坐标,所述位置坐标用于指示所述目标物体所在的区域。
在一个可能的实现方式中,处理模块1402,还用于利用所述第二图像对生成式对抗网络GAN进行训练,得到所述图像生成模型。
在一个可能的实现方式中,处理模块1402还具体用于:
通过所述第一卷积神经网络的第一卷积层提取所述目标图像的图像特征,并通过所述第二卷积神经网络的第二卷积层提取所述第三图像的图像特征,并将所述第三图像的图像特征输出至所述第一卷积层,以与所述目标图像的图像特征进行融合;
通过所述第一卷积神经网络的输出层根据融合后的图像特征输出所述第一类别的标签。
在一个可能的设计中,处理模块1402的功能由一个处理装置实现,处理装置的功能部分或全部通过软件、硬件或其结合实现。因此,可以理解,以上各个模块可以是软件,硬件或二者结合实现。此时,处理装置包括存储器和处理器,其中,存储器用于存储计算机程序,处理器读取并执行存储器中存储的计算机程序,以执行上述方法实施例中的相应处理和/或步骤。处理器包括但不限于CPU、DSP、图像信号处理器、神经网络处理器(neural network processing unit,NPU)和微控制器中的一个或多个。
可选地,处理装置仅包括处理器。用于存储计算机程序的存储器位于处理装置之外,处理器通过电路/电线与存储器连接,以读取并执行存储器中存储的计算机程序。可选地,处理装置的功能部分或全部通过硬件实现。此时,处理装置包括输入接口电路,逻辑电路和输出接口电路。可选地,所述处理装置可以是一个或多个芯片,或一个或多个集成电路。
可选地,物体检测模型、图像生成模型、场景识别模型可以是神经网络模型,可嵌入、集成于或运行于神经网络处理器(NPU)。
请参阅图15所示,为便于理解,简要介绍神经网络处理器150。神经网络处理器150作为协处理器挂载到主处理器上,主处理器例如可以包括CPU,主处理器用于分配任务。神经网络处理器的核心部分为运算电路1503,通过控制器1504控制运算电路1503提取存储器中的矩阵数据并进行乘法运算。在一些实现中,运算电路1503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1503是二维脉动阵列。运算电路1503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1508中。
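The accumulation of partial matrix results mentioned here can be illustrated with a toy NumPy sketch that walks the K dimension in tiles and sums the partial products into the output "accumulator"; the tile size and function name are arbitrary and say nothing about the actual PE array.

```python
# Toy view of C = A x B with partial results accumulated tile by tile over K.
import numpy as np

def matmul_accumulate(A, B, tile_k=4):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))                                   # plays the role of the accumulator
    for k0 in range(0, K, tile_k):
        C += A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, :]   # add each partial product
    return C

A, B = np.random.rand(8, 16), np.random.rand(16, 8)
assert np.allclose(matmul_accumulate(A, B), A @ B)
```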
统一存储器1506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)1505被搬运到权重存储器1502中。输入数据也通过DMAC被搬运到统一存储器1506中。
总线接口单元(bus interface unit,BIU)1510,用于AXI总线与DMAC和取指存储器(instruction fetch buffer)1509的交互。
总线接口单元1510,用于取指存储器1509从外部存储器获取指令,还用于存储单元访问控制器1505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1506或将权重数据搬运到权重存储器1502中或将输入数据数据搬运到输入存储器1501中。
向量计算单元1507包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元能1507将经处理的输出的向量存储到统一缓存器1506。例如,向量计算单元1507可以将非线性函数应用到运算电路1503的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1507生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1503的激活输入,例如用于在神经网络中的后续层中的使用。控制器1504连接的取指存储器(instruction fetch buffer)1509,用于存储控制器1504使用的指令;统一存储器1506,输入存储器1501,权重存储器1502以及取指存储器1509均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
请参阅图16所示,本申请提供了一种电子设备1600,电子设备1600是上述方法实施例中的训练设备,用于执行上述方法实施例中训练设备的功能。本实施例中电子设备1600以服务器为例进行说明。
服务器包括一个或一个以上中央处理器(central processing units,CPU)1622(例如,一个或一个以上处理器)和存储器1632,一个或一个以上存储应用程序1642或数据1644的存储介质1630(例如一个或一个以上海量存储设备)。其中,存储器1632和存储介质1630是短暂存储或持久存储。存储在存储介质1630的程序包括一个或一个以上模块(图示没标出),每个模块包括对装置中的一系列指令操作。更进一步地,中央处理器1622设置为与存储介质1630通信,在服务器上执行存储介质1630中的一系列指令操作。
可选地,服务器还包括一个或一个以上电源1626,一个或一个以上有线或无线网络接口1650,一个或一个以上输入输出接口1658,和/或,一个或一个以上操作系统1641。
可选地,服务器还包括一个或一个以上电源1626,一个或一个以上有线或无线网络接口1650,一个或一个以上输入输出接口1658,和/或,一个或一个以上操作系统1641。
可选地,中央处理器1622包括上述图15所示的NPU。
另外,在一个可选的设计中,图14中的获取模块1401的功能由图16中的网络接口1650执行。图14中的处理模块1402的功能由图16中的中央处理器1622执行。
本申请还提供了场景识别方法所应用的场景识别装置。场景识别装置用于执行上述方法实施例中执行设备所执行的功能。场景识别装置可以是上述方法实施例中的执行设备,或者,场景识别装置也可以是执行设备中的处理器,或者,场景识别装置可以是执行设备中的芯片系统。请参阅图17所示,本申请提供了一种场景识别装置1700的一个实施例,场景识别装置1700包括获取模块1701和处理模块1702,可选地,场景识别装置还包括发送模块1703。
获取模块1701,用于获取待识别的第一场景图像;
处理模块1702,用于利用物体检测模型检测所述第一场景图像中与场景识别无关的物体所在的第一区域;
对所述第一区域进行掩膜处理,得到第二场景图像;
将所述第一场景图像输入到场景识别模型中的第一卷积神经网络,将所述第二场景图像输入到场景识别模型中的第二卷积神经网络,利用所述场景识别模型输出分类结果,其中,所述第一卷积神经网络是利用目标图像的数据集进行训练得到的,所述第二卷积神经网络是利用第三图像的数据集训练得到的,所述目标图像是由图像生成模型生成的多张样本物体图像分别替换到所述第三图像中的第一区域后得到的,所述第三图像是利用物体检测模型识别第一图像中与场景识别无关的第一区域后,对所述第一区域进行掩膜处理后得到的,所述第一图像是训练数据集中的图像。
可选地,物体检测模型、图像生成模型、场景识别模型可以是神经网络模型,可嵌入、集成于或运行于上述如上述图15所示的神经网络处理器(NPU)。
可选地,获取模块1701,由收发模块代替。可选地,收发模块为收发器。其中,收发器具有发送和/或接收的功能。可选地,收发器由接收器和/或发射器代替。
可选地,收发模块为通信接口。可选地,通信接口是输入输出接口或者收发电路。输入输出接口包括输入接口和输出接口。收发电路包括输入接口电路和输出接口电路。
可选地,处理模块1702为处理器,处理器是通用处理器或者专用处理器等。可选地,处理器包括用于实现接收和发送功能的收发单元。例如该收发单元是收发电路,或者是接口,或者是接口电路。用于实现接收和发送功能的收发电路、接口或接口电路是分开的部署的,可选地,是集成在一起部署的。上述收发电路、接口或接口电路用于代码或数据的读写,或者,上述收发电路、接口或接口电路用于信号的传输或传递。
在一个可能的设计中,处理模块1702的功能由一个处理装置实现,处理装置的功能部分或全部通过软件、硬件或其结合实现。因此,可以理解,以上各个模块可以是软件,硬件或二者结合实现。此时,处理装置包括存储器和处理器,其中,存储器用于存储计算机程序,处理器读取并执行存储器中存储的计算机程序,以执行上述方法实施例中的相应处理和/或步骤。处理器包括但不限于CPU、DSP、图像信号处理器、神经网络处理器(neural network processing unit,NPU)和微控制器中的一个或多个。
可选地,处理装置仅包括处理器。用于存储计算机程序的存储器位于处理装置之外,处理器通过电路/电线与存储器连接,以读取并执行存储器中存储的计算机程序。可选地,处理装置的功能部分或全部通过硬件实现。此时,处理装置包括输入接口电路,逻辑电路和输出接口电路。可选地,所述处理装置可以是一个或多个芯片,或一个或多个集成电路。
进一步的,获取模块1701用于执行上述方法实施例中图10对应的示例中的步骤S30。处理模块1702用于执行上述方法实施例中图10对应的示例中的步骤S31-步骤S33。可选地,当执行设备为终端设备时,处理模块1702还用于执行步骤S34A和步骤S34B。
具体的,在一个可选的实现方式中,处理模块1702还用于:通过所述第一卷积神经网络的第一卷积层提取所述第一场景图像的图像特征,并通过所述第二卷积神经网络的第二卷积层提取所述第二场景图像的图像特征,并将所述第二场景图像的图像特征输出至所述第一卷积层,以与所述第一场景图像的图像特征进行融合;通过所述第一卷积神经网络的输出层根据融合后的图像特征输出所述分类结果。
具体的,在一个可选的实现方式中,若所述分类结果指示第一场景,所述第一场景与所述耳机的第一降噪模式具有对应关系;
处理模块1702,还用于根据所述分类结果将所述耳机的降噪模式调整为所述第一降噪模式;
或者,
发送模块1703,用于向所述用户设备发送所述分类结果,所述分类结果用于触发所述用户设备将所述耳机的降噪模式调整为所述第一降噪模式。
在一个可选的实现方式中,若所述分类结果指示第一场景,所述第一场景与第一音量值具有对应关系;
处理模块1702,还用于根据所述分类结果将所述执行设备的系统音量调整为所述第一音量值;
或者,
发送模块1703,用于向所述用户设备发送所述分类结果,所述分类结果用于触发所述用户设备将所述用户设备的系统音量调整为所述第一音量值。
可选地,发送模块1703,由收发模块代替。可选地,收发模块为收发器。其中,收发器具有发送和/或接收的功能。可选地,收发器由接收器和/或发射器代替。
可选地,收发模块为通信接口。可选地,通信接口是输入输出接口或者收发电路。输入输出接口包括输入接口和输出接口。收发电路包括输入接口电路和输出接口电路。
在一个可选的实现方式中,所述获取模块1701,还用于接收用户设备发送的待识别的第一场景图像;或者,通过摄像头或图像传感器采集待识别的第一场景图像。
请参阅图18所示,本申请实施例还提供了另一种电子设备。该电子设备1800用于执行上述方法实施例中执行设备所执行的功能。本申请实施例中,电子设备以手机为例进行说明。电子设备1800包括处理器1801、存储器1802、输入单元1803、显示单元1804、摄像头1805、通信单元1806和音频电路1807等部件。存储器1802可用于存储软件程序以及模块,处理器1801通过运行存储在存储器1802的软件程序以及模块,从而执行装置的各种功能应用以及数据处理。存储器1802可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。处理器1801可以是如图17对应的实施例中提到的处理装置。可选地,处理器1801包括但不限于各类型的处理器,如之前提到的CPU、DSP、图像信号处理器、如15所示的神经网络处理器和微控制器中的一个或多个。
输入单元1803可用于接收输入的数字或字符信息,以及产生与装置的用户设置以及功能控制有关的键信号输入。具体地,输入单元1803可包括触控面板1831。触控面板1831,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1831上或在触控面板1831附近的操作)。
显示单元1804可用于显示各种图像信息。显示单元1804可包括显示面板1841,可选的,可以采用液晶显示器、有机发光二极管等形式来配置显示面板1841。在某些实施例中,可以将触控面板1831与显示面板1841集成而实现装置的输入和输出功能。
摄像头1805,用于采集待识别的场景图像,或者,用于采集场景图像,将采集的场景图像发送至数据库。
通信单元1806,用于建立通信信道,使电子设备通过通信信道以连接至远程服务器,并从所述远程服务器获取物体检测模型及场景识别模型。所述通信单元1806可以包括无线局域网模块、蓝牙模块、基带模块等通信模块,以及所述通信模块对应的射频(radio frequency,RF)电路,用于进行无线局域网络通信、蓝牙通信、红外线通信及/或蜂窝式通信系统通信。所述通信模块用于控制电子设备中的各组件的通信,并且可以支持直接内存存取。
可选地,所述通信单元1806中的各种通信模块一般以集成电路芯片的形式出现,并可进行选择性组合,而不必包括所有通信模块及对应的天线组。例如,所述通信单元1806可以仅包括基带芯片、射频芯片以及相应的天线以在一个蜂窝通信系统中提供通信功能。经由所述通信单元1806建立的无线通信连接,所述电子设备可以连接至蜂窝网或因特网。
音频电路1807、扬声器1808和传声器1809可提供用户与手机之间的音频接口。音频电路1807可将接收到的音频数据转换后的电信号,传输到扬声器1808,由扬声器1808转换为声音信号输出。传声器1809将收集的声音信号转换为电信号,由音频电路1807接收后转换为音频数据,再将音频数据输出处理器1801处理后,经通信单元1806以发送给比如另一手机,或者将音频数据输出至存储器1802以便进一步处理。
本申请实施例中,电子设备与外部耳机有线或无线连接(如通过蓝牙模块连接),通信单元1806用于向训练设备发送待识别的场景图像,并从服务器接收该场景图像的分类结果,处理器1801还用于根据分类结果调整耳机的降噪模式。或者,处理器1801还用于根据分类结果调整系统音量的音量值。
可选地,处理器1801用于对待识别的场景图像进行场景识别,得到分类结果。处理器1801根据该分类结果调整耳机的降噪模式。或者,处理器1801还用于根据分类结果调整系统音量的音量值。
本申请实施例提供了一种计算机可读介质,计算机可读存储介质用于存储计算机程序,当计算机程序在计算机上运行时,使得计算机执行上述方法实施例中训练设备所执行的方法;或者,当计算机程序在计算机上运行时,使得计算机执行上述方法实施例中执行设备所执行的方法。
本申请实施例提供了一种芯片,芯片包括处理器和通信接口,通信接口例如是输入/输出接口、管脚或电路等。处理器用于读取指令以执行上述方法实施例中训练设备所执行的方法;或者,处理器用于读取指令以执行上述方法实施例中执行设备所执行的方法。
本申请实施例提供了一种计算机程序产品,该计算机程序产品被计算机执行时实现上述方法实施例中训练设备所执行的方法;或者,该计算机程序产品被计算机执行时实现上述方法实施例中执行设备所执行的方法。
其中,可选地,上述任一处提到的处理器,是一个通用中央处理器(CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC)。
所属领域的技术人员能够清楚地了解到,为描述的方便和简洁,上述描述的系统,装 置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (12)

  1. 一种模型训练方法,其特征在于,应用于训练设备,包括:
    获取第一训练数据集,所述第一训练数据集中包括多张第一图像;
    利用物体检测模型识别所述第一图像中的第一区域,所述第一区域是与场景识别无关的图像区域;
    对所述第一区域进行掩膜处理,得到第三图像;
    获取图像生成模型生成的多张样本物体图像,所述样本物体图像是与场景识别无关的物体的图像;
    将所述多张样本物体图像分别替换到所述第三图像中掩膜覆盖的第一区域,得到多张目标图像;
    利用所述目标图像的数据集训练第一卷积神经网络,并利用所述第三图像的数据集训练第二卷积神经网络,得到场景识别模型,所述场景识别模型包括所述第一卷积神经网络和所述第二卷积神经网络。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    将所述第一图像输入到图像识别模型,利用所述图像识别模型得到所述第一图像的第一分类结果及所述第一图像的热力图,所述热力图用于展示目标物体所在的区域,所述目标物体的图像特征是与场景识别无关的图像特征,所述第一分类结果指示的类别为非场景类别或错误的场景类别;
    对所述第一图像中除了所述目标物体所在第一区域之外的第二区域进行掩膜处理,得到第二图像;
    利用第二训练数据集对第一模型进行训练,得到所述物体检测模型,所述第二训练数据集包括多个样本数据,所述样本数据包括输入数据和输出数据,其中,所述输入数据为所述第二图像,所述输出数据为位置坐标,所述位置坐标用于指示所述目标物体所在的区域。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    利用所述第二图像对生成式对抗网络GAN进行训练,得到所述图像生成模型。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述目标图像和所述第三图像均对应第一类别的标签,所述利用所述目标图像的数据集训练第一卷积神经网络,并利用所述第三图像的数据集训练所述第二卷积神经网络,包括:
    通过所述第一卷积神经网络的第一卷积层提取所述目标图像的图像特征,并通过所述第二卷积神经网络的第二卷积层提取所述第三图像的图像特征,并将所述第三图像的图像特征输出至所述第一卷积层,以与所述目标图像的图像特征进行融合;
    通过所述第一卷积神经网络的输出层根据融合后的图像特征输出所述第一类别的标签。
  5. 一种场景识别方法,其特征在于,应用于执行设备,包括:
    获取待识别的第一场景图像;
    利用物体检测模型检测所述第一场景图像中与场景识别无关的物体所在的第一区域;
    对所述第一区域进行掩膜处理,得到第二场景图像;
    将所述第一场景图像输入到场景识别模型中的第一卷积神经网络,将所述第二场景图像输入到所述场景识别模型中的第二卷积神经网络,利用所述场景识别模型输出分类结果,其中,所述第一卷积神经网络是利用目标图像的数据集进行训练得到的,所述第二卷积神经网络是利用第三图像的数据集训练得到的,所述目标图像是由图像生成模型生成的多张样本物体图像分别替换到所述第三图像中的第一区域后得到的,所述第三图像是利用物体检测模型识别第一图像中与场景识别无关的第一区域后,对所述第一区域进行掩膜处理后得到的,所述第一图像是训练数据集中的图像。
  6. 根据权利要求5所述的方法,其特征在于,所述将所述第一场景图像输入到场景识别模型中的第一卷积神经网络,将所述第二场景图像输入到场景识别模型中的第二卷积神经网络,利用所述场景识别模型输出分类结果,包括:
    通过所述第一卷积神经网络的第一卷积层提取所述第一场景图像的图像特征,并通过所述第二卷积神经网络的第二卷积层提取所述第二场景图像的图像特征,并将所述第二场景图像的图像特征输出至所述第一卷积层,以与所述第一场景图像的图像特征进行融合;
    通过所述第一卷积神经网络的输出层根据融合后的图像特征输出所述分类结果。
  7. 根据权利要求5或6所述的方法,其特征在于,若所述分类结果指示第一场景,所述第一场景与耳机的第一降噪模式具有对应关系;
    所述执行设备是终端设备,所述执行设备与所述耳机连接,所述方法还包括:
    根据所述分类结果将所述耳机的降噪模式调整为所述第一降噪模式;
    或者,
    所述执行设备是服务器,用户设备与所述耳机连接,所述方法还包括:
    向所述用户设备发送所述分类结果,所述分类结果用于触发所述用户设备将所述耳机的降噪模式调整为所述第一降噪模式。
  8. 根据权利要求5或6所述的方法,其特征在于,若所述分类结果指示第一场景,所述第一场景与第一音量值具有对应关系;
    所述执行设备是终端设备,所述方法还包括:
    根据所述分类结果将所述执行设备的系统音量调整为所述第一音量值;
    或者,
    所述执行设备是服务器,所述方法还包括:
    向用户设备发送所述分类结果,所述分类结果用于触发所述用户设备将所述用户设备的系统音量调整为所述第一音量值。
  9. 根据权利要求5-8中任一项所述的方法,其特征在于,所述获取待识别的第一场景图像,包括:
    接收用户设备发送的待识别的第一场景图像;
    或者,
    通过摄像头或图像传感器采集待识别的第一场景图像。
  10. 一种电子设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述电子设备执行 如权利要求1至4中任一项所述的方法;或者,当所述程序或指令被所述处理器执行时,使得所述电子设备执行如权利要求5至9中任一项所述的方法。
  11. 一种计算机程序产品,所述计算机程序产品中包括计算机程序代码,其特征在于,当所述计算机程序代码被计算机执行时,使得计算机实现上述如权利要求1至4中任一项所述的方法;或者,当所述计算机程序代码被计算机执行时,使得计算机实现上述如权利要求5至9中任一项所述的方法。
  12. 一种计算机可读存储介质,其特征在于,用于储存计算机程序或指令,所述计算机程序或指令被执行时使得计算机执行如权利要求1至4中任一项所述的方法;或者,所述计算机程序或指令被执行时使得计算机执行如权利要求5至9中任一项所述的方法。
PCT/CN2022/081883 2021-03-22 2022-03-21 一种模型训练方法、场景识别方法及相关设备 WO2022199500A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18/551,258 US20240169687A1 (en) 2021-03-22 2022-03-21 Model training method, scene recognition method, and related device
EP22774161.8A EP4287068A1 (en) 2021-03-22 2022-03-21 Model training method, scene recognition method, and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110301843.5A CN115187824A (zh) 2021-03-22 2021-03-22 一种模型训练方法、场景识别方法及相关设备
CN202110301843.5 2021-03-22

Publications (1)

Publication Number Publication Date
WO2022199500A1 true WO2022199500A1 (zh) 2022-09-29

Family

ID=83396119

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081883 WO2022199500A1 (zh) 2021-03-22 2022-03-21 一种模型训练方法、场景识别方法及相关设备

Country Status (4)

Country Link
US (1) US20240169687A1 (zh)
EP (1) EP4287068A1 (zh)
CN (1) CN115187824A (zh)
WO (1) WO2022199500A1 (zh)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004062605A (ja) * 2002-07-30 2004-02-26 Fuji Photo Film Co Ltd シーン識別方法および装置ならびにプログラム
CN108038491A (zh) * 2017-11-16 2018-05-15 深圳市华尊科技股份有限公司 一种图像分类方法及装置
CN112204565A (zh) * 2018-02-15 2021-01-08 得麦股份有限公司 用于基于视觉背景无关语法模型推断场景的系统和方法
CN108961302A (zh) * 2018-07-16 2018-12-07 Oppo广东移动通信有限公司 图像处理方法、装置、移动终端及计算机可读存储介质
CN109727264A (zh) * 2019-01-10 2019-05-07 南京旷云科技有限公司 图像生成方法、神经网络的训练方法、装置和电子设备
CN112446398A (zh) * 2019-09-02 2021-03-05 华为技术有限公司 图像分类方法以及装置
CN112348117A (zh) * 2020-11-30 2021-02-09 腾讯科技(深圳)有限公司 场景识别方法、装置、计算机设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JI ZHONG; WANG JING; SU YUTING; SONG ZHANJIE; XING SHIKAI: "Balance between object and background: Object-enhanced features for scene image classification", NEUROCOMPUTING, ELSEVIER, AMSTERDAM, NL, vol. 120, 1 January 1900 (1900-01-01), AMSTERDAM, NL , pages 15 - 23, XP028696899, ISSN: 0925-2312, DOI: 10.1016/j.neucom.2012.02.054 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024088031A1 (zh) * 2022-10-27 2024-05-02 华为云计算技术有限公司 一种数据采集方法、装置及相关设备
CN116110222A (zh) * 2022-11-29 2023-05-12 东风商用车有限公司 基于大数据的车辆应用场景分析方法
CN116128458A (zh) * 2023-04-12 2023-05-16 华中科技大学同济医学院附属同济医院 用于医院经费卡报账的智能自动审核系统
CN116128458B (zh) * 2023-04-12 2024-02-20 华中科技大学同济医学院附属同济医院 用于医院经费卡报账的智能自动审核系统

Also Published As

Publication number Publication date
EP4287068A1 (en) 2023-12-06
CN115187824A (zh) 2022-10-14
US20240169687A1 (en) 2024-05-23

Similar Documents

Publication Publication Date Title
WO2022199500A1 (zh) 一种模型训练方法、场景识别方法及相关设备
CN111985265B (zh) 图像处理方法和装置
CN105654952B (zh) 用于输出语音的电子设备、服务器和方法
CN110544488B (zh) 一种多人语音的分离方法和装置
CN109299315B (zh) 多媒体资源分类方法、装置、计算机设备及存储介质
CN104408402B (zh) 人脸识别方法及装置
WO2019024717A1 (zh) 防伪处理方法及相关产品
CN110147467A (zh) 一种文本描述的生成方法、装置、移动终端及存储介质
CN107705251A (zh) 图片拼接方法、移动终端及计算机可读存储介质
EP4191579A1 (en) Electronic device and speech recognition method therefor, and medium
CN111009031B (zh) 一种人脸模型生成的方法、模型生成的方法及装置
JPWO2018230160A1 (ja) 情報処理システム、情報処理方法、およびプログラム
CN105635452A (zh) 移动终端及其联系人标识方法
CN114816610B (zh) 一种页面分类方法、页面分类装置和终端设备
CN112446832A (zh) 一种图像处理方法及电子设备
CN107704514A (zh) 一种照片管理方法、装置及计算机可读存储介质
CN114067776A (zh) 电子设备及其音频降噪方法和介质
WO2022022585A1 (zh) 电子设备及其音频降噪方法和介质
CN110544287A (zh) 一种配图处理方法及电子设备
CN113742460B (zh) 生成虚拟角色的方法及装置
CN114943976B (zh) 模型生成的方法、装置、电子设备和存储介质
CN113343709B (zh) 意图识别模型的训练方法、意图识别方法、装置及设备
CN114822543A (zh) 唇语识别方法、样本标注方法、模型训练方法及装置、设备、存储介质
CN111310701B (zh) 手势识别方法、装置、设备及存储介质
WO2021129444A1 (zh) 文件聚类方法及装置、存储介质和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22774161

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022774161

Country of ref document: EP

Effective date: 20230829

WWE Wipo information: entry into national phase

Ref document number: 18551258

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE