WO2022002242A1 - Scene recognition method and system, and electronic device and medium - Google Patents


Info

Publication number
WO2022002242A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
network
training
recognition
trained
Prior art date
Application number
PCT/CN2021/104224
Other languages
French (fr)
Chinese (zh)
Inventor
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010633911.3A external-priority patent/CN111797763A/en
Priority claimed from CN202010633894.3A external-priority patent/CN111797762A/en
Application filed by 北京灵汐科技有限公司
Publication of WO2022002242A1 publication Critical patent/WO2022002242A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the field of identification technologies, and in particular, to a scene identification method and system, an electronic device, and a computer-readable medium.
  • Neural network refers to a mathematical model that applies a structure similar to the synaptic connections of the brain for information processing. Neural networks can be used to recognize scenes.
  • the present application provides a scene identification method and system, an electronic device, and a computer-readable medium, which can realize accurate identification of various scenes.
  • an embodiment of the present application provides a scene recognition method, including: extracting features of scene data to be recognized; inputting the extracted features into a scene recognition network for recognition, and obtaining multiple scene recognition results corresponding to different scenes.
  • an embodiment of the present application provides a scene recognition system, including: a backbone network configured to extract features of scene data to be recognized; and a scene recognition network device including a scene recognition network, the scene recognition network device being configured to input the extracted features into the scene recognition network and obtain multiple scene recognition results corresponding to different scenes.
  • an electronic device, which includes:
  • one or more processors;
  • a memory on which one or more programs are stored, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the scene recognition method of any one of the embodiments of the present application; and
  • one or more I/O interfaces, connected between the processors and the memory, configured to realize information interaction between the processors and the memory.
  • embodiments of the present application provide a computer-readable medium on which a computer program is stored, and the program, when executed by a processor, implements the scene recognition method of any one of the embodiments of the present application.
  • The scene recognition method and system, electronic device and computer-readable medium proposed in the present application input the features extracted from scene data into a scene recognition network that can determine, for each of a variety of scenes, whether the scene data belongs to that scene, obtaining multiple scene recognition results corresponding to different scenes. Compared with the related art, which can only obtain the similarity between scene data and each scene, the identification result of the solution of the present application is more accurate.
  • FIG. 1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a backbone network 110 and a scene recognition network device 120 .
  • the backbone network 110 is configured to extract features of the scene data to be identified.
  • the backbone network is responsible for feature extraction of scene data.
  • the scene data includes at least one of scene video data, scene picture data and scene text data.
  • the backbone network is a deep neural network pre-trained with text, and the scene data obtains vectors representing text features through the backbone network.
  • the backbone network is a deep neural network pre-trained with an image network (ImageNet), and the scene data obtains a vector representing image features through the backbone network.
  • the backbone network is the front part of a multi-layer deep neural network from which the last few fully connected layers have been removed.
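  • A minimal sketch of such a truncated backbone (the layers below are hypothetical stand-ins for a pre-trained network's layers, not the actual backbone of the disclosure):

```python
def make_backbone(layers, n_fc_to_drop):
    """Build a feature extractor by keeping only the front part of a
    multi-layer network, dropping its last few fully connected layers."""
    feature_layers = layers[:len(layers) - n_fc_to_drop]

    def backbone(x):
        for layer in feature_layers:
            x = layer(x)
        return x

    return backbone

# Hypothetical 4-layer network: two feature layers followed by two
# fully connected classifier layers (the part to be removed).
layers = [
    lambda x: [v * 2 for v in x],   # feature layer 1
    lambda x: [v + 1 for v in x],   # feature layer 2
    lambda x: [sum(x)],             # fully connected layer 1 (dropped)
    lambda x: [x[0] * 10],          # fully connected layer 2 (dropped)
]
backbone = make_backbone(layers, n_fc_to_drop=2)
features = backbone([1.0, 2.0])     # a feature vector, not class scores
```

The backbone's output is thus a vector of features that downstream scene networks consume, rather than a classification.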
  • scene data is collected by a collection device such as a camera or a microphone, and the collected scene data is stored in the memory.
  • the scene identification network device 120 includes a scene identification network, and the scene identification network device 120 is configured to input the extracted features into the scene identification network to obtain multiple scene identification results corresponding to different scenes.
  • the scene recognition network can identify whether the scene data is a corresponding scene for a variety of scenes.
  • the features of the extracted scene data are input into the scene recognition network for identification, and multiple scene identification results corresponding to different scenes are obtained, and the scene identification results can represent whether the scene is a corresponding scene.
  • the identification result of the solution of this embodiment has higher accuracy.
  • FIG. 2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a backbone network 210 and a scene recognition network device 220 .
  • the scene identification network device 220 includes a scene identification network, and the scene identification network includes a plurality of scene networks corresponding to different scenes, for example, scene network 1, scene network 2, and scene network 3 in FIG. 2 .
  • the scene recognition network device 220 is configured to pass the extracted features through each scene network in parallel and obtain, respectively, the scene recognition result corresponding to each scene network.
  • Each scene network can be a single fully connected layer or a multi-layer perceptron (MLP), and each scene network is called a head. Multiple heads can exist in parallel without affecting each other, and new heads can be added. Each head outputs a binary classification result, that is, whether the scene data corresponds to the scene of that scene network.
  • a scene recognition network composed of multiple scene networks can also be called a multi-head network.
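  • A minimal sketch of such a multi-head network (the weights, feature values and scene names below are hypothetical, and each head is reduced to a single fully connected layer with a sigmoid):

```python
import math

def head(x, w, b, threshold=0.5):
    """One 'head': a single fully connected layer plus sigmoid,
    thresholded into a binary is-this-scene decision."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p >= threshold else 0

def multi_head(features, heads):
    """Pass the shared backbone features through every head in
    parallel; each head answers independently for its own scene."""
    return {name: head(features, w, b) for name, (w, b) in heads.items()}

# Hypothetical trained weights for three scene heads.
heads = {
    "scene_1": ([2.0, -1.0], 0.5),
    "scene_2": ([-1.5, 0.5], -1.0),
    "scene_3": ([-2.0, -2.0], 0.0),
}
features = [1.0, 0.2]               # features extracted by the backbone
results = multi_head(features, heads)
# results marks scene_1 as a match and the other heads as non-matches
```

Because the heads do not share weights, adding a new head or retraining one head leaves the others untouched.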
  • In the related art, the scene recognition result output by the neural network is the degree of similarity between the scene data and each scene, rather than a definite answer as to which scene the data belongs to. For example, the similarity between scene data N and scene A is 40%, the similarity between scene data N and scene B is 30%, and the similarity between scene data N and scene C is 30%, so the recognition accuracy is poor.
  • the extracted features are passed through different scene networks in parallel, and the scene recognition results corresponding to each scene network are obtained respectively.
  • scene network 1 outputs recognition result 1, indicating that scene data N is similar to the scene corresponding to scene network 1
  • the scene network 2 outputs the recognition result 0, indicating that the scene data N is not similar to the scene corresponding to the scene network 2
  • the scene network 3 outputs the recognition result 0, indicating that the scene data N and the scene corresponding to the scene network 3 are not similar. Therefore, it is clear that the scene data N is the scene data corresponding to the scene network 1, and the recognition result is more accurate.
  • FIG. 3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a backbone network 310 and a scene recognition network device 320 .
  • the scene identification network device 320 includes a scene identification network, the scene identification network includes an attention network, the attention network includes subnets corresponding to different scene identifiers, and multiple scene identifiers respectively correspond to different scenes.
  • the scene recognition network device 320 is configured to pass the extracted features through the subnets corresponding to the respective scene identifiers to obtain the scene recognition result corresponding to each scene identifier.
  • the attention network is a kind of gated network.
  • some neural network nodes are connected, and the connected neural network nodes form a sub-network.
  • the form of attention input can be one-hot encoding or activity value.
  • the form of attention input is one-hot encoding
  • the scene identifier of scene A is [1,0]
  • the corresponding gated branch A is turned on (subnet A works) and gated branch B is turned off. At this time, in the attention network, the neurons controlled by gated branch A are in the working state, and the neurons controlled by gated branch B are inhibited (they produce no output regardless of the input).
  • the scene identifier of scene B is [0,1]; the corresponding gated branch B is turned on (subnet B works) and gated branch A is turned off. At this time, the neurons controlled by gated branch B in the attention network are in the working state, and the neurons controlled by gated branch A are inhibited (they produce no output regardless of the input).
  • Alternatively, the gated input is a set of values, each value controlling the activation activity of one gated branch. For example, if the activity of gated branch A is 0.2 and the activity of gated branch B is 0.8, that is, the gated input is [0.2, 0.8], the corresponding gated branch B is turned on (subnet B works) and gated branch A is turned off.
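  • A minimal sketch of this gating behavior (the subnets below are hypothetical one-line detectors; both a one-hot identifier and an activity-value gated input are shown):

```python
def gated_forward(features, gate_input, subnets):
    """Run only the subnet whose gated branch is opened by the gate
    input; inhibited branches produce no output regardless of input."""
    # With one-hot input the '1' wins; with activity values the most
    # active branch wins.
    active = max(range(len(gate_input)), key=lambda i: gate_input[i])
    return {
        name: (subnet(features) if i == active else None)
        for i, (name, subnet) in enumerate(subnets)
    }

# Hypothetical subnets for scenes A and B (trivial detectors).
subnets = [
    ("A", lambda x: 1 if sum(x) > 0 else 0),
    ("B", lambda x: 1 if sum(x) < 0 else 0),
]

out_onehot = gated_forward([0.3, 0.4], [1, 0], subnets)        # opens branch A
out_activity = gated_forward([0.3, 0.4], [0.2, 0.8], subnets)  # opens branch B
```

In both cases exactly one branch is open, and the inhibited branch yields no output at all rather than a low score.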
  • In the related art, the scene recognition result output by the neural network is the degree of similarity to each scene, rather than a definite answer as to which scene the data belongs to; for example, the similarity to scene A is 40%, the similarity to scene B is 30%, and the similarity to scene C is 30%, so the recognition accuracy is poor.
  • In this embodiment, the extracted features pass through the subnets corresponding to the different scene identifiers, and the scene recognition result corresponding to each scene identifier is obtained.
  • subnet A outputs recognition result 1, indicating that scene data N is similar to scene A
  • subnet B outputs a recognition result of 0, indicating that scene data N is not similar to scene B
  • subnet C outputs a recognition result of 0, indicating that scene data N is not similar to scene C. It is therefore clear that scene data N is the scene data corresponding to subnet A, and the accuracy of the identification result is higher.
  • FIG. 4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a positive sample device 410, a backbone network 420 and a scene recognition network device 430.
  • the positive sample device 410 is configured to output scene data to be identified to the backbone network.
  • the positive sample device collects data of the current scene and obtains text data, image data or video data, that is, the scene data to be identified.
  • the backbone network 420 is configured to extract features of the scene data to be identified.
  • the scene identification network device 430 includes a scene identification network, and the scene identification network device 430 is configured to input the extracted features into the scene identification network to obtain multiple scene identification results corresponding to different scenes.
  • the features of the extracted scene data are input into the scene recognition network for identification, and multiple scene identification results corresponding to different scenes are obtained, and the scene identification results can represent whether the scene is a corresponding scene.
  • the identification result of the solution of this embodiment has higher accuracy.
  • FIG. 5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a positive sample device 510 , a negative sample generator 520 , a scene identification device 530 , a backbone network 540 and a scene identification network device 550 .
  • the positive sample device 510 is configured to output training positive samples to the backbone network.
  • the negative sample generator 520 is configured to output training negative samples to the backbone network.
  • the training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene.
  • scene data refers to collected scene data directly stored in a storage space (eg, memory)
  • scene files are an ordered collection of scene data. For example, read the data of 128 sectors from 0 to 127 in the memory, or read the first 128 bytes of the tellme.txt file in the X directory in the memory.
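  • A minimal sketch of reading a scene file as an ordered collection of scene data (the file and its contents here are hypothetical stand-ins for the tellme.txt example):

```python
import os
import tempfile

def read_scene_file(path, n_bytes=128):
    """A scene file is an ordered collection of scene data; read its
    first n_bytes, mirroring the 'first 128 bytes' example."""
    with open(path, "rb") as f:
        return f.read(n_bytes)

# Create a hypothetical scene file holding 200 bytes of scene data.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(bytes(range(200)))
chunk = read_scene_file(path, 128)   # only the first 128 bytes, in order
os.remove(path)
```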
  • the scene identification device 530 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network.
  • the target scene identification is set to identify the selected scene, and the selected scene is the target scene.
  • the backbone network 540 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.
  • the scene recognition network device 550 is configured to train the scene recognition network to be trained for the target scene according to the training feature to obtain the scene recognition network.
  • The target scene may be a newly added scene: the scene recognition network to be trained is trained for the newly added scene, obtaining a scene recognition network that can identify both whether the scene data is the newly added scene and whether the scene data is an original scene. The target scene may also be an existing scene: the scene recognition network to be trained is trained for the existing scene, so that the recognition function for the existing scene is updated while the recognition functions for the other scenes remain unchanged. Training the scene recognition network with the solution of this embodiment is more convenient and quicker.
  • FIG. 6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a positive sample device 610 , a negative sample generator 620 , a scene identification device 630 , a backbone network 640 and a scene identification network device 650 .
  • the positive sample device 610 is configured to output training positive samples to the backbone network.
  • the negative sample generator 620 is configured to output training negative samples to the backbone network.
  • the training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene.
  • scene data refers to collected scene data directly stored in a storage space (eg, memory)
  • scene files are an ordered collection of scene data. For example, read the data of 128 sectors from 0 to 127 in the memory, or read the first 128 bytes of the tellme.txt file in the X directory in the memory.
  • the scene identification device 630 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network.
  • the target scene identification is set to identify the selected scene, and the selected scene is the target scene.
  • the backbone network 640 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.
  • the scene recognition network device 650 includes a plurality of scene networks respectively corresponding to different scenes and a new scene network, and the target scene identifier of the target scene corresponds to the new scene network. The scene recognition network device 650 is configured to pass the training features of the training positive samples and the training negative samples through the new scene network to obtain a training recognition result corresponding to the new scene network, and to determine the weights of the new scene network according to the training recognition result, the labels of the training positive samples and the labels of the training negative samples, thereby obtaining a trained scene network.
  • Alternatively, the scene recognition network device 650 includes a plurality of scene networks respectively corresponding to different scenes, and the target scene identifier of the target scene corresponds to an existing scene network in the scene recognition network device. The scene recognition network device 650 is configured to pass the training features through the existing scene network to obtain a training recognition result corresponding to the existing scene network, and to update the weights of the existing scene network according to the training recognition result, the labels of the training positive samples and the labels of the training negative samples, thereby obtaining an updated scene network.
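  • A minimal sketch of training a single head on positive and negative samples while leaving every other head untouched (the feature vectors, labels and hyperparameters are hypothetical, and logistic-regression-style updates stand in for whatever training rule the device actually uses):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, lr=0.5, epochs=200, seed=0):
    """Train ONE head (single fully connected layer + sigmoid) on
    positive/negative samples for its scene; the weights of all
    other heads are simply never touched."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in samples[0]]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                           # gradient of BCE w.r.t. logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Hypothetical training features: positives from the target scene,
# negatives drawn from the other scenes.
X = [[1.0, 0.9], [0.8, 1.1], [-1.0, -0.8], [-0.9, -1.2]]
y = [1, 1, 0, 0]
w, b = train_head(X, y)

def predict(x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
```

Adding a new scene therefore means training one new head on its samples, not retraining the whole network.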
  • the multi-head network device may be instructed to identify scene data, train a new scene network, or update an existing scene network by means of a key trigger, a button trigger, or a sent instruction.
  • In the related art, when a new scene recognition function needs to be added, the neural network is retrained according to the samples corresponding to the original scene recognition functions and the samples corresponding to the new scene recognition function.
  • For example, the original neural network can recognize scene A but cannot recognize scene B.
  • the neural network is retrained according to the samples of scene A and scene B, so that the similarity between scene data and scene A and scene B can be identified, for example, the similarity between scene data and scene A is 30%, and the similarity with scene B is 60%.
  • In the related art, when a scene recognition function needs to be updated, the neural network is retrained according to the samples corresponding to the scene recognition function that needs to be updated and the samples corresponding to the other scene recognition functions that do not need to be updated.
  • For example, the original neural network can recognize scene A and scene B. If the ability to recognize scene B needs to be updated, the neural network is retrained according to the samples of scene A and the samples of the updated scene B.
  • When the scene recognition network device needs to update a scene recognition function, there is no need to retrain the entire scene recognition network; only the scene network that needs to be updated is retrained, which makes updating convenient and quick.
  • FIG. 7 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application.
  • the system includes, but is not limited to, a positive sample device 710 , a negative sample generator 720 , a scene identification device 730 , a backbone network 740 and a scene identification network device 750 .
  • the scene identification network device 750 includes a scene identification network, the scene identification network includes an attention network, the attention network includes subnetworks corresponding to different scene identifiers, and multiple scene identifiers correspond to different scenes respectively.
  • the positive sample device 710 is configured to output training positive samples to the backbone network.
  • the negative sample generator 720 is configured to output training negative samples to the backbone network.
  • the training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene.
  • scene data refers to collected scene data directly stored in a storage space (eg, memory)
  • scene files are an ordered collection of scene data. For example, read the data of 128 sectors from 0 to 127 in the memory, or read the first 128 bytes of the tellme.txt file in the X directory in the memory.
  • the scene identification device 730 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network.
  • the target scene identification is set to identify the selected scene, and the selected scene is the target scene.
  • the backbone network 740 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.
  • the scene recognition network device 750 is configured to input the training features and the target scene identifier into the attention network to be trained, obtaining a training recognition result of the attention network to be trained corresponding to the target scene identifier, and to determine, according to the training recognition result, the labels of the training positive samples and the labels of the training negative samples, the weights of the attention network to be trained corresponding to the target scene identifier, thereby obtaining a trained attention network corresponding to the target scene identifier.
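  • A minimal sketch of this gated training (each subnet is reduced to a single hypothetical weight, and a perceptron-style update stands in for the actual training rule; only the branch opened by the target scene identifier is updated):

```python
def train_gated_subnet(weights, scene_id, samples, labels, lr=0.1):
    """Update ONLY the subnet whose gated branch is opened by the
    target scene identifier; inhibited branches keep their weights."""
    active = max(range(len(scene_id)), key=lambda i: scene_id[i])
    w = weights[active]
    for x, y in zip(samples, labels):
        pred = 1 if w * x >= 0 else 0       # toy one-weight subnet
        w += lr * (y - pred) * x            # perceptron-style update
    updated = list(weights)
    updated[active] = w
    return updated

# Target scene B (identifier [0, 1]) is being trained/updated;
# branch A's weight must come back unchanged.
old_weights = [0.7, -0.2]
new_weights = train_gated_subnet(old_weights, [0, 1], [1.0, -1.0], [1, 0])
```

This is why training a new subnet or updating an existing one leaves the recognition functions of the other scenes constant.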
  • The subnet corresponding to the target scene identifier acquired by the scene identification device may be a new subnet, in which case the training process is the training of a new subnet (a new scene); it may also be an existing subnet, in which case the training process is the updating of an existing subnet (an existing scene).
  • the attention network can be instructed to recognize scene data, train a new scene network, or update an existing scene network by means of a key trigger, a button trigger, or a sent instruction.
  • In the related art, when a new scene recognition function needs to be added, the neural network is retrained according to the samples corresponding to the original scene recognition functions and the samples corresponding to the new scene recognition function.
  • For example, the original neural network can recognize scene A but cannot recognize scene B.
  • the neural network is retrained according to the samples of scene A and scene B, so that the similarity between scene data and scene A and scene B can be identified, for example, the similarity between scene data and scene A is 30%, and the similarity with scene B is 60%.
  • In the related art, when a scene recognition function needs to be updated, the neural network is retrained according to the samples corresponding to the scene recognition function that needs to be updated and the samples corresponding to the other scene recognition functions that do not need to be updated.
  • For example, the original neural network can recognize scene A and scene B. If the ability to recognize scene B needs to be updated, the neural network is retrained according to the samples of scene A and the samples of the updated scene B.
  • FIG. 8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The method includes but is not limited to step S110 and step S120.
  • Step S110 extracting features of the scene data to be identified.
  • the scene data includes at least one of scene video data, scene picture data and scene text data.
  • the size of the scene data to be recognized can be 64*64*3.
  • scene data with a higher resolution can be reduced in dimension to the size of 64*64*3, and is clearer after such processing.
  • Step S120 Input the extracted features into a scene recognition network for recognition, and obtain multiple scene recognition results corresponding to different scenes.
  • the features of the extracted scene data are input into the scene recognition network for identification, and multiple scene identification results corresponding to different scenes are obtained, and the scene identification results can represent whether the scene is a corresponding scene.
  • the identification result of the solution of this embodiment has higher accuracy.
  • FIG. 9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • the scene recognition network includes multiple scene networks respectively corresponding to different scenes.
  • the method includes but is not limited to step S210 and step S220.
  • Step S210 extracting features of the scene data to be identified.
  • the scene data includes at least one of scene video data, scene picture data and scene text data.
  • the size of the scene data to be recognized can be 64*64*3.
  • scene data with a higher resolution can be reduced in dimension to the size of 64*64*3, and is clearer after such processing.
  • Step S220 Pass the extracted features through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.
  • In the related art, the scene recognition result output by the neural network is the degree of similarity to each scene, rather than a definite answer as to which scene the data belongs to; for example, the similarity to scene A is 40%, the similarity to scene B is 30%, and the similarity to scene C is 30%, so the recognition accuracy is poor.
  • the extracted features are passed through different scene networks in parallel, and the scene recognition results corresponding to each scene network are obtained respectively.
  • scene network 1 outputs recognition result 1, indicating that scene data N is similar to the scene corresponding to scene network 1
  • the scene network 2 outputs the recognition result 0, indicating that the scene data N is not similar to the scene corresponding to the scene network 2
  • the scene network 3 outputs the recognition result 0, indicating that the scene data N and the scene corresponding to the scene network 3 are not similar. Therefore, it is clear that the scene data N is the scene data corresponding to the scene network 1, and the recognition result is more accurate.
  • FIG. 10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • the scene recognition network includes an attention network, the attention network includes subnets corresponding to a plurality of scene identifiers, and the scene identifiers respectively correspond to different scenes.
  • the method includes but is not limited to step S310 and step S320.
  • Step S310 extracting features of the scene data to be identified.
  • the scene data includes at least one of scene video data, scene picture data and scene text data.
  • the size of the scene data to be recognized can be 64*64*3.
  • scene data with a higher resolution can be reduced in dimension to the size of 64*64*3, and is clearer after such processing.
  • Step S320 traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.
  • In the related art, the scene recognition result output by the neural network is the degree of similarity to each scene, rather than a definite answer as to which scene the data belongs to; for example, the similarity to scene A is 40%, the similarity to scene B is 30%, and the similarity to scene C is 30%, so the recognition accuracy is poor.
  • In this embodiment, the extracted features pass through the subnets corresponding to the different scene identifiers, and the scene recognition result corresponding to each scene identifier is obtained.
  • subnet A outputs recognition result 1, indicating that scene data N is similar to scene A
  • subnet B outputs a recognition result of 0, indicating that scene data N is not similar to scene B
  • subnet C outputs a recognition result of 0, indicating that scene data N is not similar to scene C. It is therefore clear that scene data N is the scene data corresponding to subnet A, and the accuracy of the identification result is higher.
  • FIG. 11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • the method includes but is not limited to step S410, step S420, step S430, and step S440.
  • Step S410: extract training features according to the target scene identifier of the target scene, where the training features include features of training positive samples and features of training negative samples.
  • The training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene.
  • Step S420: according to the training features, train the scene recognition network to be trained for the target scene to obtain the scene recognition network.
  • Step S430: extract features of the scene data to be recognized.
  • The scene data includes at least one of scene video data, scene picture data, and scene text data.
  • The size of the scene data to be recognized may be, for example, 64*64*3.
  • Scene data of size 64*64*3 retains relatively high resolution while reducing dimensionality, making the processed result clearer.
  • Step S440: input the extracted features into the scene recognition network for recognition, and obtain multiple scene recognition results respectively corresponding to different scenes.
  • The target scene may be a newly added scene: the scene recognition network to be trained can be trained for the newly added scene, yielding a scene recognition network able to identify both whether scene data belongs to the new scene and whether it belongs to the original scenes.
  • The target scene may also be an existing scene: the scene recognition network to be trained can be trained for that existing scene, so that the recognition function for the existing scene is updated while the recognition functions for other scenes remain unchanged. Training the scene recognition network with the solution of this embodiment is therefore more convenient and faster.
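The property that training for one target scene leaves other scenes' recognition functions unchanged can be sketched as below. This is a toy per-scene head table with a simple sigmoid/delta-rule update (an assumption — the patent does not fix the update rule here); the point illustrated is that only the target scene's weights are touched:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_scene_head(heads, scene_id, samples, labels, lr=0.5, epochs=200):
    """Train (or add) the head for one target scene; every other head is untouched.

    heads   : dict mapping scene id -> weight vector
    samples : training feature vectors (positive and negative samples)
    labels  : 1 for the target scene's positive samples, 0 for negatives
    """
    w = heads.get(scene_id, np.zeros(samples.shape[1])).copy()
    for _ in range(epochs):
        for x, y_gt in zip(samples, labels):
            y_pr = sigmoid(np.dot(w, x))
            w += lr * (y_gt - y_pr) * x      # only this head's weights change
    heads[scene_id] = w
    return heads

heads = {"A": np.array([1.0, -1.0])}         # existing scene A stays frozen below
X = np.array([[1.0, 1.0], [-1.0, -1.0]])     # features for new scene B's samples
y = np.array([1, 0])                         # positive / negative labels
train_scene_head(heads, "B", X, y)
```

After the call, `heads["A"]` is exactly as before, while `heads["B"]` separates scene B's positive samples from the negatives — adding a scene without retraining the rest.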
  • FIG. 12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • the scene recognition network includes multiple scene networks respectively corresponding to different scenes.
  • the method includes but is not limited to step 510, step 520, step 530, step S540 and step S550.
  • Step 510: extract training features according to the target scene identifier of the target scene, where the training features include features of training positive samples and features of training negative samples.
  • The training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene.
  • Step 520: pass the training features through the network to be trained corresponding to the target scene to obtain a training recognition result corresponding to the network to be trained.
  • The network to be trained is an existing scene network or a new scene network.
  • Step 530: determine the weight of the network to be trained according to the training recognition result, the label of the training positive sample, and the label of the training negative sample, to obtain the trained scene network.
  • The training mechanism of the network to be trained is as follows (the equation itself does not appear in this text; the listed quantities are), where:
  • Y_pr is the obtained output;
  • Y_gt is the correct output;
  • W is the weight;
  • X is the input;
  • σ is the activation function (a sigmoid);
  • η is a constant.
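One plausible reading of the variables listed above is a delta-rule update with a sigmoid activation — an assumption, since the equation itself is missing from the text: Y_pr = σ(W·X) and W ← W + η·(Y_gt − Y_pr)·X. A single step of that reading:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_step(w, x, y_gt, eta=0.1):
    """One weight update: W <- W + eta * (Y_gt - Y_pr) * X, with Y_pr = sigmoid(W . X)."""
    y_pr = sigmoid(np.dot(w, x))
    return w + eta * (y_gt - y_pr) * x

w0 = np.zeros(2)
x = np.array([1.0, 2.0])
w1 = delta_step(w0, x, y_gt=1.0)
# At w0 = 0, Y_pr = sigmoid(0) = 0.5, so the step is 0.1 * 0.5 * x = [0.05, 0.10]
```

Each step moves the weight toward producing the correct output for the given input, which matches the role of the labels of the positive and negative samples in Step 530.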
  • Step S540: extract features of the scene data to be recognized.
  • Step S550: pass the extracted features through each scene network in parallel to obtain a scene recognition result corresponding to each scene network.
  • In the related art, when a new scene recognition function needs to be added, the neural network is retrained using both the samples corresponding to the original scene recognition functions and the samples corresponding to the new one.
  • For example, the original neural network can recognize scene A but cannot recognize scene B.
  • The neural network is then retrained with samples of both scene A and scene B, so that it can output the similarity between scene data and scenes A and B; for example, the similarity to scene A is 30% and the similarity to scene B is 60%.
  • Likewise, when a scene recognition function needs to be updated, the neural network is retrained using the samples for the function being updated together with the samples for all other functions that do not need updating.
  • For example, the original neural network can recognize scene A and scene B; if the ability to recognize scene B needs to be updated, the network is retrained with the samples of scene A and the updated samples of scene B.
  • FIG. 13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application.
  • the scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers respectively corresponding to different scenes.
  • the method includes, but is not limited to, steps S610, S620, S630, S640, and S650.
  • Step S610: extract training features according to the target scene identifier of the target scene, where the training features include features of training positive samples and features of training negative samples.
  • Step S620: input the training features and the target scene identifier into the attention network to be trained, and obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier.
  • the target scene identifier corresponds to an existing subnet or a new subnet in the network to be trained.
  • Step S630: according to the training recognition result, the label of the training positive sample, and the label of the training negative sample, determine the weight of the attention network to be trained corresponding to the target scene identifier, and obtain the trained attention network corresponding to the target scene identifier.
  • Step S640: extract features of the scene data to be recognized.
  • Step S650: traverse the multiple scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each scene identifier.
  • In the related art, when a new scene recognition function needs to be added, the neural network is retrained using both the samples corresponding to the original scene recognition functions and the samples corresponding to the new one.
  • For example, the original neural network can recognize scene A but cannot recognize scene B.
  • The neural network is then retrained with samples of both scene A and scene B, so that it can output the similarity between scene data and scenes A and B; for example, the similarity to scene A is 30% and the similarity to scene B is 60%.
  • Likewise, when a scene recognition function needs to be updated, the neural network is retrained using the samples for the function being updated together with the samples for all other functions that do not need updating.
  • For example, the original neural network can recognize scene A and scene B; if the ability to recognize scene B needs to be updated, the network is retrained with the samples of scene A and the updated samples of scene B.
  • FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The electronic device includes:
  • one or more processors 810;
  • a memory 820 on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any one of the scene recognition methods in the embodiments of the present application;
  • one or more I/O interfaces 830, connected between the processor and the memory, configured to realize information exchange between the processor and the memory.
  • The processor 810 is a device with data processing capability, including but not limited to a central processing unit (CPU);
  • the memory 820 is a device with data storage capability, including but not limited to random access memory (RAM, e.g. SDRAM or DDR), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH);
  • the I/O interface (read/write interface) 830 is connected between the processor 810 and the memory 820 and realizes information interaction between them; it includes but is not limited to a data bus (Bus).
  • The processor 810, the memory 820, and the I/O interface 830 are interconnected by a bus 840, which in turn is connected to other components of the computing device.
  • FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application.
  • the computer-readable medium has a computer program stored thereon, and when the program is executed by the processor, any one of the scene recognition methods in the embodiments of the present application is implemented.
  • The various embodiments of the present application may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof.
  • Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although the application is not limited thereto.
  • Embodiments of the present application may be implemented by the execution of computer program instructions by a data processor of a mobile device, e.g. in a processor entity, by hardware, or by a combination of software and hardware.
  • The computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages.
  • The block diagrams of any logic flow in the figures of the present application may represent program steps, interconnected logic circuits, modules, and functions, or a combination of program steps and logic circuits, modules, and functions.
  • Computer programs can be stored on a memory.
  • The memory may be of any type suitable for the local technical environment and may be implemented using any suitable data storage technology, such as but not limited to read-only memory (ROM), random access memory (RAM), and optical memory devices and systems (DVD or CD discs).
  • Computer-readable media may include non-transitory storage media.
  • The data processor may be of any type suitable for the local technical environment, such as, but not limited to, a general-purpose computer, a special-purpose computer, a microprocessor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (FPGA), or a processor based on a multi-core processor architecture.

Abstract

Provided are a scene recognition method and system, and an electronic device and a computer-readable medium. The method comprises: extracting a feature of scene data to be recognized; and inputting the extracted feature into a scene recognition network for recognition, so as to obtain a plurality of scene recognition results respectively corresponding to different scenes. By means of the scene recognition method and system provided in the present application, a feature of extracted scene data is input into a scene recognition network which is capable of recognizing, for various scenes, whether the scene data is a corresponding scene, so as to obtain a plurality of scene recognition results respectively corresponding to different scenes. Compared with the related art whereby only the similarity between scene data and each scene can be obtained, the solution of the present application provides a higher-accuracy recognition result.

Description

A scene recognition method and system, electronic device, and medium

Technical Field

The present application relates to the field of recognition technologies, and in particular to a scene recognition method and system, an electronic device, and a computer-readable medium.

Background

A neural network is a mathematical model that processes information using a structure similar to the synaptic connections of the brain. Neural networks can be used to recognize scenes.

However, in some related technologies, the accuracy and flexibility of scene recognition by neural networks are poor.
Summary

The present application provides a scene recognition method and system, an electronic device, and a computer-readable medium, which realize accurate recognition of various scenes.

To achieve the above purpose, an embodiment of the present application provides a scene recognition method, including: extracting features of scene data to be recognized; and inputting the extracted features into a scene recognition network for recognition to obtain multiple scene recognition results respectively corresponding to different scenes.

To achieve the above purpose, an embodiment of the present application provides a scene recognition system, including: a backbone network configured to extract features of scene data to be recognized; and a scene recognition network device including a scene recognition network, the scene recognition network device being configured to input the extracted features into the scene recognition network to obtain multiple scene recognition results respectively corresponding to different scenes.

To achieve the above purpose, an embodiment of the present application provides an electronic device, which includes:

one or more processors;

a memory on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any one of the scene recognition methods in the embodiments of the present application; and

one or more I/O interfaces, connected between the processor and the memory, configured to realize information exchange between the processor and the memory.

To achieve the above purpose, embodiments of the present application provide a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements any one of the scene recognition methods in the embodiments of the present application.

The scene recognition method and system, electronic device, and computer-readable medium proposed in the present application input the features of extracted scene data into a scene recognition network capable of identifying, for multiple scenes, whether the scene data belongs to the corresponding scene, and obtain multiple scene recognition results respectively corresponding to different scenes. Compared with the related art, which can only obtain the similarity between scene data and each scene, the recognition result of the solution of the present application is more accurate.
Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of a scene recognition system provided by an embodiment of the present application;

FIG. 8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application.
Detailed Description

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, not to limit it. It should be noted that although functional modules are divided in the schematic diagrams of the devices and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a different module division than in the devices, or in a different order than in the flowcharts.

The embodiments of the present application are further described below with reference to the accompanying drawings.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 110 and a scene recognition network device 120.

The backbone network 110 is configured to extract features of the scene data to be recognized.

The backbone network is responsible for feature extraction from scene data. The scene data includes at least one of scene video data, scene picture data, and scene text data. When the scene data is scene text data, the backbone network is a deep neural network pre-trained on text, and passing the scene data through the backbone network yields a vector representing text features. When the scene data is scene video data or scene picture data, the backbone network is a deep neural network pre-trained on ImageNet, and passing the scene data through the backbone network yields a vector representing image features. Optionally, the backbone network is the front part of a multi-layer deep neural network with the last few fully connected layers removed.
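The "front part of the network with the last fully connected layers removed" idea can be sketched as follows. The layers and weights here are a toy stand-in for a pretrained network, not the patent's actual backbone:

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 16))    # early feature-extraction layer
W1 = rng.standard_normal((16, 10))   # fully connected classification layer
W2 = rng.standard_normal((10, 3))    # final fully connected layer

layers = [
    lambda x: np.maximum(x @ W0, 0),
    lambda x: np.maximum(x @ W1, 0),
    lambda x: x @ W2,
]

# Backbone = the pretrained network with the last two fully connected layers removed.
backbone = layers[:-2]

def extract_features(x):
    for layer in backbone:
        x = layer(x)
    return x            # a vector representing the scene data's features

feat = extract_features(np.ones(8))  # feat.shape == (16,)
```

The resulting vector is what the scene recognition network consumes; the discarded classification layers are replaced by the per-scene heads described later.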
In this embodiment, optionally, scene data is collected by a collection device such as a camera or a microphone, and the collected scene data is stored in memory.

The scene recognition network device 120 includes a scene recognition network and is configured to input the extracted features into the scene recognition network to obtain multiple scene recognition results respectively corresponding to different scenes.

The scene recognition network can identify, for multiple scenes, whether the scene data belongs to the corresponding scene.

With the solution of this embodiment, the features of the extracted scene data are input into the scene recognition network for recognition, and multiple scene recognition results respectively corresponding to different scenes are obtained; each scene recognition result indicates whether the scene data belongs to the corresponding scene. Compared with the related art, which can only obtain the similarity between the scene data and each scene, the recognition result of this embodiment is more accurate.
As shown in FIG. 2, FIG. 2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 210 and a scene recognition network device 220. The scene recognition network device 220 includes a scene recognition network, and the scene recognition network includes multiple scene networks respectively corresponding to different scenes, for example, scene network 1, scene network 2, and scene network 3 in FIG. 2.

The scene recognition network device 220 is configured to pass the extracted features through each scene network in parallel to obtain a scene recognition result corresponding to each scene network.

Each scene network may be a single fully connected layer or a multi-layer perceptron (MLP), and each scene network is called a head. Multiple heads can exist in parallel without affecting each other, and new heads can be added. Each head outputs a binary classification, i.e., whether the scene data belongs to the scene corresponding to that scene network. A scene recognition network composed of multiple scene networks may also be called a multi-head network.

With the solution of the related art, for scene data N, the scene recognition result output by the neural network is a similarity to each scene rather than a definite answer as to which scene it is; for example, the similarity between scene data N and scene A is 40%, between N and scene B 30%, and between N and scene C 30%, so recognition accuracy is poor. With the solution of this embodiment, the extracted features are passed through the different scene networks in parallel, and a scene recognition result is obtained for each scene network. For example, for scene data N, scene network 1 outputs recognition result 1, indicating that scene data N matches the scene corresponding to scene network 1; scene network 2 outputs recognition result 0, indicating that scene data N does not match the scene corresponding to scene network 2; and scene network 3 outputs recognition result 0, indicating that scene data N does not match the scene corresponding to scene network 3. It is thus clear that scene data N is the scene data corresponding to scene network 1, and the recognition result is more accurate.
As shown in FIG. 3, FIG. 3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 310 and a scene recognition network device 320. The scene recognition network device 320 includes a scene recognition network, the scene recognition network includes an attention network, the attention network includes subnets corresponding to different scene identifiers, and the multiple scene identifiers respectively correspond to different scenes.

The scene recognition network device 320 is configured to pass the extracted features through the subnet corresponding to each scene identifier to obtain a scene recognition result corresponding to each scene identifier.

The attention network is a kind of gated network. For each attention input (in this embodiment, a scene identifier), a subset of the neural network nodes are connected, and the connected nodes form a subnet. The attention input may take the form of a one-hot code or a set of activity values. For example, with one-hot coding, the scene identifier of scene A is [1, 0]: the corresponding gated branch A is opened (subnet A works) and gated branch B is closed, so the neurons controlled by gated branch A are active while the neurons controlled by gated branch B are inhibited (they produce no output regardless of the input). The scene identifier of scene B is [0, 1]: gated branch B is opened (subnet B works) and gated branch A is closed, so the neurons controlled by gated branch B are active while the neurons controlled by gated branch A are inhibited. Alternatively, the gating input is a set of values, each of which gives the activation activity of one gated branch; for example, if the activity of gated branch A is 0.2 and the activity of gated branch B is 0.8, the gating input [0.2, 0.8] opens gated branch B (subnet B works) and closes gated branch A.
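The gating behavior described above can be sketched as follows. The branches, weights, and the 0.5 gate threshold are hypothetical choices for illustration, not taken from the patent:

```python
import numpy as np

def gated_forward(features, branch_weights, attention):
    """The attention input decides which subnet's neurons may fire.

    `attention` is a one-hot code or a list of activity values: a branch whose
    gate value is below 0.5 is inhibited and produces no output regardless of
    its input; the open branch computes its output normally.
    """
    return [
        float(np.dot(w, features)) if gate >= 0.5 else 0.0
        for gate, w in zip(attention, branch_weights)
    ]

branches = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]   # subnets A and B
x = np.array([3.0, 1.0])
out_a = gated_forward(x, branches, attention=[1, 0])       # one-hot: scene A's id
out_b = gated_forward(x, branches, attention=[0.2, 0.8])   # activity values: scene B
# out_a == [4.0, 0.0]; out_b == [0.0, 2.0]
```

Both the one-hot code [1, 0] and the activity vector [0.2, 0.8] resolve to exactly one active branch, matching the branch-opening examples in the text.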
With the solution of the related art, for scene data N, the scene recognition result output by the neural network is a similarity to each scene rather than a definite answer as to which scene it is; for example, the similarity to scene A is 40%, to scene B 30%, and to scene C 30%, so recognition accuracy is poor. With the solution of this embodiment, a scene recognition result corresponding to each scene identifier is obtained through the subnet corresponding to that identifier. For example, for scene data N, subnet A outputs recognition result 1, indicating that scene data N matches scene A; subnet B outputs recognition result 0, indicating that scene data N does not match scene B; and subnet C outputs recognition result 0, indicating that scene data N does not match scene C. It is thus clear that scene data N is the scene data corresponding to subnet A, and the recognition result is more accurate.
As shown in FIG. 4, FIG. 4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 410, a backbone network 420, and a scene recognition network device 430.

The positive sample device 410 is configured to output scene data to be recognized to the backbone network.

The positive sample device collects data of the current scene, obtaining text data, image data, or video data as the scene data to be recognized.

The backbone network 420 is configured to extract features of the scene data to be recognized.

The scene recognition network device 430 includes a scene recognition network and is configured to input the extracted features into the scene recognition network to obtain multiple scene recognition results respectively corresponding to different scenes.

With the solution of this embodiment, the features of the extracted scene data are input into the scene recognition network for recognition, and multiple scene recognition results respectively corresponding to different scenes are obtained; each scene recognition result indicates whether the scene data belongs to the corresponding scene. Compared with the related art, which can only obtain the similarity between the scene data and each scene, the recognition result of this embodiment is more accurate.
如图5所示,图5是本申请实施例提供的场景识别系统的结构示意图。该系统包括但不限于正样本装置510、负样本产生器520、场景标识装置530、骨干网络540和场景识别网络装置550。As shown in FIG. 5 , FIG. 5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 510 , a negative sample generator 520 , a scene identification device 530 , a backbone network 540 and a scene identification network device 550 .
正样本装置510,设置为向骨干网络输出训练正样本。The positive sample device 510 is configured to output training positive samples to the backbone network.
负样本产生器520,设置为向骨干网络输出训练负样本。The negative sample generator 520 is configured to output training negative samples to the backbone network.
其中训练正样本是选定场景文件,训练负样本是除选定场景外的其他场景文件。场景文件与场景数据的区别在于:场景数据是指直接存储在存储空间(例如内存)中的采集到的场景的数据,场景文件是场景数据的有序集合。举例说明,读取内存上0~127这128个扇区的数据,或者读取内存中X目录下的tellme.txt文件的前128字节。The training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene. A scene file differs from scene data in that scene data refers to the collected data of a scene stored directly in a storage space (e.g., memory), whereas a scene file is an ordered collection of scene data, for example, the data of the 128 sectors 0 to 127 in memory, or the first 128 bytes of the file tellme.txt under directory X in memory.
场景标识装置530,设置为获取目标场景的目标场景标识,并将目标场景标识输出给骨干网络。目标场景标识设置为标识选定场景,选定场景即目标场景。The scene identification device 530 is configured to obtain the target scene identifier of the target scene and output the target scene identifier to the backbone network. The target scene identifier identifies the selected scene, and the selected scene is the target scene.
骨干网络540,设置为根据目标场景标识提取训练特征,训练特征包括训练正样本的训练特征和训练负样本的训练特征。场景识别网络装置550,设置为根据训练特征,针对目标场景对待训练场景识别网络进行训练,得到场景识别网络。The backbone network 540 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples. The scene recognition network device 550 is configured to train the scene recognition network to be trained for the target scene according to the training feature to obtain the scene recognition network.
在本实施例中,目标场景可以是新增的场景,可以针对新增的场景对待训练场景识别网络进行训练,得到能够识别场景数据是否为新增的场景、且能够识别场景数据是否为原有的场景的场景识别网络;目标场景也可以是已有场景,可以针对该已有场景对待训练场景识别网络进行训练,从而对针对该已有场景的识别功能进行更新,针对其他场景的识别功能则保持不变。采用本实施例的方案对场景识别网络进行训练更加方便快捷。In this embodiment, the target scene may be a newly added scene; the scene recognition network to be trained can be trained for the newly added scene, yielding a scene recognition network that can identify whether scene data belongs to the newly added scene while still identifying whether it belongs to the original scenes. The target scene may also be an existing scene; the scene recognition network to be trained can be trained for that existing scene, so that the recognition function for it is updated while the recognition functions for the other scenes remain unchanged. Training the scene recognition network with the solution of this embodiment is thus more convenient and faster.
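The key property claimed here, training only the part of the network for the target scene while leaving everything else untouched, can be sketched as below. The helper `train_head`, the sample features, and the learning-rate/epoch values are illustrative assumptions, not values from the application.

```python
import math

def train_head(samples, lr=0.5, epochs=200):
    """Fit one per-scene head (a logistic unit) on (features, label) pairs;
    label 1 marks training positive samples, label 0 training negatives."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

scene_heads = {"scene_A": [2.0, 1.0]}           # existing head: left untouched
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]    # target-scene pos/neg features
scene_heads["scene_new"] = train_head(samples)  # only the new head is trained
```

Whether the target scene is brand-new or replaces an existing entry, the weights of every other head stay exactly as they were.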
如图6所示,图6是本申请实施例提供的场景识别系统的结构示意图。该系统包括但不限于正样本装置610、负样本产生器620、场景标识装置630、骨干网络640和场景识别网络装置650。As shown in FIG. 6 , FIG. 6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 610 , a negative sample generator 620 , a scene identification device 630 , a backbone network 640 and a scene identification network device 650 .
正样本装置610,设置为向骨干网络输出训练正样本。The positive sample device 610 is configured to output training positive samples to the backbone network.
负样本产生器620,设置为向骨干网络输出训练负样本。The negative sample generator 620 is configured to output training negative samples to the backbone network.
其中训练正样本是选定场景文件,训练负样本是除选定场景外的其他场景文件。场景文件与场景数据的区别在于:场景数据是指直接存储在存储空间(例如内存)中的采集到的场景的数据,场景文件是场景数据的有序集合。举例说明,读取内存上0~127这128个扇区的数据,或者读取内存中X目录下的tellme.txt文件的前128字节。The training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene. A scene file differs from scene data in that scene data refers to the collected data of a scene stored directly in a storage space (e.g., memory), whereas a scene file is an ordered collection of scene data, for example, the data of the 128 sectors 0 to 127 in memory, or the first 128 bytes of the file tellme.txt under directory X in memory.
场景标识装置630,设置为获取目标场景的目标场景标识,并将目标场景标识输出给骨干网络。目标场景标识设置为标识选定场景,选定场景即目标场景。The scene identification device 630 is configured to obtain the target scene identifier of the target scene and output the target scene identifier to the backbone network. The target scene identifier identifies the selected scene, and the selected scene is the target scene.
骨干网络640,设置为根据目标场景标识提取训练特征,训练特征包括训练正样本的训练特征和训练负样本的训练特征。The backbone network 640 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.
场景识别网络装置650,包括分别对应不同场景的多个场景网络和新场景网络,目标场景的目标场景标识与新场景网络对应;场景识别网络装置650设置为将训练正样本和训练负样本的训练特征通过新场景网络,得到新场景网络对应的训练识别结果;根据新场景网络对应的训练识别结果、所述训练正样本的标签和所述训练负样本的标签,确定新场景网络的权重,得到训练后的场景网络。The scene recognition network device 650 includes multiple scene networks respectively corresponding to different scenes, and a new scene network; the target scene identifier of the target scene corresponds to the new scene network. The scene recognition network device 650 is configured to pass the training features of the training positive samples and the training negative samples through the new scene network to obtain the training recognition result corresponding to the new scene network, and to determine the weights of the new scene network according to that training recognition result, the labels of the training positive samples, and the labels of the training negative samples, thereby obtaining the trained scene network.
或者,场景识别网络装置650,包括分别对应不同场景的多个场景网络,目标场景的目标场景标识与场景识别网络装置中的一个已有场景网络对应;场景识别网络装置650设置为将训练特征通过已有场景网络,得到已有场景网络对应的训练识别结果;根据已有场景网络对应的训练识别结果、训练正样本的标签和训练负样本的标签,更新已有场景网络的权重,得到更新后的场景网络。Alternatively, the scene recognition network device 650 includes multiple scene networks respectively corresponding to different scenes, and the target scene identifier of the target scene corresponds to one existing scene network in the scene recognition network device. The scene recognition network device 650 is configured to pass the training features through the existing scene network to obtain the training recognition result corresponding to the existing scene network, and to update the weights of the existing scene network according to that training recognition result, the labels of the training positive samples, and the labels of the training negative samples, thereby obtaining the updated scene network.
可选的,可以通过按钮触发、按键触发或者发送指令等方式指示多头网络装置识别场景数据、训练新场景网络或者更新已有场景网络。Optionally, the multi-head network device may be instructed to recognize scene data, train a new scene network, or update an existing scene network by means of a button trigger, a key press, or a sent instruction.
相关技术中,需要增加新场景识别功能的情况下,根据原有场景识别功能对应的样本以及新场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A,而无法识别场景B,需要增加识别场景B的情况下,则根据场景A和场景B的样本重新训练神经网络,从而可以识别场景数据与场景A以及场景B的相似度,例如,场景数据与场景A的相似度为30%,与场景B的相似度为60%。采用本实施例的方案,场景识别网络装置需要增加新场景识别功能的情况下,无需对整个场景识别网络重新训练,仅对新场景网络进行训练即可,训练方便快捷,识别灵活准确。In the related art, when a new scene recognition function needs to be added, the neural network is retrained on the samples corresponding to the original scene recognition functions together with the samples corresponding to the new scene recognition function. For example, if the original neural network can recognize scene A but cannot recognize scene B, adding recognition of scene B requires retraining the neural network on samples of scene A and scene B, after which it can only give the similarity between the scene data and scene A and scene B, e.g., 30% similarity to scene A and 60% similarity to scene B. With the solution of this embodiment, when the scene recognition network device needs to add a new scene recognition function, there is no need to retrain the entire scene recognition network; only the new scene network is trained. Training is convenient and fast, and recognition is flexible and accurate.
相关技术中,需要更新场景识别功能的情况下,根据需要更新的场景识别功能对应的样本以及其他无需更新的场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A和场景B,需要更新识别场景B的能力的情况下,则根据场景A和更新后的场景B的样本重新训练神经网络。采用本实施例的方案,场景识别网络装置需要更新场景识别功能的情况下,无需对整个场景识别网络重新训练,仅对需要更新的场景网络重新进行训练即可,更新方便快捷。In the related art, when a scene recognition function needs to be updated, the neural network is retrained on the samples corresponding to the scene recognition function to be updated together with the samples corresponding to the other scene recognition functions that need no update. For example, if the original neural network can recognize scene A and scene B, and the ability to recognize scene B needs to be updated, the neural network is retrained on samples of scene A and of the updated scene B. With the solution of this embodiment, when the scene recognition network device needs to update a scene recognition function, there is no need to retrain the entire scene recognition network; only the scene network to be updated is retrained, which makes updating convenient and fast.
如图7所示,图7是本申请实施例提供的场景识别系统的结构示意图。该系统包括但不限于正样本装置710、负样本产生器720、场景标识装置730、骨干网络740和场景识别网络装置750。场景识别网络装置750包括场景识别网络,场景识别网络包括注意力网络,注意力网络包括不同场景标识对应的子网,多个场景标识分别对应不同场景。As shown in FIG. 7 , FIG. 7 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 710 , a negative sample generator 720 , a scene identification device 730 , a backbone network 740 and a scene identification network device 750 . The scene identification network device 750 includes a scene identification network, the scene identification network includes an attention network, the attention network includes subnetworks corresponding to different scene identifiers, and multiple scene identifiers correspond to different scenes respectively.
正样本装置710,设置为向骨干网络输出训练正样本。The positive sample device 710 is configured to output training positive samples to the backbone network.
负样本产生器720,设置为向骨干网络输出训练负样本。The negative sample generator 720 is configured to output training negative samples to the backbone network.
其中训练正样本是选定场景文件,训练负样本是除选定场景外的其他场景文件。场景文件与场景数据的区别在于:场景数据是指直接存储在存储空间(例如内存)中的采集到的场景的数据,场景文件是场景数据的有序集合。举例说明,读取内存上0~127这128个扇区的数据,或者读取内存中X目录下的tellme.txt文件的前128字节。The training positive samples are files of the selected scene, and the training negative samples are files of scenes other than the selected scene. A scene file differs from scene data in that scene data refers to the collected data of a scene stored directly in a storage space (e.g., memory), whereas a scene file is an ordered collection of scene data, for example, the data of the 128 sectors 0 to 127 in memory, or the first 128 bytes of the file tellme.txt under directory X in memory.
场景标识装置730,设置为获取目标场景的目标场景标识,并将目标场景标识输出给骨干网络。目标场景标识设置为标识选定场景,选定场景即目标场景。The scene identification device 730 is configured to obtain the target scene identifier of the target scene and output the target scene identifier to the backbone network. The target scene identifier identifies the selected scene, and the selected scene is the target scene.
骨干网络740,设置为根据目标场景标识提取训练特征,训练特征包括训练正样本的训练特征和训练负样本的训练特征。The backbone network 740 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.
场景识别网络装置750,设置为将训练特征和目标场景标识输入待训练注意力网络,得到目标场景标识对应的待训练注意力网络的训练识别结果;根据训练识别结果、训练正样本的标签和训练负样本的标签,确定目标场景标识对应的待训练注意力网络的权重,得到目标场景标识对应的训练后的注意力网络。The scene recognition network device 750 is configured to input the training features and the target scene identifier into the attention network to be trained to obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier, and to determine, according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, the weights of the attention network to be trained corresponding to the target scene identifier, thereby obtaining the trained attention network corresponding to the target scene identifier.
其中,场景标识装置获取的目标场景标识对应的子网可以是新子网,即该训练过程为新子网(新场景)的训练过程;场景标识装置获取的目标场景标识对应的子网可以是已有的子网,即该训练过程为已有子网(已有场景)的更新过程。The subnet corresponding to the target scene identifier acquired by the scene identification device may be a new subnet, in which case the training process is the training of a new subnet (a new scene); it may also be an existing subnet, in which case the training process is the update of an existing subnet (an existing scene).
可选的,可以通过按钮触发、按键触发或者发送指令等方式指示注意力网络识别场景数据、训练新场景网络或者更新已有场景网络。Optionally, the attention network may be instructed to recognize scene data, train a new scene network, or update an existing scene network by means of a button trigger, a key press, or a sent instruction.
相关技术中,需要增加新场景识别功能的情况下,根据原有场景识别功能对应的样本以及新场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A,而无法识别场景B,需要增加识别场景B的情况下,则根据场景A和场景B的样本重新训练神经网络,从而可以识别场景数据与场景A以及场景B的相似度,例如,场景数据与场景A的相似度为30%,与场景B的相似度为60%。采用本实施例的方案,注意力网络需要增加新场景识别功能的情况下,无需对整个注意力网络重新训练,仅对新场景对应的子网进行训练即可,训练方便快捷,识别灵活准确。In the related art, when a new scene recognition function needs to be added, the neural network is retrained on the samples corresponding to the original scene recognition functions together with the samples corresponding to the new scene recognition function. For example, if the original neural network can recognize scene A but cannot recognize scene B, adding recognition of scene B requires retraining the neural network on samples of scene A and scene B, after which it can only give the similarity between the scene data and scene A and scene B, e.g., 30% similarity to scene A and 60% similarity to scene B. With the solution of this embodiment, when the attention network needs to add a new scene recognition function, there is no need to retrain the entire attention network; only the subnet corresponding to the new scene is trained. Training is convenient and fast, and recognition is flexible and accurate.
相关技术中,需要更新场景识别功能的情况下,根据需要更新的场景识别功能对应的样本以及其他无需更新的场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A和场景B,需要更新识别场景B的能力的情况下,则根据场景A和更新后的场景B的样本重新训练神经网络。采用本实施例的方案,注意力网络需要更新场景识别功能的情况下,无需对整个注意力网络重新训练,仅对需要更新的场景子网重新进行训练即可,更新方便快捷。In the related art, when a scene recognition function needs to be updated, the neural network is retrained on the samples corresponding to the scene recognition function to be updated together with the samples corresponding to the other scene recognition functions that need no update. For example, if the original neural network can recognize scene A and scene B, and the ability to recognize scene B needs to be updated, the neural network is retrained on samples of scene A and of the updated scene B. With the solution of this embodiment, when the attention network needs to update a scene recognition function, there is no need to retrain the entire attention network; only the scene subnet to be updated is retrained, which makes updating convenient and fast.
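How scene-identifier-keyed subnets isolate training and updating to one subnet can be sketched with a plain mapping; the subnet weights shown here are placeholders, not values from the application.

```python
# Subnets keyed by scene identifier. Training a new scene adds one entry;
# updating an existing scene replaces one entry; all others are untouched.
def set_subnet(subnets, scene_id, trained_weights):
    """Install freshly trained weights for exactly one scene identifier."""
    updated = dict(subnets)          # copy: other subnets keep their weights
    updated[scene_id] = trained_weights
    return updated

attention_subnets = {"scene_A": [2.0, 1.0], "scene_B": [-1.0, 3.0]}
after_update = set_subnet(attention_subnets, "scene_B", [0.5, 0.5])   # update
after_add = set_subnet(attention_subnets, "scene_C", [1.0, -1.0])     # add new
```

Either operation touches only the entry for the target scene identifier, which is the reason neither adding nor updating a scene requires retraining the whole network.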
如图8所示,图8是本申请实施例提供的场景识别方法的流程示意图。该方法包括但不限于步骤S110和步骤S120。As shown in FIG. 8 , FIG. 8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The method includes but is not limited to step S110 and step S120.
步骤S110、提取待识别的场景数据的特征。Step S110, extracting features of the scene data to be identified.
场景数据至少包括场景视频数据、场景图片数据和场景文本数据之一。可选的,待识别的场景数据的大小可以为64*64*3,相比于大小为32*32*3的场景数据,大小为64*64*3的场景数据分辨率更高,降维处理后更清楚。The scene data includes at least one of scene video data, scene picture data, and scene text data. Optionally, the scene data to be identified may have a size of 64*64*3; compared with scene data of size 32*32*3, scene data of size 64*64*3 has a higher resolution and remains clearer after dimensionality-reduction processing.
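The point that a 64*64*3 input stays clearer after dimensionality reduction than a 32*32*3 input can be illustrated with simple average pooling; the function below is an illustrative stand-in, not the reduction used in the application.

```python
def avg_pool(image, factor=2):
    """Reduce an H*W*C image (nested lists) by averaging factor*factor blocks;
    a 64*64*3 input pooled once still retains a 32*32*3 grid of detail."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    return [
        [
            [
                sum(image[i + di][j + dj][k]
                    for di in range(factor) for dj in range(factor)) / factor ** 2
                for k in range(c)
            ]
            for j in range(0, w, factor)
        ]
        for i in range(0, h, factor)
    ]

tiny = [[[1.0], [3.0]], [[5.0], [7.0]]]  # a 2*2*1 toy "scene image"
print(avg_pool(tiny))                    # [[[4.0]]]
```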
步骤S120、将提取的特征输入场景识别网络进行识别,得到分别对应不同场景的多个场景识别结果。Step S120: Input the extracted features into a scene recognition network for recognition, and obtain multiple scene recognition results corresponding to different scenes.
采取本实施例的方案,将提取的场景数据的特征输入场景识别网络进行识别,得到分别对应不同场景的多个场景识别结果,场景识别结果能够表征场景是否为对应的场景。相比于相关技术仅可以得到场景数据与各场景的相似度,本实施例的方案识别结果精确度更高。With the solution of this embodiment, the extracted features of the scene data are input into the scene recognition network for recognition, and multiple scene recognition results corresponding to different scenes are obtained; each scene recognition result indicates whether the scene data belongs to the corresponding scene. Compared with the related art, which can only obtain the similarity between the scene data and each scene, the recognition results of this embodiment are more accurate.
如图9所示,图9是本申请实施例提供的场景识别方法的流程示意图。场景识别网络包括分别对应不同场景的多个场景网络。该方法包括但不限于步骤S210、步骤S220。As shown in FIG. 9 , FIG. 9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes multiple scene networks respectively corresponding to different scenes. The method includes but is not limited to step S210 and step S220.
步骤S210、提取待识别的场景数据的特征。Step S210, extracting features of the scene data to be identified.
场景数据至少包括场景视频数据、场景图片数据和场景文本数据之一。可选的,待识别的场景数据的大小可以为64*64*3,相比于大小为32*32*3的场景数据,大小为64*64*3的场景数据分辨率更高,降维处理后更清楚。The scene data includes at least one of scene video data, scene picture data, and scene text data. Optionally, the scene data to be identified may have a size of 64*64*3; compared with scene data of size 32*32*3, scene data of size 64*64*3 has a higher resolution and remains clearer after dimensionality-reduction processing.
步骤S220、将提取的特征并行通过各个所述场景网络,分别得到各个所述场景网络对应的场景识别结果。Step S220: Pass the extracted features through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.
采用相关技术的方案,对于场景数据N而言,神经网络输出的场景识别结果为与各场景的近似度,而不是具体是否为哪个场景的准确结果,例如与场景A的近似度为40%,与场景B的近似度为30%,与场景C的近似度为30%,识别精确度差。采用本实施例的方案,将提取的特征并行通过不同场景网络,分别得到各场景网络对应的场景识别结果,例如,对于场景数据N而言,场景网络1输出识别结果1,表示场景数据N与场景网络1对应的场景近似,场景网络2输出识别结果0,表示场景数据N与场景网络2对应的场景不近似,场景网络3输出识别结果0,表示场景数据N与场景网络3对应的场景不近似,从而明确场景数据N为场景网络1对应的场景数据,识别结果精确度更高。With the solution of the related art, for scene data N the scene recognition result output by the neural network is a similarity to each scene rather than a definite result of which scene it is, e.g., 40% similarity to scene A, 30% to scene B, and 30% to scene C, so recognition accuracy is poor. With the solution of this embodiment, the extracted features are passed through the different scene networks in parallel, and each scene network yields its own scene recognition result. For example, for scene data N, scene network 1 outputs recognition result 1, indicating that scene data N matches the scene corresponding to scene network 1; scene network 2 outputs recognition result 0, indicating that scene data N does not match the scene corresponding to scene network 2; and scene network 3 outputs recognition result 0, indicating that scene data N does not match the scene corresponding to scene network 3. It is thus clear that scene data N belongs to the scene corresponding to scene network 1, and the recognition result is more accurate.
如图10所示,图10是本申请实施例提供的场景识别方法的流程示意图。场景识别网络包括注意力网络,所述注意力网络包括分别对应不同场景的多个场景标识。该方法包括但不限于步骤S310、步骤S320。As shown in FIG. 10 , FIG. 10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers respectively corresponding to different scenes. The method includes but is not limited to step S310 and step S320.
步骤S310、提取待识别的场景数据的特征。Step S310, extracting features of the scene data to be identified.
场景数据至少包括场景视频数据、场景图片数据和场景文本数据之一。可选的,待识别的场景数据的大小可以为64*64*3,相比于大小为32*32*3的场景数据,大小为64*64*3的场景数据分辨率更高,降维处理后更清楚。The scene data includes at least one of scene video data, scene picture data, and scene text data. Optionally, the scene data to be identified may have a size of 64*64*3; compared with scene data of size 32*32*3, scene data of size 64*64*3 has a higher resolution and remains clearer after dimensionality-reduction processing.
步骤S320、根据提取的特征遍历注意力网络的多个所述场景标识,得到各个所述场景标识对应的场景识别结果。Step S320 , traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.
采用相关技术的方案,对于场景数据N而言,神经网络输出的场景识别结果为与各场景的近似度,而不是具体是否为哪个场景的准确结果,例如与场景A的近似度为40%,与场景B的近似度为30%,与场景C的近似度为30%,识别精确度差。采用本实施例的方案,将通过不同场景标识对应的子网,分别得到各场景标识对应的场景识别结果,例如,对于场景数据N而言,子网A输出识别结果1,表示场景数据N与场景A近似,子网B输出识别结果0,表示场景数据N与场景B不近似,子网C输出识别结果0,表示场景数据N与场景C不近似,从而明确场景数据N为子网A对应的场景数据,识别结果精确度更高。With the solution of the related art, for scene data N the scene recognition result output by the neural network is a similarity to each scene rather than a definite result of which scene it is, e.g., 40% similarity to scene A, 30% to scene B, and 30% to scene C, so recognition accuracy is poor. With the solution of this embodiment, the features pass through the subnets corresponding to the different scene identifiers, and each scene identifier yields its own scene recognition result. For example, for scene data N, subnet A outputs recognition result 1, indicating that scene data N matches scene A; subnet B outputs recognition result 0, indicating that scene data N does not match scene B; and subnet C outputs recognition result 0, indicating that scene data N does not match scene C. It is thus clear that scene data N belongs to the scene corresponding to subnet A, and the recognition result is more accurate.
如图11所示,图11是本申请实施例提供的场景识别方法的流程示意图。该方法包括但不限于步骤S410、步骤S420、步骤S430、步骤S440。As shown in FIG. 11 , FIG. 11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The method includes but is not limited to step S410, step S420, step S430, and step S440.
步骤S410、根据目标场景的目标场景标识提取训练特征,所述训练特征包括训练正样本的训练特征和训练负样本的训练特征。Step S410: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.
其中训练正样本是选定场景文件,训练负样本是除选定场景外的其他场景文件。The training positive samples are selected scene files, and the training negative samples are other scene files except the selected scene.
步骤S420、根据所述训练特征,针对所述目标场景对待训练场景识别网络进行训练,得到所述场景识别网络。Step S420: According to the training feature, train the scene recognition network to be trained for the target scene to obtain the scene recognition network.
步骤S430、提取待识别的场景数据的特征。Step S430, extracting features of the scene data to be identified.
场景数据至少包括场景视频数据、场景图片数据和场景文本数据之一。可选的,待识别的场景数据的大小可以为64*64*3,相比于大小为32*32*3的场景数据,大小为64*64*3的场景数据分辨率更高,降维处理后更清楚。The scene data includes at least one of scene video data, scene picture data, and scene text data. Optionally, the scene data to be identified may have a size of 64*64*3; compared with scene data of size 32*32*3, scene data of size 64*64*3 has a higher resolution and remains clearer after dimensionality-reduction processing.
步骤S440、将提取的特征输入场景识别网络进行识别,得到分别对应不同场景的多个场景识别结果。Step S440: Input the extracted features into a scene recognition network for recognition, and obtain multiple scene recognition results corresponding to different scenes.
在本实施例中,目标场景可以是新增的场景,可以针对新增的场景对待训练场景识别网络进行训练,得到能够识别场景数据是否为新增的场景、且能够识别场景数据是否为原有的场景的场景识别网络;目标场景也可以是已有场景,可以针对该已有场景对待训练场景识别网络进行训练,从而对针对该已有场景的识别功能进行更新,针对其他场景的识别功能则保持不变。采用本实施例的方案对场景识别网络进行训练更加方便快捷。In this embodiment, the target scene may be a newly added scene; the scene recognition network to be trained can be trained for the newly added scene, yielding a scene recognition network that can identify whether scene data belongs to the newly added scene while still identifying whether it belongs to the original scenes. The target scene may also be an existing scene; the scene recognition network to be trained can be trained for that existing scene, so that the recognition function for it is updated while the recognition functions for the other scenes remain unchanged. Training the scene recognition network with the solution of this embodiment is thus more convenient and faster.
如图12所示,图12是本申请实施例提供的场景识别方法的流程示意图。场景识别网络包括分别对应不同场景的多个场景网络。该方法包括但不限于步骤510、步骤520、步骤530、步骤S540和步骤S550。As shown in FIG. 12 , FIG. 12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes multiple scene networks respectively corresponding to different scenes. The method includes but is not limited to step 510, step 520, step 530, step S540 and step S550.
步骤510、根据目标场景的目标场景标识提取训练特征,所述训练特征包括训练正样本的训练特征和训练负样本的训练特征。Step 510: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.
其中训练正样本是选定场景文件,训练负样本是除选定场景外的其他场景文件。The training positive samples are selected scene files, and the training negative samples are other scene files except the selected scene.
步骤520、将所述训练特征通过所述目标场景对应的待训练网络,得到所述待训练网络对应的训练识别结果。Step 520: Pass the training feature through the network to be trained corresponding to the target scene to obtain a training identification result corresponding to the network to be trained.
待训练网络为已有场景网络或者新场景网络。The network to be trained is an existing scene network or a new scene network.
步骤530、根据所述训练识别结果、所述训练正样本的标签和所述训练负样本的标签,确定所述待训练网络的权重,得到训练后的场景网络。Step 530: Determine the weight of the network to be trained according to the training recognition result, the label of the training positive sample and the label of the training negative sample, and obtain the trained scene network.
待训练网络的训练机制如下,其中:The training mechanism of the network to be trained is as follows, where:
Y_pr为得到的输出,Y_gt为正确的输出,W为权重,X为输入,σ为激活函数(sigmoid),η为常量。Y_pr is the obtained output, Y_gt is the correct output, W is the weight, X is the input, σ is the activation function (sigmoid), and η is a constant.
Y_pr = σ(WX), where WX ≡ Z
权重更新量为(W=W+ΔW):The weight update amount is (W=W+ΔW):
ΔW = η·(Y_gt − Y_pr)·σ′(Z)·X,其中σ′(Z) = σ(Z)(1 − σ(Z))。ΔW = η·(Y_gt − Y_pr)·σ′(Z)·X, where σ′(Z) = σ(Z)(1 − σ(Z)).
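In the published text the exact expression for ΔW survives only as an image reference, so the sketch below assumes the standard delta rule consistent with the symbols defined above (Y_pr = σ(WX), Z ≡ WX, learning constant η); treat it as a plausible reading, not the authoritative formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_update(w, x, y_gt, eta=0.1):
    """One step of W = W + dW with dW = eta*(Y_gt - Y_pr)*sigma'(Z)*X,
    where Z = W.X and sigma'(Z) = sigma(Z)*(1 - sigma(Z))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    y_pr = sigmoid(z)
    grad = (y_gt - y_pr) * y_pr * (1.0 - y_pr)   # sigma'(Z) = sigma(Z)(1-sigma(Z))
    return [wi + eta * grad * xi for wi, xi in zip(w, x)]

# One update from zero weights moves the prediction toward the label.
w1 = delta_update([0.0, 0.0], [1.0, 1.0], y_gt=1.0)
```

Repeated application of this update to the positive and negative training samples is what determines the weights of the network (or scene network) being trained.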
步骤S540、提取待识别的场景数据的特征。Step S540, extracting features of the scene data to be identified.
步骤S550、将提取的特征并行通过各个所述场景网络,分别得到各个所述场景网络对应的场景识别结果。Step S550: Pass the extracted features through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.
相关技术中,需要增加新场景识别功能的情况下,根据原有场景识别功能对应的样本以及新场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A,而无法识别场景B,需要增加识别场景B的情况下,则根据场景A和场景B的样本重新训练神经网络,从而可以识别场景数据与场景A以及场景B的相似度,例如,场景数据与场景A的相似度为30%,与场景B的相似度为60%。采用本实施例的方案,需要增加新场景识别功能的情况下,仅对新场景网络进行训练即可,训练方便快捷,识别灵活准确。In the related art, when a new scene recognition function needs to be added, the neural network is retrained on the samples corresponding to the original scene recognition functions together with the samples corresponding to the new scene recognition function. For example, if the original neural network can recognize scene A but cannot recognize scene B, adding recognition of scene B requires retraining the neural network on samples of scene A and scene B, after which it can only give the similarity between the scene data and scene A and scene B, e.g., 30% similarity to scene A and 60% similarity to scene B. With the solution of this embodiment, when a new scene recognition function needs to be added, only the new scene network is trained. Training is convenient and fast, and recognition is flexible and accurate.
相关技术中,需要更新场景识别功能的情况下,根据需要更新的场景识别功能对应的样本以及其他无需更新的场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A和场景B,需要更新识别场景B的能力的情况下,则根据场景A和更新后的场景B的样本重新训练神经网络。采用本实施例的方案,需要更新场景识别功能的情况下,仅对需要更新的场景网络(已有场景网络)重新进行训练即可,更新方便快捷。In the related art, when a scene recognition function needs to be updated, the neural network is retrained on the samples corresponding to the scene recognition function to be updated together with the samples corresponding to the other scene recognition functions that need no update. For example, if the original neural network can recognize scene A and scene B, and the ability to recognize scene B needs to be updated, the neural network is retrained on samples of scene A and of the updated scene B. With the solution of this embodiment, when a scene recognition function needs to be updated, only the scene network to be updated (the existing scene network) is retrained, which makes updating convenient and fast.
如图13所示,图13是本申请实施例提供的场景识别方法的流程示意图。场景识别网络包括注意力网络,所述注意力网络包括分别对应不同场景的多个场景标识。该方法包括但不限于步骤S610、步骤S620、步骤S630、步骤S640、步骤S650。As shown in FIG. 13 , FIG. 13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers respectively corresponding to different scenes. The method includes, but is not limited to, steps S610, S620, S630, S640, and S650.
步骤S610、根据目标场景的目标场景标识提取训练特征,所述训练特征包括训练正样本的训练特征和训练负样本的训练特征。Step S610: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.
步骤S620、将所述训练特征和所述目标场景标识输入待训练注意力网络,得到所述目标场景标识对应的待训练注意力网络的训练识别结果。Step S620: Input the training feature and the target scene identifier into the attention network to be trained, and obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier.
目标场景标识对应待训练网络中已有子网或者新子网。The target scene identifier corresponds to an existing subnet or a new subnet in the network to be trained.
步骤S630、根据所述训练识别结果、所述训练正样本的标签和所述训练负样本的标签,确定所述目标场景标识对应的待训练注意力网络的权重,得到所述目标场景标识对应的训练后的注意力网络。Step S630: Determine, according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, the weights of the attention network to be trained corresponding to the target scene identifier, and obtain the trained attention network corresponding to the target scene identifier.
步骤S640、提取待识别的场景数据的特征。Step S640, extracting features of the scene data to be identified.
步骤S650、根据提取的特征遍历注意力网络的多个所述场景标识,得到各个所述场景标识对应的场景识别结果。Step S650 , traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.
相关技术中,需要增加新场景识别功能的情况下,根据原有场景识别功能对应的样本以及新场景识别功能对应的样本重新训练神经网络,例如,原神经网络可以识别场景A,而无法识别场景B,需要增加识别场景B的情况下,则根据场景A和场景B的样本重新训练神经网络,从而可以识别场景数据与场景A以及场景B的相似度,例如,场景数据与场景A的相似度为30%,与场景B的相似度为60%。采用本实施例的方案,注意力网络需要增加新场景识别功能的情况下,无需对整个注意力网络重新训练,仅对新场景对应的子网进行训练即可,训练方便快捷,识别灵活准确。In the related art, when a new scene recognition function needs to be added, the neural network is retrained on the samples corresponding to the original scene recognition functions together with the samples corresponding to the new scene recognition function. For example, if the original neural network can recognize scene A but cannot recognize scene B, adding recognition of scene B requires retraining the neural network on samples of scene A and scene B, after which it can only give the similarity between the scene data and scene A and scene B, e.g., 30% similarity to scene A and 60% similarity to scene B. With the solution of this embodiment, when the attention network needs to add a new scene recognition function, there is no need to retrain the entire attention network; only the subnet corresponding to the new scene is trained. Training is convenient and fast, and recognition is flexible and accurate.
In the related art, when a scene recognition capability needs to be updated, the neural network is retrained using both the samples for the capability being updated and the samples for the other capabilities that do not need updating. For example, if the original neural network recognizes scene A and scene B and the ability to recognize scene B needs updating, the network is retrained on samples of scene A together with updated samples of scene B. With the solution of this embodiment, when a scene recognition capability of the attention network needs updating, the entire attention network need not be retrained; only the sub-network of the scene to be updated is retrained, making the update convenient and fast.
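The advantage described above, training only the sub-network for a new or updated scene while leaving the other sub-networks untouched, can be sketched as follows. Every specific here is an assumption made purely for illustration (a logistic-regression-style head trained by full-batch gradient descent; the names `scene_heads` and `train_new_scene` are hypothetical), since the patent leaves the concrete training rule open.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM = 8

# An existing, already-trained head: it stays frozen throughout.
scene_heads = {"scene_A": rng.normal(size=FEAT_DIM)}
frozen_before = {k: v.copy() for k, v in scene_heads.items()}

def train_new_scene(scene_id, pos_feats, neg_feats, lr=0.5, epochs=200):
    """Train ONE new head on positive/negative training features only.

    Gradient descent on the logistic log-loss; none of the existing
    heads are touched, mirroring the per-sub-network training above.
    """
    w = np.zeros(FEAT_DIM)
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # current predictions
        w -= lr * X.T @ (p - y) / len(y)            # full-batch gradient step
    scene_heads[scene_id] = w

# Training features of positive samples (scene files) and negative
# samples (non-scene files), separable by construction for the sketch.
pos = rng.normal(loc=+1.0, size=(20, FEAT_DIM))
neg = rng.normal(loc=-1.0, size=(20, FEAT_DIM))
train_new_scene("scene_B", pos, neg)
```

After the call, `scene_heads` contains a head for scene B while the weights for scene A are bit-for-bit unchanged, which is the point of per-scene sub-network training: adding or updating one scene never perturbs the others.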
FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device includes:
one or more processors 810;
a memory 820 storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the scene recognition methods in the embodiments of the present application; and
one or more I/O interfaces 830 connected between the processors and the memory and configured to enable information exchange between the processors and the memory.
The processor 810 is a device with data processing capability, including but not limited to a central processing unit (CPU). The memory 820 is a device with data storage capability, including but not limited to random access memory (RAM, e.g., SDRAM or DDR), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH). The I/O (read/write) interface 830 is connected between the processor 810 and the memory 820 and enables information exchange between them; it includes but is not limited to a data bus (Bus).
In some embodiments, the processor 810, the memory 820, and the I/O interface 830 are interconnected by a bus 840, which in turn connects to the other components of the computing device.
FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application. The computer-readable medium stores a computer program which, when executed by a processor, implements any of the scene recognition methods in the embodiments of the present application.
From the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software together with the necessary general-purpose hardware, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the related art, can be embodied as a software product. Such a computer software product can be stored in a computer-readable storage medium, such as a floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes instructions that cause a computing device (which may be a personal computer, a server, or a network device, among others) to execute the methods described in the embodiments of the present application.
The above are merely exemplary embodiments of the present application and are not intended to limit its scope of protection.
In general, the embodiments of the present application may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although the application is not limited thereto.
The embodiments of the present application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, or by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages.
A block diagram of any logic flow in the figures of the present application may represent program steps, may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps with logic circuits, modules, and functions. A computer program may be stored on a memory. The memory may be of any type suitable for the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read-only memory (ROM), random access memory (RAM), and optical storage devices and systems (DVD or CD discs). Computer-readable media may include non-transitory storage media. The data processor may be of any type suitable for the local technical environment, such as, but not limited to, a general-purpose computer, a special-purpose computer, a microprocessor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a programmable logic device (FPGA), or a processor based on a multi-core processor architecture.
The foregoing has provided, by way of exemplary and non-limiting example, a detailed description of exemplary embodiments of the present application. Considered in conjunction with the drawings and the claims, various modifications and adaptations of the above embodiments will be apparent to those skilled in the art without departing from the scope of the present invention. Accordingly, the proper scope of the invention is to be determined according to the claims.

Claims (23)

  1. A scene recognition method, comprising:
    extracting features of scene data to be recognized; and
    inputting the extracted features into a scene recognition network for recognition to obtain a plurality of scene recognition results respectively corresponding to different scenes.
  2. The method according to claim 1, wherein the scene recognition network comprises a plurality of scene networks respectively corresponding to different scenes, and inputting the extracted features into the scene recognition network for recognition to obtain the plurality of scene recognition results respectively corresponding to different scenes comprises:
    passing the extracted features through each of the scene networks in parallel to obtain a scene recognition result corresponding to each of the scene networks.
  3. The method according to claim 1, wherein the scene recognition network comprises an attention network comprising a plurality of scene identifiers respectively corresponding to different scenes, and inputting the extracted features into the scene recognition network for recognition to obtain the plurality of scene recognition results respectively corresponding to different scenes comprises:
    traversing the plurality of scene identifiers of the attention network according to the extracted features to obtain a scene recognition result corresponding to each of the scene identifiers.
  4. The method according to claim 3, wherein each of the scene identifiers corresponds to one sub-network in the attention network, and the scene identifier is a code or an activity value corresponding to the sub-network in the attention network.
  5. The method according to any one of claims 1 to 4, further comprising, before extracting the features of the scene data to be recognized:
    extracting training features according to a target scene identifier of a target scene, the training features comprising training features of training positive samples and training features of training negative samples; and
    training, according to the training features, a scene recognition network to be trained for the target scene to obtain the scene recognition network.
  6. The method according to claim 5, wherein the scene recognition network comprises a plurality of scene networks respectively corresponding to different scenes, and training the scene recognition network to be trained for the target scene according to the training features comprises:
    passing the training features through a network to be trained corresponding to the target scene to obtain a training recognition result corresponding to the network to be trained; and
    determining weights of the network to be trained according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, to obtain a trained scene network.
  7. The method according to claim 6, wherein the network to be trained is an existing scene network or a new scene network.
  8. The method according to claim 5, wherein the scene recognition network comprises an attention network comprising a plurality of scene identifiers respectively corresponding to different scenes, and training the scene recognition network to be trained for the target scene according to the training features comprises:
    inputting the training features and the target scene identifier into an attention network to be trained to obtain a training recognition result of the attention network to be trained corresponding to the target scene identifier; and
    determining, according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, weights of the attention network to be trained corresponding to the target scene identifier, to obtain a trained attention network corresponding to the target scene identifier.
  9. The method according to claim 5, wherein the training positive samples are scene files and the training negative samples are non-scene files.
  10. The method according to any one of claims 1 to 4, wherein the scene data comprises at least one of scene video data, scene picture data, and scene text data.
  11. A scene recognition system, comprising:
    a backbone network configured to extract features of scene data to be recognized; and
    a scene recognition network device comprising a scene recognition network, the scene recognition network device being configured to input the extracted features into the scene recognition network to obtain a plurality of scene recognition results respectively corresponding to different scenes.
  12. The system according to claim 11, wherein the scene recognition network comprises a plurality of scene networks respectively corresponding to different scenes, and the scene recognition network device is configured to pass the extracted features through each of the scene networks in parallel to obtain a scene recognition result corresponding to each of the scene networks.
  13. The system according to claim 11, wherein the scene recognition network comprises an attention network comprising sub-networks corresponding to different scene identifiers, the plurality of scene identifiers respectively corresponding to different scenes, and the scene recognition network device is configured to pass the extracted features through the sub-network corresponding to each scene identifier to obtain a scene recognition result corresponding to each scene identifier.
  14. The system according to claim 13, wherein the scene identifier is a code or an activity value corresponding to a sub-network in the attention network.
  15. The system according to any one of claims 11 to 14, further comprising:
    a positive sample device configured to output the scene data to be recognized to the backbone network.
  16. The system according to claim 15, further comprising:
    a scene identification device configured to obtain a target scene identifier of a target scene and output the target scene identifier to the backbone network; and
    a negative sample generator configured to output training negative samples to the backbone network;
    wherein the positive sample device is further configured to output training positive samples to the backbone network;
    the backbone network extracts training features according to the target scene identifier, the training features comprising training features of the training positive samples and training features of the training negative samples; and
    the scene recognition network device trains, according to the training features, a scene recognition network to be trained for the target scene to obtain the scene recognition network.
  17. The system according to claim 16, wherein the scene recognition network comprises a plurality of scene networks respectively corresponding to different scenes; and
    the scene recognition network device passes the training features through a network to be trained corresponding to the target scene to obtain a training recognition result corresponding to the network to be trained, and determines weights of the network to be trained according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, to obtain a trained scene network.
  18. The system according to claim 17, wherein the network to be trained is an existing scene network or a new scene network.
  19. The system according to claim 16, wherein the scene recognition network comprises an attention network comprising sub-networks corresponding to different scene identifiers, the plurality of scene identifiers respectively corresponding to different scenes; and
    the scene recognition network device inputs the training features and the target scene identifier into an attention network to be trained to obtain a training recognition result of the attention network to be trained corresponding to the target scene identifier, and determines, according to the training recognition result, the labels of the training positive samples, and the labels of the training negative samples, weights of the attention network to be trained corresponding to the target scene identifier, to obtain a trained attention network corresponding to the target scene identifier.
  20. The system according to any one of claims 11 to 14, wherein the backbone network is a deep neural network.
  21. The system according to claim 12, wherein each of the scene networks is a single fully connected layer or a multi-layer perceptron.
  22. An electronic device, comprising:
    one or more processors;
    a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the scene recognition method according to any one of claims 1 to 10; and
    one or more I/O interfaces connected between the processors and the memory and configured to enable information exchange between the processors and the memory.
  23. A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the scene recognition method according to any one of claims 1 to 10.
PCT/CN2021/104224, priority date 2020-07-02, filed 2021-07-02: Scene recognition method and system, and electronic device and medium (WO2022002242A1, en)

Applications Claiming Priority (4)

- CN202010633911.3, priority date 2020-07-02
- CN202010633894.3, priority date 2020-07-02
- CN202010633911.3A (published as CN111797763A), priority date 2020-07-02, filed 2020-07-02: Scene recognition method and system
- CN202010633894.3A (published as CN111797762A), priority date 2020-07-02, filed 2020-07-02: Scene recognition method and system

Publications (1)

- WO2022002242A1, published 2022-01-06

Family

ID=79317469

Family Applications (1)

- PCT/CN2021/104224 (WO2022002242A1), priority date 2020-07-02, filed 2021-07-02: Scene recognition method and system, and electronic device and medium

Country Status (1)

WO (1) WO2022002242A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114740751A (en) * 2022-06-15 2022-07-12 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN116170829A (en) * 2023-04-26 2023-05-26 浙江省公众信息产业有限公司 Operation and maintenance scene identification method and device for independent private network service
CN116528282A (en) * 2023-07-04 2023-08-01 亚信科技(中国)有限公司 Coverage scene recognition method, device, electronic equipment and readable storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102663448A (en) * 2012-03-07 2012-09-12 北京理工大学 Network based augmented reality object identification analysis method
CN105930794A (en) * 2016-04-20 2016-09-07 东北大学 Indoor scene identification method based on cloud computing
CN111104898A (en) * 2019-12-18 2020-05-05 武汉大学 Image scene classification method and device based on target semantics and attention mechanism
CN111797763A (en) * 2020-07-02 2020-10-20 北京灵汐科技有限公司 Scene recognition method and system
CN111797762A (en) * 2020-07-02 2020-10-20 北京灵汐科技有限公司 Scene recognition method and system

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN114740751A (en) * 2022-06-15 2022-07-12 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN114740751B (en) * 2022-06-15 2022-09-02 新缪斯(深圳)音乐科技产业发展有限公司 Music scene recognition method and system based on artificial intelligence
CN116170829A (en) * 2023-04-26 2023-05-26 浙江省公众信息产业有限公司 Operation and maintenance scene identification method and device for independent private network service
CN116528282A (en) * 2023-07-04 2023-08-01 亚信科技(中国)有限公司 Coverage scene recognition method, device, electronic equipment and readable storage medium
CN116528282B (en) * 2023-07-04 2023-09-22 亚信科技(中国)有限公司 Coverage scene recognition method, device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
WO2022002242A1 (en) Scene recognition method and system, and electronic device and medium
CN109977262B (en) Method and device for acquiring candidate segments from video and processing equipment
US10002290B2 (en) Learning device and learning method for object detection
CN109117879B (en) Image classification method, device and system
JP2016072964A (en) System and method for subject re-identification
CN106850338B (en) Semantic analysis-based R +1 type application layer protocol identification method and device
CN111797762A (en) Scene recognition method and system
JP2017062778A (en) Method and device for classifying object of image, and corresponding computer program product and computer-readable medium
Li et al. Domain adaption of vehicle detector based on convolutional neural networks
CN111291887A (en) Neural network training method, image recognition method, device and electronic equipment
US11380133B2 (en) Domain adaptation-based object recognition apparatus and method
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN109063790B (en) Object recognition model optimization method and device and electronic equipment
CN113012054A (en) Sample enhancement method and training method based on sectional drawing, system and electronic equipment thereof
Wang et al. Rethinking the learning paradigm for dynamic facial expression recognition
US11423262B2 (en) Automatically filtering out objects based on user preferences
CN111797763A (en) Scene recognition method and system
KR20200018154A (en) Acoustic information recognition method and system using semi-supervised learning based on variational auto encoder model
CN109145991B (en) Image group generation method, image group generation device and electronic equipment
Baba et al. Stray dogs behavior detection in urban area video surveillance streams
JP2009122829A (en) Information processing apparatus, information processing method, and program
WO2022228325A1 (en) Behavior detection method, electronic device, and computer readable storage medium
KR102050422B1 (en) Apparatus and method for recognizing character
Nguyen et al. Real-time smile detection using deep learning
CN110659631A (en) License plate recognition method and terminal equipment

Legal Events

- 121: EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21833875; country of ref document: EP; kind code of ref document: A1)
- NENP: non-entry into the national phase (ref country code: DE)
- 32PN: EP: public notification in the EP bulletin as the address of the addressee cannot be established (free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.04.2023))
- 122: EP: PCT application non-entry into the European phase (ref document number: 21833875; country of ref document: EP; kind code of ref document: A1)