WO2022002242A1

WO2022002242A1 - Scene recognition method and system, and electronic device and medium

Info

Publication number: WO2022002242A1
Application number: PCT/CN2021/104224
Authority: WO
Inventors: 吴臻志; 祝夭龙
Original assignee: 北京灵汐科技有限公司
Priority date: 2020-07-02
Filing date: 2021-07-02
Publication date: 2022-01-06

Abstract

Provided are a scene recognition method and system, and an electronic device and a computer-readable medium. The method comprises: extracting a feature of scene data to be recognized; and inputting the extracted feature into a scene recognition network for recognition, so as to obtain a plurality of scene recognition results respectively corresponding to different scenes. By means of the scene recognition method and system provided in the present application, a feature of extracted scene data is input into a scene recognition network which is capable of recognizing, for various scenes, whether the scene data is a corresponding scene, so as to obtain a plurality of scene recognition results respectively corresponding to different scenes. Compared with the related art whereby only the similarity between scene data and each scene can be obtained, the solution of the present application provides a higher-accuracy recognition result.

Description

A scene recognition method and system, electronic device and medium

technical field

The present application relates to the field of identification technologies, and in particular, to a scene identification method and system, an electronic device, and a computer-readable medium.

Background technique

Neural network refers to a mathematical model that applies a structure similar to the synaptic connections of the brain for information processing. Neural networks can be used to recognize scenes.

However, in some related technologies, the accuracy and flexibility of scene recognition by neural networks are poor.

SUMMARY OF THE INVENTION

The present application provides a scene identification method and system, an electronic device, and a computer-readable medium, which can realize accurate identification of various scenes.

To achieve the above purpose, an embodiment of the present application provides a scene recognition method, including: extracting features of scene data to be recognized; inputting the extracted features into a scene recognition network for recognition, and obtaining multiple scene recognition results corresponding to different scenes.

In order to achieve the above purpose, an embodiment of the present application provides a scene recognition system, including: a backbone network configured to extract features of scene data to be recognized; a scene recognition network device, including a scene recognition network, and the scene recognition network device is set to The extracted features are input into the scene recognition network, and multiple scene recognition results corresponding to different scenes are obtained.

To achieve the above purpose, an embodiment of the present application provides an electronic device, which includes:

one or more processors;

a memory on which one or more programs are stored, and when the one or more programs are executed by the one or more processors, so that the one or more processors implement any one of the embodiments of the present application scene recognition method;

One or more I/O interfaces, connected between the processor and the memory, are configured to realize the information interaction between the processor and the memory.

To achieve the above purpose, embodiments of the present application provide a computer-readable medium on which a computer program is stored, and when the program is executed by a processor, implements any one of the scene recognition methods in the embodiments of the present application.

A scene recognition method and system, an electronic device, and a computer-readable medium proposed in the present application, input the features of the extracted scene data into a scene recognition network capable of identifying whether the scene data is a corresponding scene for a variety of scenes. Recognition to obtain multiple scene recognition results corresponding to different scenes respectively. Compared with the related art, which can only obtain the similarity between scene data and each scene, the solution identification result of the present application is more accurate.

Description of drawings

1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application;

7 is a schematic flowchart of a scene recognition system provided by an embodiment of the present application;

8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application;

14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application.

detailed description

In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application. It should be noted that although the functional modules are divided in the schematic diagram of the device, and the logical sequence is shown in the flowchart, in some cases, the modules may be divided differently from the device, or executed in the order in the flowchart. steps shown or described.

The embodiments of the present application will be further described below with reference to the accompanying drawings.

As shown in FIG. 1 , FIG. 1 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 110 and a scene recognition network device 120 .

The backbone network 110 is configured to extract features of the scene data to be identified.

The backbone network is responsible for feature extraction of scene data. The scene data includes at least one of scene video data, scene picture data and scene text data. When the scene data is scene text data, the backbone network is a deep neural network pre-trained with text, and the scene data obtains vectors representing text features through the backbone network. When the scene data is scene video data or scene picture data, the backbone network is a deep neural network pre-trained with an image network (ImageNet), and the scene data obtains a vector representing image features through the backbone network. Optionally, the backbone network is a multi-layer deep neural network that removes the front network part of the last few fully connected layers.

In this embodiment, optionally, scene data is collected by a collection device such as a camera or a microphone, and the collected scene data is stored in the memory.

The scene identification network device 120 includes a scene identification network, and the scene identification network device 120 is configured to input the extracted features into the scene identification network to obtain multiple scene identification results corresponding to different scenes.

The scene recognition network can identify whether the scene data is a corresponding scene for a variety of scenes.

Using the solution of this embodiment, the features of the extracted scene data are input into the scene recognition network for identification, and multiple scene identification results corresponding to different scenes are obtained, and the scene identification results can represent whether the scene is a corresponding scene. Compared with the related art, which can only obtain the similarity between the scene data and each scene, the solution identification result of this embodiment has higher accuracy.

As shown in FIG. 2 , FIG. 2 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 210 and a scene recognition network device 220 . The scene identification network device 220 includes a scene identification network, and the scene identification network includes a plurality of scene networks corresponding to different scenes, for example, scene network 1, scene network 2, and scene network 3 in FIG. 2 .

The scene recognition network device 220 is set so that the extracted features pass through each scene network in parallel, and respectively obtain the scene recognition result corresponding to each scene network.

Each scene network can be set as a single-layer fully connected or multi-layer perceptron (MLP), and each scene network is called a head. There can be multiple heads that exist in parallel without affecting each other, or new heads can be added. Each head outputs a binary classification, that is, whether the scene data corresponds to the scene corresponding to the scene network. A scene recognition network composed of multiple scene networks can also be called a multi-head network.

With the solution of the related art, for the scene data N, the scene recognition result output by the neural network is the approximation of each scene, rather than whether it is the exact result of which scene. For example, the approximation of the scene data N and the scene A is 40%, the similarity between scene data N and scene B is 30%, and the similarity between scene data N and scene C is 30%, and the recognition accuracy is poor. Using the solution of this embodiment, the extracted features are passed through different scene networks in parallel, and the scene recognition results corresponding to each scene network are obtained respectively. For example, for scene data N, scene network 1 outputs identification result 1, indicating that scene data N and The scene corresponding to the scene network 1 is similar, the scene network 2 outputs the recognition result 0, indicating that the scene data N is not similar to the scene corresponding to the scene network 2, and the scene network 3 outputs the recognition result 0, indicating that the scene data N and the scene corresponding to the scene network 3 are not similar. Therefore, it is clear that the scene data N is the scene data corresponding to the scene network 1, and the recognition result is more accurate.

As shown in FIG. 3 , FIG. 3 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a backbone network 310 and a scene recognition network device 320 . The scene identification network device 320 includes a scene identification network, the scene identification network includes an attention network, the attention network includes subnets corresponding to different scene identifiers, and multiple scene identifiers respectively correspond to different scenes.

The scene recognition network device 320 is configured to obtain the scene recognition results corresponding to each scene identifier through the extracted features through the subnets corresponding to each scene identifier.

The attention network is a kind of gated network. For each attention input (in this embodiment, a scene identifier), some neural network nodes are connected, and the connected neural network nodes form a sub-network. The form of attention input can be one-hot encoding or activity value. For example, the form of attention input is one-hot encoding, the scene identifier of scene A is [1,0], the corresponding gated branch A is turned on (subnet A works), and the gated branch B is turned off, at this time the attention In the network, the neurons controlled by the gated branch A are in working state, and the neurons controlled by the gated branch B are inhibited (no output is produced regardless of the input situation). The scene identifier of scene B is [0,1], the corresponding gated branch B is turned on (subnet B works), and the gated branch A is turned off. At this time, the neurons controlled by the gated branch B in the attention network are in In the working state, the neurons controlled by gated branch A are inhibited (no output is produced regardless of the input situation). Alternatively, the gated input is a set of values, and each value is used for the activation activity of one gated branch. For example, the activity of gated branch A is 0.2, the activity of gated branch B is 0.8, and the gated branch If the input is [0.2, 0.8], the corresponding gated branch B is turned on (subnet B works), and the gated branch A is turned off.

Using the solution of the related art, for the scene data N, the scene recognition result output by the neural network is the approximation degree to each scene, rather than whether it is the exact result of which scene, for example, the approximation degree to scene A is 40% , the similarity with scene B is 30%, the similarity with scene C is 30%, and the recognition accuracy is poor. Using the solution of this embodiment, the scene identification results corresponding to each scene identification will be obtained through the subnets corresponding to different scene identifications. For example, for scene data N, subnet A outputs identification result 1, indicating that scene data N and Scenario A is similar, subnet B outputs a recognition result of 0, indicating that scene data N is not similar to scene B, and subnet C outputs a recognition result of 0, indicating that scene data N is not similar to scene C, and the recognition result is more accurate. Therefore, it is clear that the scene data N is the scene data corresponding to the subnet A, and the accuracy of the identification result is higher.

As shown in FIG. 4 , FIG. 4 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 410 , a backbone network 420 and a scene device network device 430 .

The positive sample device 410 is configured to output scene data to be identified to the backbone network.

The positive sample device collects the data of the current scene, obtains text data, image data or video data, and waits to identify the scene data.

The backbone network 420 is configured to extract features of the scene data to be identified.

The scene identification network device 430 includes a scene identification network, and the scene identification network device 430 is configured to input the extracted features into the scene identification network to obtain multiple scene identification results corresponding to different scenes.

As shown in FIG. 5 , FIG. 5 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 510 , a negative sample generator 520 , a scene identification device 530 , a backbone network 540 and a scene identification network device 550 .

The positive sample device 510 is configured to output training positive samples to the backbone network.

The negative sample generator 520 is configured to output training negative samples to the backbone network.

The training positive samples are selected scene files, and the training negative samples are other scene files except the selected scene. The difference between scene files and scene data is that: scene data refers to collected scene data directly stored in a storage space (eg, memory), and scene files are an ordered collection of scene data. For example, read the data of 128 sectors from 0 to 127 in the memory, or read the first 128 bytes of the tellme.txt file in the X directory in the memory.

The scene identification device 530 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network. The target scene identification is set to identify the selected scene, and the selected scene is the target scene.

The backbone network 540 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples. The scene recognition network device 550 is configured to train the scene recognition network to be trained for the target scene according to the training feature to obtain the scene recognition network.

In this embodiment, the target scene may be a newly-added scene, and the training-to-be-trained scene recognition network may be trained for the newly-added scene to obtain the ability to identify whether the scene data is the newly-added scene and whether the scene data is the original scene. The scene recognition network of the scene; the target scene can also be an existing scene, and the scene recognition network to be trained can be trained for the existing scene, so as to update the recognition function for the existing scene, and the recognition function for other scenes. constant. It is more convenient and quicker to use the solution of this embodiment to train the scene recognition network.

As shown in FIG. 6 , FIG. 6 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 610 , a negative sample generator 620 , a scene identification device 630 , a backbone network 640 and a scene identification network device 650 .

The positive sample device 610 is configured to output training positive samples to the backbone network.

The negative sample generator 620 is configured to output training negative samples to the backbone network.

The scene identification device 630 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network. The target scene identification is set to identify the selected scene, and the selected scene is the target scene.

The backbone network 640 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.

The scene recognition network device 650 includes a plurality of scene networks and new scene networks corresponding to different scenes respectively, and the target scene identifier of the target scene corresponds to the new scene network; the scene recognition network device 650 is configured to train positive samples and training negative samples for training The feature passes through the new scene network to obtain the training recognition result corresponding to the new scene network; according to the training recognition result corresponding to the new scene network, the label of the training positive sample and the label of the training negative sample, determine the weight of the new scene network, and obtain The trained scene network.

Or, the scene recognition network device 650 includes a plurality of scene networks corresponding to different scenes respectively, and the target scene identifier of the target scene corresponds to an existing scene network in the scene recognition network device; the scene recognition network device 650 is set to pass the training feature through The existing scene network is used to obtain the training identification result corresponding to the existing scene network; according to the training identification result corresponding to the existing scene network, the label of the training positive sample and the label of the training negative sample, the weight of the existing scene network is updated, and the updated scene network.

Optionally, the multi-head network device may be instructed to identify scene data, train a new scene network, or update an existing scene network by means of button triggering, button triggering, or sending an instruction.

In the related art, when a new scene recognition function needs to be added, the neural network is retrained according to the samples corresponding to the original scene recognition function and the samples corresponding to the new scene recognition function. For example, the original neural network can recognize the scene A, but cannot recognize the scene. B. When it is necessary to increase the recognition of scene B, the neural network is retrained according to the samples of scene A and scene B, so that the similarity between scene data and scene A and scene B can be identified, for example, the similarity between scene data and scene A is 30%, and the similarity with scene B is 60%. With the solution of this embodiment, when the scene recognition network device needs to add a new scene recognition function, there is no need to retrain the entire scene recognition network, but only the new scene network can be trained. The training is convenient and fast, and the recognition is flexible and accurate.

In the related art, when the scene recognition function needs to be updated, the neural network is retrained according to the samples corresponding to the scene recognition function that need to be updated and the samples corresponding to other scene recognition functions that do not need to be updated. For example, the original neural network can recognize the scene A and the scene. B. If the ability to recognize scene B needs to be updated, the neural network is retrained according to the samples of scene A and the updated scene B. With the solution of this embodiment, when the scene recognition network device needs to update the scene recognition function, there is no need to retrain the entire scene recognition network, and only the scene network that needs to be updated can be retrained, which is convenient and quick to update.

As shown in FIG. 7 , FIG. 7 is a schematic structural diagram of a scene recognition system provided by an embodiment of the present application. The system includes, but is not limited to, a positive sample device 710 , a negative sample generator 720 , a scene identification device 730 , a backbone network 740 and a scene identification network device 750 . The scene identification network device 750 includes a scene identification network, the scene identification network includes an attention network, the attention network includes subnetworks corresponding to different scene identifiers, and multiple scene identifiers correspond to different scenes respectively.

The positive sample device 710 is configured to output training positive samples to the backbone network.

The negative sample generator 720 is configured to output training negative samples to the backbone network.

The scene identification device 730 is configured to obtain the target scene identification of the target scene, and output the target scene identification to the backbone network. The target scene identification is set to identify the selected scene, and the selected scene is the target scene.

The backbone network 740 is configured to extract training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples.

The scene recognition network device 750 is configured to input the training feature and the target scene identifier into the attention network to be trained, and obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier; according to the training recognition result, the label of the training positive sample and the training The label of the negative sample determines the weight of the attention network to be trained corresponding to the target scene identification, and obtains the trained attention network corresponding to the target scene identification.

Wherein, the subnet corresponding to the target scene identifier acquired by the scene identification device may be a new subnet, that is, the training process is a training process of a new subnet (new scene); the subnet corresponding to the target scene identifier acquired by the scene identification device may be Existing subnet, that is, the training process is an update process of the existing subnet (existing scene).

Optionally, the attention network can be instructed to recognize scene data, train a new scene network, or update an existing scene network by means of button triggering, button triggering, or sending an instruction.

In the related art, when a new scene recognition function needs to be added, the neural network is retrained according to the samples corresponding to the original scene recognition function and the samples corresponding to the new scene recognition function. For example, the original neural network can recognize the scene A, but cannot recognize the scene. B. When it is necessary to increase the recognition of scene B, the neural network is retrained according to the samples of scene A and scene B, so that the similarity between scene data and scene A and scene B can be identified, for example, the similarity between scene data and scene A is 30%, and the similarity with scene B is 60%. With the solution of this embodiment, when the attention network needs to add a new scene recognition function, there is no need to retrain the entire attention network, and only the sub-network corresponding to the new scene can be trained. The training is convenient and fast, and the recognition is flexible and accurate.

In the related art, when the scene recognition function needs to be updated, the neural network is retrained according to the samples corresponding to the scene recognition function that need to be updated and the samples corresponding to other scene recognition functions that do not need to be updated. For example, the original neural network can recognize the scene A and the scene. B. If the ability to recognize scene B needs to be updated, the neural network is retrained according to the samples of scene A and the updated scene B. With the solution of this embodiment, when the attention network needs to update the scene recognition function, there is no need to retrain the entire attention network, and only the scene subnet that needs to be updated can be retrained, which is convenient and quick to update.

As shown in FIG. 8 , FIG. 8 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The method includes but is not limited to step S110 and step S120.

Step S110, extracting features of the scene data to be identified.

The scene data includes at least one of scene video data, scene picture data and scene text data. Optionally, the size of the scene data to be recognized can be 64*64*3. Compared with the scene data with the size of 32*32*3, the scene data with the size of 64*64*3 has a higher resolution and reduces the dimension. It's clearer after processing.

Step S120: Input the extracted features into a scene recognition network for recognition, and obtain multiple scene recognition results corresponding to different scenes.

As shown in FIG. 9 , FIG. 9 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes multiple scene networks respectively corresponding to different scenes. The method includes but is not limited to step S210 and step S220.

Step S210, extracting features of the scene data to be identified.

Step S220: Pass the extracted features through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.

Using the solution of the related art, for the scene data N, the scene recognition result output by the neural network is the similarity with each scene, rather than whether it is the exact result of which scene, for example, the similarity with scene A is 40%, The similarity with scene B is 30%, the similarity with scene C is 30%, and the recognition accuracy is poor. Using the solution of this embodiment, the extracted features are passed through different scene networks in parallel, and the scene recognition results corresponding to each scene network are obtained respectively. For example, for scene data N, scene network 1 outputs identification result 1, indicating that scene data N and The scene corresponding to the scene network 1 is similar, the scene network 2 outputs the recognition result 0, indicating that the scene data N is not similar to the scene corresponding to the scene network 2, and the scene network 3 outputs the recognition result 0, indicating that the scene data N and the scene corresponding to the scene network 3 are not similar. Therefore, it is clear that the scene data N is the scene data corresponding to the scene network 1, and the recognition result is more accurate.

As shown in FIG. 10 , FIG. 10 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers respectively corresponding to different scenes. The method includes but is not limited to step S310 and step S320.

Step S310, extracting features of the scene data to be identified.

Step S320 , traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.

Using the solution of the related art, for the scene data N, the scene recognition result output by the neural network is the similarity with each scene, rather than whether it is the exact result of which scene, for example, the similarity with scene A is 40%, The similarity with scene B is 30%, the similarity with scene C is 30%, and the recognition accuracy is poor. Using the solution of this embodiment, the scene identification results corresponding to each scene identification will be obtained through the subnets corresponding to different scene identifications. For example, for scene data N, subnet A outputs identification result 1, indicating that scene data N and Scenario A is similar, subnet B outputs a recognition result of 0, indicating that scene data N is not similar to scene B, and subnet C outputs a recognition result of 0, indicating that scene data N is not similar to scene C, and the recognition result is more accurate. Therefore, it is clear that the scene data N is the scene data corresponding to the subnet A, and the accuracy of the identification result is higher.

As shown in FIG. 11 , FIG. 11 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The method includes but is not limited to step S410, step S420, step S430, and step S440.

Step S410: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.

The training positive samples are selected scene files, and the training negative samples are other scene files except the selected scene.

Step S420: According to the training feature, train the scene recognition network to be trained for the target scene to obtain the scene recognition network.

Step S430, extracting features of the scene data to be identified.

Step S440: Input the extracted features into a scene recognition network for recognition, and obtain multiple scene recognition results corresponding to different scenes.

As shown in FIG. 12 , FIG. 12 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes multiple scene networks respectively corresponding to different scenes. The method includes but is not limited to step 510, step 520, step 530, step S540 and step S550.

Step 510: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.

Step 520: Pass the training feature through the network to be trained corresponding to the target scene to obtain a training identification result corresponding to the network to be trained.

The network to be trained is an existing scene network or a new scene network.

Step 530: Determine the weight of the network to be trained according to the training recognition result, the label of the training positive sample and the label of the training negative sample, and obtain the trained scene network.

The training mechanism of the network to be trained is as follows, among which:

Y _pr is the obtained output, Y _gt is the correct output, W is the weight, X is the input, σ is the activation function (sigmoid), and η is a constant.

Y _pr =σ(WX), where WX≡Z

The weight update amount is (W=W+ΔW):

Step S540, extracting features of the scene data to be identified.

Step S550: Pass the extracted features through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.

In the related art, when a new scene recognition function needs to be added, the neural network is retrained according to the samples corresponding to the original scene recognition function and the samples corresponding to the new scene recognition function. For example, the original neural network can recognize the scene A, but cannot recognize the scene. B. When it is necessary to increase the recognition of scene B, the neural network is retrained according to the samples of scene A and scene B, so that the similarity between scene data and scene A and scene B can be identified, for example, the similarity between scene data and scene A is 30%, and the similarity with scene B is 60%. With the solution of this embodiment, when a new scene recognition function needs to be added, only the new scene network can be trained, the training is convenient and fast, and the recognition is flexible and accurate.

In the related art, when the scene recognition function needs to be updated, the neural network is retrained according to the samples corresponding to the scene recognition function that need to be updated and the samples corresponding to other scene recognition functions that do not need to be updated. For example, the original neural network can recognize the scene A and the scene. B. If the ability to recognize scene B needs to be updated, the neural network is retrained according to the samples of scene A and the updated scene B. With the solution of this embodiment, when the scene recognition function needs to be updated, only the scene network that needs to be updated (existing scene network) needs to be retrained, and the update is convenient and fast.

As shown in FIG. 13 , FIG. 13 is a schematic flowchart of a scene recognition method provided by an embodiment of the present application. The scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers respectively corresponding to different scenes. The method includes, but is not limited to, steps S610, S620, S630, S640, and S650.

Step S610: Extract training features according to the target scene identifier of the target scene, where the training features include training features for training positive samples and training features for training negative samples.

Step S620: Input the training feature and the target scene identifier into the attention network to be trained, and obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier.

The target scene identifier corresponds to an existing subnet or a new subnet in the network to be trained.

Step S630, according to the training recognition result, the label of the training positive sample and the label of the training negative sample, determine the weight of the attention network to be trained corresponding to the target scene identifier, and obtain the corresponding target scene identifier. The trained attention network.

Step S640, extracting features of the scene data to be identified.

Step S650 , traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.

As shown in FIG. 14 , FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic equipment includes:

one or more processors 810;

the memory 820, on which one or more programs are stored, and when the one or more programs are executed by one or more processors, the one or more processors implement any one of the scene recognition methods in the embodiments of the present application;

One or more I/O interfaces 830, connected between the processor and the memory, are configured to realize the information exchange between the processor and the memory.

The processor 810 is a device with data processing capability, including but not limited to a central processing unit (CPU), etc.; the memory 820 is a device with data storage capability, including but not limited to random access memory (RAM, more specifically Such as SDRAM, DDR, etc.), read only memory (ROM), electrified erasable programmable read only memory (EEPROM), flash memory (FLASH); the I/O interface (read and write interface) 830 is connected between the processor 810 and the memory 820 , which can realize the information interaction between the processor 810 and the memory 820, which includes but is not limited to a data bus (Bus) and the like.

In some embodiments, processor 810, memory 820, and I/O interface 830 are interconnected by bus 840, which in turn is connected to other components of the computing device.

As shown in FIG. 15 , FIG. 15 is a schematic structural diagram of a computer-readable medium provided by an embodiment of the present application. The computer-readable medium has a computer program stored thereon, and when the program is executed by the processor, any one of the scene recognition methods in the embodiments of the present application is implemented.

From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software and necessary general-purpose hardware, and of course can also be implemented by hardware, but in many cases the former is a better implementation manner . Based on such understanding, the technical solutions of the present application can be embodied in the form of software products in essence or the parts that make contributions to related technologies, and the computer software products can be stored in a computer-readable storage medium, such as a computer floppy disk, Read-Only Memory (ROM), Random Access Memory (RAM), flash memory (FLASH), hard disk or optical disk, etc., including several instructions to make a computer device (which can be a personal computer, A server, or a network device, etc.) executes the methods described in the various embodiments of the present application.

The above descriptions are merely exemplary embodiments of the present application, and are not intended to limit the protection scope of the present application.

In general, the various embodiments of the present application may be implemented in hardware or special purpose circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software that may be executed by a controller, microprocessor or other computing device, although the application is not limited thereto.

Embodiments of the present application may be implemented by the execution of computer program instructions by a data processor of a mobile device, eg in a processor entity, or by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state setting data, or source code written in any combination of one or more programming languages or object code.

The block diagrams of any logic flow in the figures of the present application may represent program steps, or may represent interconnected logic circuits, modules and functions, or may represent a combination of program steps and logic circuits, modules and functions. Computer programs can be stored on memory. The memory may be of any type suitable for the local technical environment and may be implemented using any suitable data storage technology, such as but not limited to read only memory (ROM), random access memory (RAM), optical memory devices and systems (Digital Versatile Discs). DVD or CD disc) etc. Computer-readable media may include non-transitory storage media. The data processor may be of any type suitable for the local technical environment, such as, but not limited to, a general purpose computer, special purpose computer, microprocessor, digital signal processor (DSP), application specific integrated circuit (ASIC), programmable logic device (FPGA) and processors based on multi-core processor architectures.

The foregoing has provided a detailed description of exemplary embodiments of the present application, by way of illustrative and non-limiting example. However, when considered in conjunction with the accompanying drawings and claims, various modifications and adjustments to the above embodiments will be apparent to those skilled in the art without departing from the scope of the present invention. Accordingly, the proper scope of the invention will be determined with reference to the appended claims.

Claims

A scene recognition method, characterized in that: comprising:

Extract the features of the scene data to be identified;

The extracted features are input into the scene recognition network for recognition, and multiple scene recognition results corresponding to different scenes are obtained.
The method according to claim 1, wherein the scene recognition network comprises a plurality of scene networks corresponding to different scenes; the extracted features are input into the scene recognition network for recognition, and a plurality of scene recognition networks corresponding to different scenes are obtained. The resulting steps include:

The extracted features are passed through each of the scene networks in parallel to obtain scene recognition results corresponding to each of the scene networks.
The method according to claim 1, wherein the scene recognition network comprises an attention network, and the attention network comprises a plurality of scene identifiers corresponding to different scenes; the extracted features are input into the scene recognition network for recognition, The steps of obtaining multiple scene recognition results corresponding to different scenes respectively include:

Traverse a plurality of the scene identifiers of the attention network according to the extracted features, and obtain a scene recognition result corresponding to each of the scene identifiers.
The method according to claim 3, wherein one scene identifier corresponds to a subnet in the attention network; and the scene identifier is a code or activity value corresponding to a subnet in the attention network .
The method according to any one of claims 1 to 4, characterized in that, before the step of extracting the feature of the scene data to be identified, the method further comprises:

Extracting training features according to the target scene identifier of the target scene, the training features include training features for training positive samples and training features for training negative samples;

According to the training feature, the scene recognition network to be trained is trained for the target scene to obtain the scene recognition network.
The method according to claim 5, wherein the scene recognition network comprises a plurality of scene networks respectively corresponding to different scenes; according to the training feature, the scene recognition network to be trained is trained for the target scene to obtain the The steps of describing the scene recognition network include:

Passing the training feature through the network to be trained corresponding to the target scene to obtain a training identification result corresponding to the network to be trained;

According to the training recognition result, the labels of the training positive samples and the labels of the training negative samples, the weight of the network to be trained is determined, and the trained scene network is obtained.
The method according to claim 6, wherein the network to be trained is an existing scene network or a new scene network.
The method according to claim 5, wherein the scene recognition network includes an attention network, and the attention network includes a plurality of scene identifiers corresponding to different scenes; according to the training feature, for the target scene The scene recognition network to be trained is trained, and the steps of obtaining the scene recognition network include:

Inputting the training feature and the target scene identifier into the attention network to be trained, to obtain the training recognition result of the attention network to be trained corresponding to the target scene identifier;

According to the training recognition result, the label of the training positive sample and the label of the training negative sample, determine the weight of the attention network to be trained corresponding to the target scene identifier, and obtain the training result corresponding to the target scene identifier attention network.
The method according to claim 5, wherein the training positive samples are scene files, and the training negative samples are non-scene files.
The method according to any one of claims 1 to 4, wherein the scene data includes at least one of scene video data, scene picture data and scene text data.
A scene recognition system, characterized in that it includes:

The backbone network is set to extract the features of the scene data to be recognized;

The scene recognition network device includes a scene recognition network, and the scene recognition network device is configured to input the extracted features into the scene recognition network to obtain a plurality of scene recognition results corresponding to different scenes.
The system according to claim 11, wherein the scene recognition network comprises a plurality of scene networks corresponding to different scenes; the scene recognition network device is configured to pass the extracted features through each of the scene networks in parallel, respectively. A scene recognition result corresponding to each of the scene networks is obtained.
The system according to claim 11, wherein the scene recognition network includes an attention network, and the attention network includes subnetworks corresponding to different scene identifiers, and a plurality of the scene identifiers correspond to different scenes respectively; the The scene recognition network device is configured to pass the extracted features through the subnets corresponding to each of the scene identifiers to obtain scene recognition results corresponding to each of the scene identifiers.
The system according to claim 13, wherein the scene identifier is a code or an activity value corresponding to a subnet in the attention network.
The system according to any one of claims 11 to 14, further comprising:

The positive sample device is configured to output scene data to be identified to the backbone network.
The system of claim 15, further comprising:

a scene identification device, configured to obtain a target scene identification of a target scene, and output the target scene identification to the backbone network;

a negative sample generator, configured to output training negative samples to the backbone network;

The positive sample device is further configured to output training positive samples to the backbone network;

The backbone network extracts training features according to the target scene identifier, and the training features include training features for training positive samples and training features for training negative samples;

The scene recognition network device trains the scene recognition network to be trained for the target scene according to the training feature to obtain the scene recognition network.
The system according to claim 16, wherein the scene recognition network comprises a plurality of scene networks corresponding to different scenes;

The scene recognition network device passes the training feature through the network to be trained corresponding to the target scene to obtain a training recognition result corresponding to the network to be trained; according to the training recognition result corresponding to the network to be trained, the training positive The label of the sample and the label of the training negative sample are used to determine the weight of the network to be trained, and the trained scene network is obtained.
The system according to claim 17, wherein the network to be trained is an existing scene network or a new scene network.
The system according to claim 16, wherein the scene recognition network comprises an attention network, and the attention network comprises subnetworks corresponding to different scene identifiers, and a plurality of the scene identifiers correspond to different scenes respectively;

The scene recognition network device inputs the training feature and the target scene identifier into the attention network to be trained, and obtains the training recognition result of the attention network to be trained corresponding to the target scene identifier; The label of the training positive sample and the label of the training negative sample are determined, the weight of the attention network to be trained corresponding to the target scene identifier is determined, and the trained attention network corresponding to the target scene identifier is obtained.
The system according to any one of claims 11 to 14, wherein the backbone network is configured as a deep neural network.
The system according to claim 12, wherein the scene network is set as a single-layer fully connected or a multi-layer perceptron.
An electronic device comprising:

one or more processors;

a memory having stored thereon one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of claims 1 to 10 The scene recognition method described in item;

One or more I/O interfaces, connected between the processor and the memory, are configured to realize the information interaction between the processor and the memory.
A computer-readable medium having a computer program stored thereon, the program implementing the scene recognition method according to any one of claims 1 to 10 when the program is executed by a processor.