CN116152645A - Indoor scene visual recognition method and system integrating multiple characterization balance strategies - Google Patents
Info
- Publication number
- CN116152645A (application CN202310157638.5A)
- Authority
- CN
- China
- Prior art keywords
- class
- training
- model
- samples
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06V20/36—Indoor scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an indoor scene visual recognition method and system integrating multiple characterization balance strategies, comprising the following steps: calculating the class center of each class in a long-tail training set with a warmed-up model; constructing a plurality of training subsets with different feature distributions through different resampling strategies; and training the warmed-up model with these subsets, in combination with a custom loss function, until the loss function converges, so that the model tends to learn features balanced across the training subsets, which addresses the feature imbalance inside each class of the training set. Meanwhile, a regularization term is applied to the classifier to adjust the weight difference between head and tail classes; the trained model is obtained once the loss function has sufficiently converged, reducing the imbalance of the per-class classifier weights caused by imbalanced class sample counts in the training set. The invention thereby addresses the model-training problems caused both by imbalanced class sample counts and by imbalanced non-class attributes of samples within a class.
Description
Technical Field
The invention relates to the field of long-tail visual recognition and balanced characterization learning, and in particular to an indoor scene visual recognition method and system integrating multiple characterization balance strategies.
Background
Long-tail visual recognition is one of the most challenging and critical problems in computer vision, because any naturally collected dataset exhibits, to some degree, a long-tail distribution imbalance; this is often overlooked and quietly affects model training. Previous studies have focused on the imbalance between categories, which in most computer vision tasks has been resolved manually: most publicly available datasets are manually class-balanced after natural collection, so that, outside the dedicated long-tail research field, public datasets are largely class-balanced. This does not make research on class rebalancing meaningless; on the contrary, such research would remove the need for the manual class-balancing step that follows natural data collection, reducing the time and labor cost of that step. Beyond the class-balancing problem, there is also a class of problems that has not previously been addressed, namely long-tail imbalance within a class, visible in some common phenomena: samples within the same class behave inconsistently and exhibit a long-tail distribution; and in visual recognition tasks, some tail-class samples are predicted as head classes that share similar attributes.
In existing long-tail visual recognition tasks, particularly for indoor scenes (classrooms, canteens, shopping malls, and the like), the sample distribution across object categories in an indoor space follows a long-tail distribution, so a model trained directly on the collected training set cannot match the distribution of a class-balanced test set. Nor does high attribute richness within a class mean that potential confounding factors are avoided; on the contrary, such samples are more susceptible to them. The present method therefore focuses on obtaining robust, balanced features for visual recognition tasks by training on long-tail data collected in narrow spaces such as indoor environments.
The long-tail visual recognition task aims to improve the performance of a model trained on a given long-tail training set under a class-balanced evaluation protocol. The most obvious confounding factor in a long-tail dataset is the category itself, so the category is de-confounded first. The key concern is how to use imbalanced data effectively, reducing data-collection cost while training a more balanced model. Rebalancing methods for inter-category imbalance generally fall into four categories. The first is resampling the training dataset, for example downsampling the head classes and upsampling the tail classes; this under-utilizes part of the data when downsampling head classes, and when upsampling tail classes the new sample distribution may deviate from the original one. The second is re-weighting, i.e., adjusting the loss function during training; because loss-function computation is flexible and simple, this method is applied in many tasks that would otherwise require complex modeling. The third is transfer learning, which exploits the imbalance of the long-tail distribution by first learning head-class samples thoroughly and then transferring that knowledge to tail-class feature learning, for example using head-class distribution information for tail-class sample augmentation; such models tend to be complex. The fourth is model ensembling, which aims to improve the performance of both head and tail classes of a long-tail training set simultaneously through multiple sub-models.
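As a concrete illustration of the second (re-weighting) category, the following sketch scales a cross-entropy loss by inverse class frequency. This is a generic example of the technique, not the loss function used by the invention, and all names are hypothetical.

```python
import numpy as np

# Toy long-tail class counts: class 0 is the head, class 2 the tail.
class_counts = np.array([1000, 100, 10])

# Inverse-frequency weights, a common generic re-weighting rule
# (illustrative only; not the invention's custom loss).
weights = class_counts.sum() / (len(class_counts) * class_counts)

def weighted_cross_entropy(probs, label):
    """Cross-entropy of one sample, scaled by its class weight."""
    return -weights[label] * np.log(probs[label])

# At equal predicted confidence, a tail-class sample contributes
# a much larger loss than a head-class sample.
loss_head = weighted_cross_entropy(np.array([0.5, 0.3, 0.2]), 0)
loss_tail = weighted_cross_entropy(np.array([0.2, 0.3, 0.5]), 2)
```

With these weights the optimizer is pushed to reduce tail-class errors despite their scarcity, which is the core idea of loss re-weighting.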
In addition, much existing research targets inter-class imbalance by training a classifier that counteracts it: such a classifier tends to raise the confidence of tail classes and suppress that of head classes, correcting the model's tendency to predict tail-class samples as head classes and balancing head- and tail-class performance. However, even if the number of samples per class is uniform, samples within the same class may still be imbalanced owing to an uneven distribution of attributes, which degrades the recognition performance of the model.
Disclosure of Invention
The invention aims to address the balanced-characterization learning problem in long-tail training sets collected from indoor scenes, and in particular targets the previously ignored intra-class imbalance problem, providing an indoor scene visual recognition method and system that integrates multiple characterization balance strategies.
The aim of the invention is realized by the following technical scheme:
In a first aspect, an indoor scene visual recognition method fusing multiple characterization balance strategies is provided, the method comprising:
S1, sampling to obtain a long-tail training set;
S2, warming up a model and customizing a loss function;
S3, calculating the class center of each class in the long-tail training set by using the warmed-up model;
S4, constructing a plurality of training subsets with different feature distributions through different resampling strategies;
S5, training the warmed-up model with the training subsets, in combination with the custom loss function, until the loss function converges, so that the model tends to learn features balanced across the training subsets;
S6, applying a regularization term to the classifier of the model trained in step S5 to adjust the weight difference between head and tail classes, obtaining the trained model after the loss function has sufficiently converged;
S7, performing visual recognition with the trained model.
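Steps S1–S7 can be sketched as a generic control flow; every callable below is a placeholder to be supplied by an implementation, not a name from the patent.

```python
def run_recognition_pipeline(sample_training_set, warm_up, class_centers,
                             build_subsets, train_step, tune_classifier,
                             predict, rounds=3):
    """Hypothetical skeleton of steps S1-S7 (all callables are placeholders)."""
    data = sample_training_set()                  # S1: long-tail training set
    model = warm_up(data)                         # S2: warm-up + custom loss
    for _ in range(rounds):                       # S5: repeat until convergence
        centers = class_centers(model, data)      # S3: per-class centers
        subsets = build_subsets(data, model)      # S4: resampled subsets
        model = train_step(model, subsets, centers)
    model = tune_classifier(model, data)          # S6: regularized classifier
    return predict(model)                         # S7: recognition

# Dummy wiring that only demonstrates the control flow (not a real model).
result = run_recognition_pipeline(
    sample_training_set=lambda: [1, 2, 3],
    warm_up=lambda data: 0,
    class_centers=lambda model, data: {},
    build_subsets=lambda data, model: [],
    train_step=lambda model, subsets, centers: model + 1,
    tune_classifier=lambda model, data: model * 10,
    predict=lambda model: model,
)
```

In a real system the fixed `rounds` loop would be replaced by a convergence test on the loss, as the claims require.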
As an optional aspect of the indoor scene visual recognition method integrating multiple characterization balance strategies, constructing a plurality of training subsets with different feature distributions through different resampling strategies includes:
applying each of several different resampling modes to every category in the long-tail training set to obtain a number of new small subsets, and then merging all small subsets produced by the same resampling mode to obtain a plurality of large training subsets.
As an optional aspect of the indoor scene visual recognition method integrating multiple characterization balance strategies, the different resampling modes include:
one mode that samples the samples in each class with equal weight, and another that samples the samples in each class with weights following the Pareto (80/20) rule.
As a preferable option of the indoor scene visual recognition method integrating multiple characterization balance strategies, sampling the samples in each class according to the Pareto (80/20) rule includes:
upsampling one portion of the samples in the current class until the twenty percent of samples with the lowest prediction confidence in the class reach eighty percent of the number of samples in the original set, while the other portion of samples in the current class is downsampled to twenty percent of the original set.
As a preferable option of the indoor scene visual recognition method integrating multiple characterization balance strategies, the upsampling method used is MixUp data enhancement.
As an optional aspect of the indoor scene visual recognition method integrating multiple characterization balance strategies, training the warmed-up model with the training subsets until the loss function converges includes:
periodically repeating step S4 to reconstruct the training subsets and training the model with the reconstructed training subsets.
As a preferable option of the indoor scene visual recognition method integrating multiple characterization balance strategies, periodically refers to repeating every 20 epochs.
As a preferable option of the indoor scene visual recognition method integrating multiple characterization balance strategies, the class center of each class is updated before each repetition.
As an optional aspect of the indoor scene visual recognition method fusing multiple characterization balance strategies, step S6 includes:
randomly initializing the parameters of the model's classifier and periodically adjusting the classifier alone using a training subset obtained by resampling.
In a second aspect, an indoor scene visual recognition system incorporating multiple characterization balance strategies is provided, the system comprising:
a data acquisition module, used for sampling to obtain a long-tail training set;
a model warm-up module, used for warming up the model and customizing the loss function;
a class center calculation module, which calculates the class center of each class in the long-tail training set using the warmed-up model;
a training subset construction module, used for constructing a plurality of training subsets with different feature distributions through different resampling strategies;
an intra-class balance training module, used for training the warmed-up model in combination with the custom loss function until the loss function converges, so that the model tends to learn features balanced across the training subsets;
an inter-class balance training module, which applies a regularization term to the classifier of the model obtained by the intra-class balance training module to adjust the weight difference between head and tail classes, the trained model being obtained after the loss function has sufficiently converged; and
a recognition module, used for performing visual recognition with the trained model.
It should be further noted that the technical features of the above options may be combined with one another, or substituted, to form new technical schemes, provided they do not conflict.
Compared with the prior art, the invention has the beneficial effects that:
(1) The method constructs a plurality of training subsets with different feature distributions through different resampling strategies, then trains the warmed-up model with these subsets, in combination with a custom center loss function, until the loss converges, so that the model tends to learn features balanced across the training subsets, i.e., unbiased intra-class representations; this addresses the previously ignored intra-class bias problem. Meanwhile, a regularization term is applied to the classifier to adjust the weight difference between head and tail classes; the trained model is obtained once the loss function has sufficiently converged, reducing the imbalance of the per-class classifier weights caused by imbalanced class sample counts in the training set. Building on prior work on inter-class balance, the invention addresses both imbalanced class sample counts and imbalanced non-class attributes of samples within a class, further accounting for the influence of non-class factors such as background, pose, and viewing angle on visual recognition results; the two balancing methods are complementary, so the overall performance of the model is better.
(2) The present invention explicitly states and models the "intra-class long tail," which explains phenomena previous studies did not address: why sample appearances within the same class exhibit a long-tail distribution, and why some tail-class samples in a long-tail training set are predicted as head classes with similar non-class attributes. It also improves on prior work, which traded accuracy against precision rather than improving both simultaneously.
(3) The intra-class bias balancing method is non-invasive: it can be combined with methods such as cRT, LWS, Balanced Softmax, and BBN, and can be embedded directly into existing long-tail recognition models without affecting their structure, providing them with complementary intra-class balancing capability.
Drawings
FIG. 1 is a flow chart of an indoor scene visual recognition method integrating multiple characterization balance strategies according to an embodiment of the invention;
FIG. 2 is a diagram illustrating inter-class sample long tail distribution and intra-class attribute long tail distribution according to an embodiment of the present invention;
FIG. 3 is a structural causal graph of a new modeling approach to long tail visual recognition problems, shown in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the overall framework according to an embodiment of the present invention.
Detailed Description
The following description of embodiments of the present invention, made with reference to the accompanying drawings, is intended to be clear and complete; the embodiments shown are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In an exemplary embodiment, an indoor scene visual recognition method fusing multiple characterization balance strategies is provided; as shown in FIG. 1, the method includes:
S1, sampling to obtain a long-tail training set;
S2, warming up a model and customizing a loss function;
S3, calculating the class center of each class in the long-tail training set by using the warmed-up model;
S4, constructing a plurality of training subsets with different feature distributions through different resampling strategies;
S5, training the warmed-up model with the training subsets, in combination with the custom loss function, until the loss function converges, so that the model tends to learn features balanced across the training subsets;
S6, applying a regularization term to the classifier of the model trained in step S5 to adjust the weight difference between head and tail classes, obtaining the trained model after the loss function has sufficiently converged;
S7, performing visual recognition with the trained model.
Specifically, the success of existing long-tail visual recognition balancing methods mainly comes from enlarging the confidence boundary of the tail classes to include more tail-class samples, so that tail-class accuracy improves. This is in fact a trade-off between accuracy and precision. Accuracy is defined as Accuracy = #CorrectPredictions / #AllSamples, where #AllSamples is the total number of image samples in the dataset and #CorrectPredictions is the number of image samples for which the model predicts the correct category; accuracy is thus the proportion of correctly predicted images among all images. Precision (also called the precision rate) is defined as Precision = (1 / #Class) · Σ_k (#CorrectPredictionsOfClass_k / #SamplesPredictedAsClass_k), where #SamplesPredictedAsClass_k is the number of image samples the model predicts as class k; for example, if the model predicts 10 images as tiger and 12 images as mouse, then #SamplesPredictedAsClass is 10 for the tiger class and 12 for the mouse class. Precision is therefore the proportion of samples predicted as a class that truly belong to that class, averaged over the #Class categories in the dataset; for example, if the model predicts 10 of 100 images as tigers but only 6 of them are truly tigers, the precision of the tiger class is 6/10.
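The two metrics can be computed as in the following sketch, with the counter names above (#AllSamples, #SamplesPredictedAsThisClass, #Class) paraphrased into code:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """#CorrectPredictions / #AllSamples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_precision(y_true, y_pred, num_classes):
    """Per-class (#CorrectOfClass_k / #PredictedAsClass_k), averaged over #Class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    for k in range(num_classes):
        mask = (y_pred == k)
        # Convention: a class the model never predicts contributes 0 precision.
        per_class.append(float((y_true[mask] == k).mean()) if mask.any() else 0.0)
    return float(np.mean(per_class))

# 10 samples predicted as class 0 (6 correct), 2 predicted as class 1 (both correct).
y_true = [0] * 6 + [1] * 4 + [1] * 2
y_pred = [0] * 10 + [1] * 2
acc = accuracy(y_true, y_pred)             # 8/12
prec = macro_precision(y_true, y_pred, 2)  # (6/10 + 2/2) / 2 = 0.8
```

The example shows the trade-off: over-predicting a class can keep accuracy acceptable while dragging down that class's precision.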
The method provided by the invention applies a regularization term to the model's classifier, reducing the weight difference between head and tail classes on the classifier and alleviating inter-class imbalance. In addition, for the point neglected by currently popular methods, intra-class imbalance, the method constructs a plurality of training subsets with different feature distributions through different resampling strategies and then trains the warmed-up model with these subsets, in combination with a custom center loss function, until the loss converges, so that the model tends to learn features balanced across the subsets, i.e., unbiased intra-class representations, addressing the long-ignored intra-class bias problem. Building on prior inter-class balancing work, the invention addresses both imbalanced class sample counts and imbalanced non-class attributes of samples within a class, further accounting for non-class factors such as background, pose, and viewing angle; the two balancing methods are complementary, so the overall performance of the model is better.
In one example of the indoor scene visual recognition method integrating multiple characterization balance strategies, constructing a plurality of training subsets with different feature distributions through different resampling strategies includes:
applying each of several different resampling modes to every category in the long-tail training set to obtain a number of new small subsets, and then merging all small subsets produced by the same resampling mode to obtain a plurality of large training subsets.
Specifically, a sample X may be represented as "category information" plus "a series of attributes"; that is, sample X is represented by two latent feature vectors, Zc and Za. Zc is an invariant class feature, which can be understood as the template information or prototype of the class, while Za is an attribute feature whose distribution changes as the domain changes, covering factors such as texture, pose, background, and illumination. The visual recognition problem under long-tail datasets can thus be modeled in a new way that explains both category bias and attribute bias. In the structural causal model of this approach, shown in FIGS. 2-3, Zc is a class prototype: a given class Y has a corresponding class prototype Zc. Zc is defined here as a binary vector over components of Y; for example, for Y = person, Zc = [head=1, torso=1, arm=1, leg=1, others=0]. This also accommodates finer-grained classification; for Y = cow, Zc may be [head=1, torso=1, arm=1, leg=1, horn=1, others=0], without requiring a separate, unrelated one-hot vector. Zc has a corresponding set of attributes Za; for example the attribute "hair" may take values "long hair", "short hair", and so on. The attribute Za is in turn affected by external, non-category noise ε. A specific object image X is influenced both by the category template Zc and by the corresponding attribute set Za.
First, the visual recognition task can be viewed as modeling P(Y|X). Zc is shared by all samples in a class, and differences in Za cause the differences in appearance between samples of the same class. From FIG. 3, P(Y|X) can be decomposed into three factors, from left to right: the category template, the intra-class attribute offset, and the inter-class offset, which together produce the different appearances of different samples. Consider first the intra-class differences: although Zc is shared by the samples in a class, differences in Za still make some samples of the same class hard to recognize; for example, green bananas are the tail of the banana class and therefore become hard samples, as shown in FIG. 2. In addition, the intra-class attribute offset can explain why a sample is misclassified: green is common in loofah, so a spurious correlation may form between "green" and "loofah", and green bananas are then classified as loofah with high probability.
1) The difference in performance between samples can be explained by the inequality p(Y=banana | Zc=banana, Za=green) < p(Y=banana | Zc=banana, Za=yellow), since green is a rare attribute within the banana class.
2) The spurious association of samples in one class with another class (samples misclassified into other classes) can be explained as follows: if the number of green loofah samples is excessive, the ratio p(Y=loofah | Za=green) / p(Y=loofah) will be much larger than 1. That is, if a banana sample is green, it is likely to be classified as loofah, and because green bananas are themselves a tail group, this phenomenon is aggravated.
The above explains how the intra-class offset arises and acts on the visual recognition task. To address it, the overall architecture provided by the invention is shown in FIG. 4; a specific visual recognition process is given with reference to the embodiment of FIG. 4:
firstly, a long-tail training set {(x, y)} is obtained by sampling, where x is a sample and y is its category label, and each sample image is resized to 112×112;
the training backbone f(·; θ) (feature extractor) and the classifier g(·; ω) are then warmed up, with cross entropy as the loss function: θ, ω ∈ argmin_{θ,ω} L_cls(g(f(x; θ); ω), y), where θ and ω are the learnable parameters of the backbone network and the classifier, respectively, and L_cls is the cross-entropy loss; warm-up lasts 60 epochs, the optimizer is SGD, and the batch size is 256.
The class center {C_y} of each class is then calculated with the warmed-up model from the previous step;
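One plausible reading of the class-center computation — the mean feature vector of each class under the warmed-up extractor — can be sketched as follows; the patent does not spell out the formula at this point, so the mean is an assumption.

```python
import numpy as np

def compute_class_centers(features, labels):
    """Class center C_y = mean of the extracted features of class y's samples."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return {int(k): features[labels == k].mean(axis=0) for k in np.unique(labels)}

# Toy feature vectors, standing in for f(x; theta) outputs.
centers = compute_class_centers(
    features=[[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
    labels=[0, 0, 1],
)
```

In the full method these centers are refreshed by a moving average each time the training subsets are reconstructed.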
After warm-up, the model is trained with the training subsets, which are periodically reconstructed (every 20 epochs), until the loss function converges. Specifically, two training subsets are constructed through different resampling strategies for subsequent training, {(x_s1, y_s1)}, {(x_s2, y_s2)} = SubsetConstruction({(x, y)}, θ, ω), and the training objective is θ, ω ∈ argmin_{θ,ω} Σ_{s∈E} Σ_{i∈s} (L_cls + α·L_IFL), where E is the set of training subsets. Meanwhile, the class center C_y of each class is updated before each reconstruction by a moving average:
{C_y} → MovingAverage({C_y}, {(f(x_s1; θ), y_s1)}, {(f(x_s2; θ), y_s2)})
Finally, the training process above yields the balanced feature extractor f(·; θ). An additional 10 epochs of training are then performed to handle the inter-class imbalance: during these epochs the parameters θ of the feature extractor are frozen, the parameters ω of the linear classifier are randomly re-initialized, and the classifier alone is tuned for the 10 epochs using a class-balanced resampled training set. This finally yields the balanced feature extractor f(·; θ) and classifier g(·; ω).
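The stage-two classifier tuning — freeze the extractor, randomly re-initialize the classifier, and retrain it on balanced data with a weight penalty — can be sketched as plain softmax regression on fixed features. The L2 penalty below merely stands in for the patent's regularization term, whose exact form is not given in this passage.

```python
import numpy as np

def tune_classifier(feats, labels, num_classes, lr=0.5, reg=1e-2, epochs=200):
    """Gradient descent on L2-regularized softmax regression over frozen features."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=(feats.shape[1], num_classes))  # random re-init
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ w
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = feats.T @ (probs - onehot) / len(feats) + reg * w
        w -= lr * grad
    return w

# Toy frozen features, standing in for a class-balanced resample.
feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
w = tune_classifier(feats, labels, num_classes=2)
preds = (feats @ w).argmax(axis=1)
```

Because only ω is updated while θ stays frozen, the balanced representation learned in stage one is preserved.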
In one example of the indoor scene visual recognition method integrating multiple characterization balance strategies, the different resampling modes include:
one mode that samples the samples in each class with equal weight, and another that samples the samples in each class with weights following the Pareto (80/20) rule.
Specifically, because there is no theoretical guarantee that the features the model learns from the training set can be disentangled, Zc and Za cannot be separated by simple, direct feature selection. The solution proposed by the invention is instead to guide the model to reduce its learning of Za by constructing two training subsets. We first found an empirical rule experimentally: the cosine similarity between each sample and its class center is inversely proportional to the rarity of its Za; that is, the rarer a sample's Za, the smaller the prediction logit the model assigns it. By this rule, the prediction logit the model gives a sample can serve as that sample's position in the intra-class long-tail distribution of Za. A resampling method is then used to construct the training subsets: the two subsets are distinguished by the different Za distributions of their samples, and since, by the empirical rule, the Za distribution can be represented by the samples' prediction logits, a concrete subset-resampling strategy follows. Once the warmed-up feature extractor and classifier are obtained, the prediction confidences of all samples in the current training set are available; for a sample with label k, the prediction confidence is P(y=k | x). As noted above, this confidence can represent the position of sample x within the Za distribution of its class y. The training subsets, together with the loss function, guide the model to reduce its learning of the Za that cause attribute offset, so that Zc serves as the basis for the visual recognition task. There are two resampling modes for constructing the training subsets.
The first assigns the same weight to every sample in each class, so that the Za distribution of each class's samples in the resulting subset matches the distribution in the original set. The second assigns each sample in class k a sampling weight of (1 − p(y=k | Zc, Za))^β according to the Pareto (80/20) rule, where β is automatically adjusted so that the twenty percent of class-k samples with the lowest p(y=k | Zc, Za) values are upsampled to eighty percent of that class; briefly, the second mode assigns each sample in class k a weight opposite to that of the first mode. Each of the two resampling modes is then applied to every category of the original training set to obtain new small subsets, and all small subsets produced by the same mode are merged, finally yielding two large training subsets and completing the subset construction.
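The second mode's weighting can be sketched as follows; here β is a fixed assumption rather than being auto-tuned to the 20/80 target described above.

```python
import numpy as np

def pareto_sampling_weights(confidences, beta=2.0):
    """Weight each class-k sample by (1 - p(y=k|x))**beta and normalize,
    so the lowest-confidence (rarest-attribute) samples are drawn most often."""
    w = (1.0 - np.asarray(confidences, dtype=float)) ** beta
    return w / w.sum()

# Lower prediction confidence -> larger sampling weight.
weights = pareto_sampling_weights([0.9, 0.5, 0.1])
```

In the full method, β would be searched per class until the bottom-20% samples account for 80% of the resampled class.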
Further, sampling the samples in each class with weights according to the two-eight (Pareto) law includes:
upsampling a portion of the samples in the current class until the twenty percent of samples with the lowest prediction confidence in the class reach eighty percent of the number of samples in the original set, while the remaining samples in the current class are downsampled to twenty percent of the original set. Specifically, since a portion of the samples must be upsampled, MixUp is used here. MixUp is a data-enhancement method commonly used for upsampling. The twenty percent of samples in the current class with the lowest P(y=k|Zc, Za) values are found first; two of them are then taken at random, and a fusion ratio μ is drawn from [0, 1] according to a Beta distribution. The corresponding pixels of the two selected pictures are then fused as output = μ·image1 + (1 − μ)·image2, producing a new sample for upsampling. Since both parent samples belong to the same class k, the label of the newly generated sample is k, which constitutes a new pair {x, y}. The upsampling process in the current class continues until the twenty percent of samples with the lowest P(y=k|Zc, Za) values reach eighty percent of the original set, and the remaining samples are downsampled to twenty percent of the original set, completing the construction of the second training subset.
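A minimal sketch of the within-class MixUp step described above, with each picture flattened to a list of pixel values. Drawing the fusion ratio μ from Beta(1, 1) (i.e., uniformly on [0, 1]) is an assumed parameter choice; the invention only states that μ follows a Beta distribution:

```python
import random

def within_class_mixup(images, rng=None):
    """Fuse two randomly chosen pictures of the same class k pixel by pixel:
    output = mu * image1 + (1 - mu) * image2.
    Because both parents share class label k, the new sample keeps label k."""
    rng = rng or random.Random(0)
    img1, img2 = rng.sample(images, 2)
    mu = rng.betavariate(1.0, 1.0)  # assumed Beta(1, 1): uniform fusion ratio
    return [mu * a + (1.0 - mu) * b for a, b in zip(img1, img2)]
```

Repeating this until the low-confidence twenty percent of the class has grown to eighty percent of the original class size yields the upsampled portion of the second training subset.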
In another exemplary embodiment, an indoor scene visual recognition system incorporating multiple characterization balancing strategies is provided, the system comprising:
the data acquisition module is used for sampling to obtain a long-tail training set;
the model preheating module is used for preheating the model and customizing the loss function;
the class center calculating module calculates the class center of each class in the long tail training set by using the preheated model;
the training subset construction module is used for constructing a plurality of training subsets with different feature distributions through different resampling strategies;
the intra-class balance training module is used for training the preheated model by using the training subsets in combination with the self-defined loss function until the loss function converges, so that the model tends to learn the characteristics balanced among the training subsets;
the inter-class balance training module is used for applying a regular term on the classifier of the model obtained by the intra-class balance training module to adjust the weight difference between the head and tail classes, the trained model being obtained after the loss function converges to a certain degree; and
the recognition module is used for performing visual recognition by using the trained model.
The training subset construction module constructs a plurality of training subsets with different feature distributions through different resampling strategies. The intra-class balance training module combines the training subsets output by the training subset construction module with a self-defined center loss function and uses them to train the preheated model until the loss function converges, so that the model tends to learn features that are balanced across the training subsets, i.e., unbiased intra-class characterizations, solving the previously neglected problem of intra-class shift. The inter-class balance training module applies a regular term on the classifier to adjust the weight difference between the head and tail classes, and the trained model is obtained after the loss function converges to a certain degree, reducing the imbalance of the class weights on the classifier caused by the imbalance of class samples in the training set. The problems caused by imbalanced class samples and by imbalanced non-class attributes within classes are thus addressed simultaneously: on top of prior inter-class balancing work, the influence of intra-class long tails caused by non-class factors such as background, pose and viewing angle on the visual recognition result is further considered, and the two balancing methods complement each other, giving the model a better overall effect.
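One plausible form of the inter-class regular term applied to the classifier — penalizing the spread of per-class weight norms so that head classes (large norms) and tail classes (small norms) are pulled together — can be sketched as follows. The variance-of-norms penalty is an illustrative assumption; the invention does not spell out the exact term:

```python
def weight_norms(classifier_weights):
    """L2 norm of each class's weight row in the classifier."""
    return [sum(x * x for x in row) ** 0.5 for row in classifier_weights]

def norm_balance_penalty(classifier_weights):
    """Regular term: variance of the per-class weight norms.
    Zero when head and tail classes have equal norms; grows with imbalance."""
    norms = weight_norms(classifier_weights)
    mean = sum(norms) / len(norms)
    return sum((n - mean) ** 2 for n in norms) / len(norms)
```

Added to the classification loss, a term of this kind pushes the optimizer toward classifiers whose head-class and tail-class weight vectors have comparable magnitudes.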
In another exemplary embodiment, the invention provides a storage medium having stored thereon computer instructions that, when executed, perform the steps of the indoor scene visual recognition method incorporating multiple characterization balancing strategies.
Based on such understanding, the technical solution of the present embodiment, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another exemplary embodiment, the invention provides a terminal, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the indoor scene visual recognition method that fuses multiple characterization balancing strategies.
The processor may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement the invention.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, general and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
The foregoing detailed description of the invention is provided for illustration only; it should not be construed as limiting the invention to those illustrations, since several simple deductions and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and these are to be considered as falling within the scope of the invention.
Claims (10)
1. An indoor scene visual recognition method integrating multiple characterization balance strategies is characterized by comprising the following steps:
s1, sampling to obtain a long tail training set;
s2, preheating a model and customizing a loss function;
s3, calculating a class center of each class in the long-tail training set by using the preheated model;
s4, constructing a plurality of training subsets with different feature distributions through different resampling strategies;
s5, training the preheated model by using the training subsets in combination with the self-defined loss function until the loss function converges, so that the model tends to learn the characteristics balanced among the training subsets;
s6, applying a regular term on the classifier of the model trained in step S5 to adjust the weight difference between the head and tail classes, the trained model being obtained after the loss function converges to a certain degree;
S7, performing visual recognition by using the trained model.
2. The method for indoor scene visual recognition incorporating multiple characterization balancing strategies according to claim 1, wherein the constructing a plurality of training subsets with different feature distributions by different resampling strategies comprises:
and respectively using different resampling modes for each category in the long-tail training set to obtain a plurality of new small subsets, and then combining all the small subsets using the same resampling mode to obtain a plurality of large training subsets.
3. The method for visual recognition of an indoor scene incorporating multiple characterization balancing strategies according to claim 2, wherein the different resampling modes comprise:
one is to sample samples in each class with the same weight, and the other is to sample samples in each class with weights according to the two-eight law.
4. The method for indoor scene visual recognition incorporating multiple characterization balancing strategies according to claim 3, wherein sampling the samples in each class with weights according to the two-eight law comprises:
upsampling a portion of the samples in the current class until the twenty percent of samples with the lowest prediction confidence in the class reach eighty percent of the number of samples in the original set, while downsampling the remaining samples in the current class to twenty percent of the original set.
5. The method for indoor scene visual recognition incorporating multiple characterization balancing strategies according to claim 4, wherein the upsampling method is MixUp data enhancement.
6. The method for indoor scene visual recognition incorporating multiple characterization balancing strategies according to claim 1, wherein training the preheated model using the training subset until the loss function converges comprises:
periodically repeating step S4 to reconstruct the training subset and training the model using the reconstructed training subset.
7. The method for visual recognition of an indoor scene incorporating multiple characterization balancing strategies according to claim 6, wherein the reconstruction is repeated periodically every 20 epochs.
8. The method for indoor scene visual identification incorporating multiple characterization balancing strategies of claim 6, wherein the class center of each class is updated prior to each repetition.
9. The method for visual recognition of an indoor scene incorporating multiple characterization balancing strategies according to claim 1, wherein the step S6 comprises:
the parameters of the model classifier are randomly initialized and the classifier is independently adjusted periodically by using a training subset obtained by resampling.
10. An indoor scene visual recognition system incorporating a plurality of characterization balancing strategies, the system comprising:
the data acquisition module is used for sampling to obtain a long-tail training set;
the model preheating module is used for preheating the model and customizing the loss function;
the class center calculating module calculates the class center of each class in the long tail training set by using the preheated model;
the training subset construction module is used for constructing a plurality of training subsets with different feature distributions through different resampling strategies;
the intra-class balance training module is used for training the preheated model by using the training subsets in combination with the self-defined loss function until the loss function converges, so that the model tends to learn the characteristics balanced among the training subsets;
the inter-class balance training module is used for applying a regular term on the classifier of the model obtained by the intra-class balance training module to adjust the weight difference between the head and tail classes, the trained model being obtained after the loss function converges to a certain degree; and
the recognition module is used for performing visual recognition by using the trained model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310157638.5A CN116152645A (en) | 2023-02-23 | 2023-02-23 | Indoor scene visual recognition method and system integrating multiple characterization balance strategies |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116152645A true CN116152645A (en) | 2023-05-23 |
Family
ID=86373281
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116350227A (en) * | 2023-05-31 | 2023-06-30 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Individualized detection method, system and storage medium for magnetoencephalography spike |
CN116350227B (en) * | 2023-05-31 | 2023-09-22 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Individualized detection method, system and storage medium for magnetoencephalography spike |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||