CN113887340A - Target identification method based on offline deep learning and online man-machine cooperation - Google Patents

Target identification method based on offline deep learning and online man-machine cooperation Download PDF

Info

Publication number
CN113887340A
Authority
CN
China
Prior art keywords
target
interaction
semantic
model
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111080979.4A
Other languages
Chinese (zh)
Inventor
何元 (He Yuan)
曹德宇 (Cao Deyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202111080979.4A priority Critical patent/CN113887340A/en
Publication of CN113887340A publication Critical patent/CN113887340A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target identification method based on offline deep learning and online man-machine cooperation, and relates to the field of computer pattern recognition. The method obtains a pre-trained model from a large data set, applies enhancement transformations to a small data set to expand the data samples, and performs transfer training of the source pre-trained model on the small target data set to obtain an offline training model. When the offline model's recognition confidence is low, the method extracts the low-level features of the target and maps them to semantic attributes, calculates the semantic-attribute distance to the unknown target, and, when the target lies outside the knowledge range of the offline training model, presents the knowledge or experience to be supplemented to the user in the form of questions, thereby recognizing the unknown target. An anthropomorphic man-machine interaction instruction framework is provided, and an integrated target identification method with autonomous learning ability is constructed, improving recognition accuracy on small-sample data and enhancing the ability to identify unknown targets through man-machine interaction.

Description

Target identification method based on offline deep learning and online man-machine cooperation
Technical Field
The invention relates to computer pattern recognition technology, belongs to the field of target detection and recognition, and particularly relates to a target recognition method integrating offline deep learning and online man-machine cooperation.
Background
The future battlefield will take intelligent, networked, manned-unmanned cooperative warfare as its main form. The battlefield environment is complex and changeable, concealment and deception obscure the rules of engagement, and novel combat targets constantly emerge; how to better obtain multi-source battlefield information and make accurate decisions in time is an important problem facing combat personnel. However, most artificial intelligence models are trained on large numbers of samples, and the cost of acquiring battlefield information is huge, which limits the military development of artificial intelligence; realizing battlefield environment and target identification under small-sample and unknown-class conditions therefore has important military significance.
The identification of small-sample and unknown-category labels is an important open problem in computer pattern recognition. Current artificial intelligence techniques perform well in target identification, decision making, and related tasks, but because they rely on massive training data, large amounts of target information must be acquired; when novel targets continually emerge, an autonomously learning machine model cannot fully adapt to complex environments. Current target identification methods find it difficult to accurately identify small-sample targets and cannot identify targets of unknown classes; research on unknown-class identification is sparse, and most existing zero-shot learning (ZSL) methods suffer from a strong bias problem. Man-machine interaction during target recognition training depends on instruction control by professionals; man-machine cooperative recognition methods are complex and lack a suitable man-machine interaction interface.
In the machine learning process, human intelligence is introduced: the knowledge, experience, and logical reasoning abilities of fighters are combined with the quantitative sensing, calculation, and high-precision operation capabilities of machine intelligence. Fighters directly participate in and guide the machine learning evolution process, assisting the machine to quickly establish a deep learning model, and efficient detection and identification of small-sample and unknown targets is realized through man-machine cooperation. Interventional learning evolution technology, which combines small-sample learning evolution capability with man-machine cooperative decision capability, is an important development trend of future military artificial intelligence technology.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a target identification method based on offline deep learning and online man-machine cooperation. Its core features are: offline deep learning can be performed with small-sample data; when an unknown class is encountered, questions are automatically generated, the basic characteristics of that class of targets are summarized, the target type is judged through man-machine cooperation, and the deep learning model is updated; and a convenient, effective man-machine interaction method is provided.
As shown in fig. 1, the offline learning part performs model training on large-sample data and addresses the difficulty of small-sample learning through data enhancement, model transfer learning, and model fine-tuning. After image preprocessing, a deep learning network detects and identifies typical targets, and a high-confidence result is taken as the target's label. When an unknown target is encountered, the offline learning system cannot reach a high-confidence judgment and automatically enters the interventional learning stage. Online learning addresses the difficulty of unknown-target learning through semantic-attribute extraction from the target image and automatic question generation: the system extracts semantic-attribute features with a pre-trained semantic network, generates questions, and presents them to the user; the user judges the target type, the type is added to the known sample set, and a new input-output correspondence is established in the deep learning model.
The invention relates to a target identification method based on offline deep learning and online man-machine cooperation, which comprises the following steps:
step 200, training a model on a large training set and transferring it to the small data set to obtain a small-sample identification model;
in the transfer learning process, data enhancement transformation is introduced into an original small data set to expand data samples, and available expansion methods comprise data transformation methods such as translation, rotation, scaling, noise addition, color space transformation and cutting and a confrontation network generation method; performing migration training on a source pre-training model on the large-scale data set on the target small data set, and extracting model weights and image features except the last full-connected layer; combining the characteristics extracted by the source pre-training model, freezing the parameters of the first five layers of the model by adopting a layer freezing method, adjusting the learning rate of the last three layers, and enabling the migration model to adapt to a small sample scene to obtain a training network;
step 210, identifying a target picture with the small-sample network obtained by transfer learning to obtain the identified target category and its confidence;
step 220, when the confidence of target identification is high, considering the target class identification correct and outputting the identification result through the anthropomorphic human-computer interface;
step 230, when the target recognition confidence is low, considering that the target picture contains an unknown target; the unknown target's semantic attributes are obtained through the semantic mapping model, the distance between the unknown semantic attributes and known semantic attributes in the semantic space is calculated, the knowledge to be supplemented is presented to the user in the form of questions, and the user's input is obtained through the anthropomorphic human-computer interface, thereby realizing unknown-target recognition.
Advantageous effects
The project addresses the requirement for accurate identification of "small-sample, unknown" targets in complex military environments and completes the overall research on interventional learning for the integrated electronic system. It comprehensively utilizes the qualitative perception, decision-making, and planning capabilities of operators together with the quantitative perception, calculation, and high-precision operation capabilities of machine intelligence, and uses an efficient anthropomorphic human-computer interaction channel to construct an interventional learning evolution architecture for the integrated electronic system. Through operator intervention, a semantic association layer is introduced between "known" targets and "small-sample, unknown" targets; knowledge is extracted from the "known" target sample library to predict the "small-sample, unknown" targets, giving the electronic system an efficient and controllable learning evolution capability.
Drawings
The technical steps of the invention are described with reference to the electronic system architecture design, the model training process, and the simulation results; all figures mentioned in the description are briefly explained below. It should be noted that the drawings described below are merely examples of implementations of the present invention, and those of ordinary skill in the pattern recognition field may obtain other drawings for other scenarios.
FIG. 1 is a flow chart of a small sample offline learning and unknown class online co-evolution method of the present invention;
FIG. 2 is a flow diagram of a small sample off-line learning implementation of the present invention;
FIG. 3 is a flow chart of an implementation of the unknown class online learning of the present invention;
FIG. 4 is the semantic space mapping model construction process, semantic attribute extraction, and question generation process of the present invention;
FIG. 5 is a hierarchical instruction fetch and task integration architecture for anthropomorphic human-computer interaction in accordance with the present invention;
Detailed Description
According to the invention, the implementation comprises three parts: offline model training and recognition, online model training with human-computer co-evolution, and an anthropomorphic interaction system.
The off-line deep model training is used for realizing the identification of a small sample target, the implementation steps are shown in fig. 2, the off-line model training comprises three steps of model training based on a large data set, small sample data enhancement and data migration learning, and the specific implementation process is as follows:
step 300, for typical battlefield target identification, the model used in offline identification is a Faster R-CNN (faster region-based convolutional neural network) model trained on the ImageNet data set, and online identification and prediction adopt an unknown-class identification model based on autonomous question identification and generation. The online recognition and prediction model is built on the offline recognition model; to improve recognition accuracy for typical battlefield targets, the offline recognition network is trained on the basis of the Faster R-CNN model. Compared with general image target identification network structures, Faster R-CNN abandons the traditional sliding-window and selective-search methods and uses a region proposal network to generate detection boxes, greatly improving their generation speed. The region proposal network judges whether a target belongs to the foreground or the background through a softmax (normalized exponential) function, and then corrects the target using bounding-box regression to obtain accurate candidate regions.
Step 310, enhance the small-sample data set for the Faster R-CNN model. First, the small-sample data are preprocessed: the images of the scene data set are uniformly resized to 256 × 256 × 3 (3 being the number of color channels), the image mean of the training set is calculated, and this mean is subtracted before the images are input into the model for training. Second, the small-sample data set is enhanced: a generative adversarial network is introduced, its generator and discriminator are trained, the small-sample data set is traversed cyclically, and 10 images are generated for each image, i.e. the data set is expanded 10-fold. Finally, the Faster R-CNN model is fine-tuned: the number of neurons in the model's last fully connected layer is changed to the number of categories of the specific scene data set, the parameters of this layer are randomly initialized from a Gaussian distribution, and the previous layers are initialized with pre-trained parameters, inheriting to a certain extent the "knowledge" learned on the large-scale ImageNet data set so that the whole network starts learning from a position close to an optimum.
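The mean-subtraction preprocessing in Step 310 can be sketched as follows. This is a minimal NumPy illustration assuming the images have already been resized to 256 × 256 × 3; the batch sizes are illustrative and the training-set mean (not the test-set mean) is subtracted from both sets, as the step describes.

```python
import numpy as np

def subtract_train_mean(train_imgs, test_imgs):
    """Compute the per-channel mean over the training set (images already
    resized to 256 x 256 x 3, as in Step 310) and subtract it everywhere."""
    mean = train_imgs.mean(axis=(0, 1, 2))          # one value per channel
    return train_imgs - mean, test_imgs - mean, mean

# Toy batches standing in for the resized scene data set.
rng = np.random.default_rng(1)
train = rng.uniform(0, 255, size=(10, 256, 256, 3))
test = rng.uniform(0, 255, size=(2, 256, 256, 3))

train_c, test_c, mean = subtract_train_mean(train, test)
assert mean.shape == (3,)
assert np.allclose(train_c.mean(axis=(0, 1, 2)), 0.0)  # centered per channel
```

Reusing the training-set statistics on new inputs keeps the model's input distribution consistent between training and recognition.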
The online human-computer cooperation method is used for realizing the identification of unknown class targets through human-computer interaction, the implementation steps are shown in FIG. 3, the method comprises three steps of semantic mapping model training, problem automatic generation and unknown class target learning, and the method is specifically implemented as follows:
step 320, addressing the problem that the convolutional neural network's judgment confidence on some unknown battlefield targets is not high, constructing a small-sample and unknown-class target learning evolution model on a man-machine cooperative architecture, and researching a target identification method based on semantic mapping.
The method takes known targets as the training set and unknown targets as the test set, establishes a semantic attribute space by finding attribute description rules applicable to both known and unknown targets, and verifies the feasibility of the attribute description method. The specific steps are: extract the low-level features of unknown images in the test set; predict the semantic attribute vectors corresponding to the unknown images from the learned mapping; and find the category in the semantic attribute space with the highest similarity to the semantic attribute vector as the final label estimate.
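The final label-estimation step above (nearest class in the semantic attribute space) can be sketched as a cosine-similarity search. The attribute names, class prototypes, and values below are illustrative assumptions, not the patent's actual attribute space.

```python
import numpy as np

def zsl_label(attr_vec, class_attrs):
    """Return the class whose semantic-attribute prototype has the highest
    cosine similarity to the predicted attribute vector."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(class_attrs, key=lambda c: cos(attr_vec, class_attrs[c]))

# Hypothetical attribute space: (has_wings, has_wheels, armored).
class_attrs = {
    "aircraft": np.array([1.0, 0.0, 0.2]),
    "truck":    np.array([0.0, 1.0, 0.1]),
    "tank":     np.array([0.0, 1.0, 1.0]),
}
predicted = np.array([0.1, 0.9, 0.8])   # mapped from low-level image features
assert zsl_label(predicted, class_attrs) == "tank"
```

The same distance computation also supports Step 340: if even the best similarity falls below a threshold, the target can be treated as outside the system's knowledge range and handed to the user as questions.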
Step 330, extracting the target bottom layer features by the online semantic mapping network, and performing semantic mapping on the bottom layer features by using a semantic mapping model to obtain a plurality of dimension semantic attributes, as shown in fig. 4;
step 340, judging whether the type of target information is in the system knowledge range by calculating the distance between the semantic attribute and the recognizable known target semantic attribute; if the target image is not in the knowledge range of the system, presenting the characteristics possibly possessed by the target to the user, and providing the user with the knowledge or experience needing to be supplemented in the form of a question; the system further provides a target type judgment result with high confidence coefficient, and adds the target label into a known class, so that the judgment confidence coefficient of the system on the class of targets can be improved in the future target identification process, and the unknown class target learning is realized.
The anthropomorphic human-computer interaction is used for providing a suitable human-computer interaction platform, the design architecture is shown in fig. 5, and the functions of each layer are as follows:
the physical layer comprises access, initialization, control of various interaction channel hardware devices, reading in of interface data and the like, and is a hardware basic layer of the multi-mode anthropomorphic human-computer interaction system.
The characteristic layer comprises characteristic extraction of various interaction channels and corresponding action tables, symbol tables and attribute tables. The former is responsible for obtaining interactive hardware information output, data preprocessing and extracting feature space representation of corresponding modes, and the latter is responsible for analyzing and labeling features of various interactive forms and helping a subsequent layer to translate interactive content meaning.
The grammar layer realizes the translation function for each interaction channel's features and maintains the normal operation of the whole system. First, the task table is converted into a corresponding task action; then, combining the symbol table and the attribute table, it is judged whether the object is clear and whether corresponding input exists. The judgment results are integrated, and the system decides whether to enter the semantic layer. The grammar layer thus interacts with both the physical layer and the semantic layer; the information transmitted to the semantic layer includes task actions, object description attribute structures, and task parameters.
The semantic layer acquires the interactive semantic information transmitted by the grammar layer and determines whether it is complete. If it is incomplete, the semantic layer returns to the physical layer to acquire input, and the interaction process is rebuilt; if it is complete, the corresponding interactive information is integrated and translated, and the executable interactive task is confirmed to the application layer.
The application layer directly faces users and is the interface for their actual operation. This layer is the software application layer that most intuitively displays the human-computer interaction effect, and it is highly customizable for the application requirements of different environments.
The anthropomorphic human-computer interaction system model mainly comprises six sub-modules: a physical perception module, a signal translation module, a model library, a semantic analysis module, a semantic database, and an application module. The system takes the signal translation and semantic analysis modules as its core, the physical perception module as data input, and the application module as semantic output, while the model library and semantic database serve as the translation standards and their storage. The six sub-modules are briefly described below.
The physical sensing module detects and acquires various physical signals of the user through external devices such as optical or depth cameras, infrared sensors, voice modules, and handwriting screens with matching styluses. It is the input interface and hardware basis of the whole anthropomorphic interaction model, the beginning of all operations, and faces a variety of physical devices. The interaction modes it contains are diverse and convenient; in the invention, the module mainly selects four interaction modes (gesture, voice, emotion, and sketch), widening the selectivity of interaction modes.
The signal translation module receives the physical signals acquired by the physical sensing module, selects a model from the model library according to the signal type, preprocesses the measured signals, processes them through a pre-trained feature extraction model to acquire the corresponding feature information, translates the irregular continuous physical signals into standard digital representation signals, and sends these to the semantic analysis module. For example, palm contour information acquired from the sensor is processed and compared with the standard feature data in the gesture library, and the gesture category is judged and converted into a corresponding representation signal.
The model library is composed of several offline pre-trained models and, owing to the complexity of training and the need for deterministic representation information, is generally not modified. It contains feature extraction models for the various data types; these models are independent and highly reliable, serve as the rear support of the signal translation module, and are an essential part of the multi-modal signal translation process.
The semantic analysis module processes the representation signals given by the signal translation module, converts the intuitive digital representation signals into interactive instructions with actual military meaning through comparative analysis against the semantic database, and outputs them to the application module. This operation weakens the physicality of the signal representation and improves interpretability.
The semantic database is a correspondence dictionary between physical representation signals and semantic representation information, and is adjustable and task-customizable. Because the corresponding semantic content changes with the military mission, the semantic database must be customized for different battlefield environments and missions, and it should be scalable to meet different application requirements.
The application module is the module through which the user reads the interactive information and the computer executes the interactive task. It converts the semantic representation information from the semantic analysis module into human-friendly interface interaction statements, and presents information according to the usability principles of interaction design.
The anthropomorphic man-machine interaction system composed of the physical perception module, signal translation module, semantic analysis module, model library, semantic database, and application module is effective in general battlefield environments and suitable for most application environments.
One implementation of gesture recognition is as follows:
the invention realizes skeleton tracking by using the infrared projector and the depth camera of the sensor. Positioning human skeleton joint point information acquired by a sensor to a hand position, selecting the maximum value and the minimum value of hand pixel point depth values to set a depth threshold value for intercepting, setting the depths of other pixels to be 0, eliminating background interference in the same plane with the hand, acquiring a gesture range, performing smoothing processing on an image by adopting a bilateral filtering algorithm, filling a concave hole and eliminating thin bulges by using morphological operations of expansion and corrosion, converting hand form recognition into hand form contour recognition, extracting a hand contour by adopting a chain code, acquiring Hu moment characteristics of the hand contour image, calculating an average value, and taking the Hu moment as characteristics of learning recognition, wherein the Hu moment is a 7-dimensional vector. The learning identification part adopts a full connection layer network to complete the gesture identification process.
One implementation of speech recognition is as follows:
the invention adopts keyword recognition, namely, a command list is set in advance, and a specific command binds a specific action. The whole speech recognition system mainly comprises two stages of training and recognition. The time domain signal is subjected to framing, windowing and fast Fourier transform processing to obtain a spectrogram with short-time stationarity, Hamming window operation is carried out on the voice signal to enable data to be close to a periodic function, and the frequency spectrums of all windows are overlapped to obtain the spectrogram.
The speech search process is based on the top-N recognition results: a large-vocabulary recognizer is first applied to the input speech to generate the list of the N best results and their confidences, and the list is then searched. If a keyword exists in the list and its confidence is greater than a threshold, the keyword is considered found. The language model uses historical data of length k−1 to estimate the probability of the current word occurring.
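The N-best keyword search can be sketched as follows; the command names, hypothesis texts, and confidence values are illustrative, and the threshold is an assumed parameter.

```python
def find_command(nbest, commands, conf_threshold=0.6):
    """Scan an N-best hypothesis list for a pre-bound command keyword whose
    confidence exceeds the threshold, as in the search step described above.

    nbest: list of (hypothesis_text, confidence) pairs, best first.
    commands: the command list set in advance.
    """
    for text, conf in nbest:
        for cmd in commands:
            if cmd in text and conf > conf_threshold:
                return cmd
    return None

nbest = [("mark the target", 0.82), ("market target", 0.41)]
commands = ["mark", "zoom", "confirm"]
assert find_command(nbest, commands) == "mark"
assert find_command([("background noise", 0.9)], commands) is None
```

Binding each returned keyword to a specific action then completes the command-and-control loop.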
One implementation of sketch interaction is as follows.
The identification method based on structural features and a rule base defines different features for each type of graph and then checks, during identification, whether the graph conforms to certain features. Global features such as the outer closure, maximum embedded triangle, maximum inscribed quadrilateral, the user's stroke speed, and curvature may be employed. Statistical-model-based methods rely on statistical machine learning: a graph is described by a set of features in a feature space, and different types are represented as multi-dimensional probability distribution functions around certain centroids in that space. The identification process determines the probability that a sample belongs to a particular pattern class.
One implementation of emotional interaction is as follows:
an optical camera using a sensor produces 30 optical images per second. The positions of a plurality of mark points of the mouth, eyes, eyebrows and other parts of the face are obtained from the optical image. By tracking the facial marker points and recording the positions and relative positions of the marker points, marker point information of organs is combined, and a threshold value is set for the amplitude of position movement. When the motion amplitude of some mark points of the organ is larger than a certain threshold value, the motion units of all parts of the human face are formed, and if the upward amplitude of most mark points of the eyebrows is larger than the threshold value, the motion unit at the moment is judged to be the eyebrow picking. Organs commonly used for emotion recognition include eyebrows, eyes, a nose, a mouth, and the like, and the emotion is judged by combining each action unit.

Claims (5)

1. A target identification method based on offline deep learning and online human-computer cooperation is characterized in that when a small sample unknown target is identified, a pre-training model result on a large data set is used for transferring to a small sample target data set, small sample learning evolution is achieved through offline learning, meanwhile, partial features obtained through deep learning are mapped with artificially constructed semantic attributes, unknown target identification is achieved through human-computer cooperation, and meanwhile, a personification interaction method is provided.
2. The small sample target transfer learning method according to claim 1, comprising: firstly, carrying out model training on large sample data to obtain a pre-training model; secondly, enhancing small-scale sample data by methods of rotation, scaling, migration, noise addition, cutting, color space conversion and the like, and further expanding the number of samples by generating a confrontation model; and then freezing a plurality of layers of parameters on the basis of the pre-training model, and carrying out transfer learning to fine tune the target identification model based on the expanded small sample data set.
3. The unknown category identification method as claimed in claim 1, comprising: when the recognition confidence rate of the off-line recognition model is low, the target is considered as an unknown target, the semantic features of the target are extracted and mapped into a semantic space which can be understood by a user, the unknown target is compared with the known target, the similarity and similarity are integrated to solve the problem, and the recognition of the position target type is realized through the feedback of the user.
4. The human-computer interaction architecture design method according to claim 1, comprising: in the human-computer interaction process, the physical layer controls access to the interaction-channel hardware devices and the reading of interface data; the feature layer obtains the feature-space representation of the corresponding modality from the interaction hardware output through data preprocessing; the syntax layer implements the translation function for the features of each interaction channel; the semantic layer checks whether the interaction semantic information passed down by the task layer is complete; and the application layer directly faces the operator and serves as the interface for actual operation.
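One way to read the layered architecture above is as a chain of transformations applied in order. The sketch below is a toy composition under assumed data shapes; the layer names follow the claim, but everything else is illustrative.

```python
# Toy pipeline: physical -> feature -> syntax -> semantic -> application.

def physical(raw):
    """Read the interaction channel: (modality, raw interface data)."""
    return {"modality": raw[0], "data": raw[1]}

def feature(x):
    """Preprocess raw data into a modality feature (here: lowercasing)."""
    x["features"] = x["data"].lower()
    return x

def syntax(x):
    """Translate channel features into a command token."""
    x["token"] = x["features"].split()[0]
    return x

def semantic(x):
    """Check that the interaction semantics are complete."""
    if not x.get("token"):
        raise ValueError("incomplete interaction semantics")
    return x

def application(x):
    """Operator-facing layer: issue the actual operation."""
    return f"execute:{x['token']}"

def pipeline(raw):
    x = physical(raw)
    for layer in (feature, syntax, semantic):
        x = layer(x)
    return application(x)

print(pipeline(("voice", "Confirm target")))   # execute:confirm
```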
5. The human-computer interaction method according to claim 1 or 4, comprising: providing the user with four interaction modes, namely gesture, voice, sketch and expression; gesture interaction combines the extraction of skeleton information and depth information to track the hand, and uses somatosensory equipment to implement gesture recognition; voice interaction converts the speech signal into corresponding text or a command based on an isolated-word speech recognition system; sketch interaction decomposes a graphical idea into a series of conceptual instructions that a computer can accept; expression interaction is based on the optical camera signal and tracks facial marker points to convey user information quickly.
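The isolated-word voice interaction in the claim above amounts to mapping a recognized word onto a fixed command table rather than parsing free speech. The vocabulary and command names below are illustrative assumptions.

```python
# Hypothetical isolated-word command table for the voice channel.
COMMANDS = {
    "confirm": "CONFIRM_TARGET",
    "reject":  "REJECT_TARGET",
    "zoom":    "ZOOM_VIEW",
}

def word_to_command(word):
    """Map a recognized isolated word to a command, or flag it as unknown."""
    return COMMANDS.get(word.lower(), "UNKNOWN_WORD")

print(word_to_command("Confirm"))   # CONFIRM_TARGET
```

A small closed vocabulary is what makes isolated-word recognition practical here: each utterance is one word, so no grammar or language model is needed.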
CN202111080979.4A 2021-09-15 2021-09-15 Target identification method based on offline deep learning and online man-machine cooperation Pending CN113887340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111080979.4A CN113887340A (en) 2021-09-15 2021-09-15 Target identification method based on offline deep learning and online man-machine cooperation

Publications (1)

Publication Number Publication Date
CN113887340A true CN113887340A (en) 2022-01-04

Family

ID=79009493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111080979.4A Pending CN113887340A (en) 2021-09-15 2021-09-15 Target identification method based on offline deep learning and online man-machine cooperation

Country Status (1)

Country Link
CN (1) CN113887340A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569613A (en) * 2016-11-14 2017-04-19 中国电子科技集团公司第二十八研究所 Multi-modal man-machine interaction system and control method thereof
CN108406767A (en) * 2018-02-13 2018-08-17 华南理工大学 Robot autonomous learning method towards man-machine collaboration
CN110188707A (en) * 2019-06-03 2019-08-30 西安工业大学 A kind of SAR target identification system and method based on transfer learning
CN111340076A (en) * 2020-02-17 2020-06-26 中国人民解放军32802部队 Zero sample identification method for unknown mode of radar target of new system
WO2021000906A1 (en) * 2019-07-02 2021-01-07 五邑大学 Sar image-oriented small-sample semantic feature enhancement method and apparatus
US20210003697A1 (en) * 2019-07-02 2021-01-07 Wuyi University Method and apparatus for end-to-end sar image recognition, and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SEN WAN ET AL.: "Human-in-the-Loop Low-Shot Learning", IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, 19 August 2020 (2020-08-19), XP011865095, DOI: 10.1109/TNNLS.2020.3011559 *
SHICHAO JIA ET AL.: "Towards Visual Explainable Active Learning for Zero-Shot Classification", IEEE Transactions on Visualization and Computer Graphics, 15 August 2021 (2021-08-15) *
HU YUANDA ET AL.: "Research on interaction design supporting human-in-the-loop hybrid intelligence", Packaging Engineering, vol. 41, no. 18, 13 October 2020 (2020-10-13) *

Similar Documents

Publication Publication Date Title
Mittal et al. A modified LSTM model for continuous sign language recognition using leap motion
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
Von Agris et al. Recent developments in visual sign language recognition
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN101187990A (en) A session robotic system
Kumar et al. Time series neural networks for real time sign language translation
Yu et al. Deep temporal model-based identity-aware hand detection for space human–robot interaction
CN112949647A (en) Three-dimensional scene description method and device, electronic equipment and storage medium
Naqvi et al. ISL Animated Translator and Desktop Virtual Assistant
Lu et al. Review on automatic lip reading techniques
Sanmitra et al. Machine Learning Based Real Time Sign Language Detection
Adhikari et al. A Novel Machine Learning-Based Hand Gesture Recognition Using HCI on IoT Assisted Cloud Platform.
CN114626461A (en) Cross-domain target detection method based on domain self-adaptation
Hamza et al. Pakistan sign language recognition: leveraging deep learning models with limited dataset
CN113887340A (en) Target identification method based on offline deep learning and online man-machine cooperation
Karthik et al. Survey on Gestures Translation System for Hearing Impaired People in Emergency Situation using Deep Learning Approach
Ma et al. Dance action generation model based on recurrent neural network
Zhang et al. Lp-slam: Language-perceptive rgb-d slam system based on large language model
Moustafa et al. Arabic Sign Language Recognition Systems: A Systematic Review
Prasad et al. Fuzzy classifier for continuous sign language recognition from tracking and shape features
Krishnaveni et al. An assertive framework for automatic tamil sign language recognition system using computational intelligence
Issac et al. A Deep Learning Approach for Recognizing Malayalam Sign Language Gestures
Ajay et al. Analyses of Machine Learning Techniques for Sign Language to Text conversion for Speech Impaired

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination