CN117668573A - Target state detection method, device, intelligent device and medium


Info

Publication number
CN117668573A
Authority
CN
China
Prior art keywords
target
historical
information
current
current target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311691390.7A
Other languages
Chinese (zh)
Inventor
李传康
王溯恺
王云龙
单为
姚卯青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Weilai Zhijia Technology Co Ltd
Original Assignee
Anhui Weilai Zhijia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Weilai Zhijia Technology Co Ltd
Priority to CN202311691390.7A
Publication of CN117668573A
Legal status: Pending


Landscapes

  • Traffic Control Systems (AREA)

Abstract

The application provides a target state detection method, device, intelligent device and medium. The method extracts features from each current target in the current time frame and each first historical target in the historical time frames to obtain first feature information of the current targets and historical feature information of the first historical targets, and determines first mask attention information of the current time frame based on the first feature information and the historical feature information. Based on the first mask attention information, the first feature information and the historical feature information, second feature information of each current target is obtained, from which the state information of each current target is predicted. Target detection and state prediction are thereby separated, so each model stays lightweight; data of any modality can serve as input without per-modality adaptation, reducing the complexity of the network structure; and because the prediction does not depend on upstream detection results, the obtained state information is more accurate.

Description

Target state detection method, device, intelligent device and medium
Technical Field
The application relates to the technical field of target detection, and in particular provides a target state detection method, device, intelligent device and medium.
Background
Autonomous driving functions are gaining increasing acceptance, and with advances in sensor and information technology users expect these functions to cover ever more scenarios. Correctly tracking targets in the perception scene provides complete and accurate environment information to downstream functional modules such as planning and control (Planning and Control, PNC) and autonomous emergency braking (Autonomous Emergency Braking, AEB), and improves the user experience.
In the related art, detection methods for the target state fall roughly into two types: 1) detection methods based on rule filtering; 2) detection methods based on pre-fusion neural networks.
Detection methods based on rule filtering generally assume that the target moves at a constant velocity or with constant acceleration. Based on the state quantities of the target in historical time frames, the state quantities of the current time frame are predicted using classical physical motion formulas and Kalman filtering. However, this method has the following disadvantage: the state quantities of the current time frame depend on the state quantities of the target in historical time frames, are easily affected by upstream noise, and suffer from error accumulation.
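For concreteness, the following is a minimal sketch of such a rule-filtering baseline: one constant-velocity Kalman predict/update cycle in Python. The state layout, time step and noise matrices are illustrative assumptions of this sketch, not taken from the patent.

```python
import numpy as np

dt = 0.1  # frame interval in seconds (illustrative)

# State x = [px, py, vx, vy]; constant-velocity transition matrix
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],   # only the position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2          # process noise (assumed)
R = np.eye(2) * 1e-1          # measurement noise (assumed)

def kf_step(x, P, z):
    """One predict/update cycle; noise in the measurement z propagates
    into the state x, which is the error-accumulation effect noted above."""
    # Predict with the constant-velocity motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the detected position z of the current frame
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P
```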
Detection methods based on pre-fusion neural networks generally take radar point cloud data, camera image data and the like as input, extract features through separate branches, fuse the feature maps inside the network, and finally predict the state of a target while detecting it. However, this method has the following disadvantage: target detection and state prediction form one integrated model, and dedicated branches must be configured for each data modality to adapt to changes in modality, making the model complex.
Therefore, how to accurately predict the target state with a relatively lightweight model is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
To overcome the above drawbacks, the present application provides a target state detection method, device, intelligent device and medium that solve, or at least partially solve, the technical problems of low accuracy in predicting the target state and excessive complexity of the prediction model.
In a first aspect, the present application provides a method for detecting a target state, the method for detecting a target state including:
extracting features of each current target and each first historical target to obtain first feature information of the current target and historical feature information of the first historical target; wherein the current target is a target in a current time frame; the first historical targets are targets in each historical time frame;
determining first mask attention information of a current time frame based on the first feature information and historical feature information of the first historical target; wherein the first mask attention information is used to indicate a first historical target that matches the current target;
obtaining second characteristic information of the current target based on the first mask attention information, the first characteristic information and the history characteristic information of the first history target;
based on the second characteristic information, carrying out state prediction on the current target to obtain state information of the current target; wherein the status information comprises speed information and/or orientation information.
In a second aspect, the present application provides a target state detection device comprising a processor and a storage device, the storage device being adapted to store program code adapted to be loaded and executed by the processor to perform any of the target state detection methods above.
In a third aspect, an intelligent device is provided, which may comprise the target state detection device described above.
In a fourth aspect, a computer readable storage medium is provided, storing program code adapted to be loaded and executed by a processor to perform any of the target state detection methods above.
The technical scheme has at least one or more of the following beneficial effects:
in the technical solution implementing the application, feature extraction is performed on each current target in the current time frame and each first historical target in the historical time frames to obtain the first feature information of the current targets and the historical feature information of the first historical targets; the first mask attention information of the current time frame is determined from the first feature information and the historical feature information; the second feature information of each current target is obtained from the first mask attention information, the first feature information and the historical feature information; and state prediction is performed on each current target from its second feature information to obtain its state information. Target detection and state prediction are thereby separated, so each model stays lightweight; data of any modality can serve as input, and feature extraction needs no per-modality adaptation, reducing the complexity of the network structure. Moreover, at each detection the state of the current targets of the current time frame is detected independently under the first mask attention information, avoiding the influence of upstream detection results and yielding more accurate state information.
Drawings
The disclosure of the present application will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: these drawings are for illustrative purposes only and are not intended to limit the scope of the present application. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow chart of the main steps of a method for detecting a target state according to one embodiment of the present application;
FIG. 2 is a schematic diagram of a network structure for detecting a target state according to the present application;
FIG. 3 is a schematic diagram of an actual scenario employing a conventional target state detection method;
FIG. 4 is a schematic diagram of an actual scenario employing the method of detecting a target state of the present application;
FIG. 5 is a schematic diagram of another practical scenario employing the method of detecting a target state of the present application;
FIG. 6 is a schematic diagram of still another practical scenario employing the method of detecting a target state of the present application;
FIG. 7 is a schematic diagram of still another practical scenario employing the method of detecting a target state of the present application;
FIG. 8 is a schematic diagram of still another practical scenario employing the method of detecting a target state of the present application;
fig. 9 is a main structural block diagram of a detection apparatus of a target state according to an embodiment of the present application.
Detailed Description
Some embodiments of the present application are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present application, and are not intended to limit the scope of the present application.
In the description of the present application, a "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "a and/or B" means all possible combinations of a and B, such as a alone, B alone or a and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone or A and B. The singular forms "a", "an" and "the" include plural referents.
An automated driving system (Automated Driving System, ADS) continuously performs all dynamic driving tasks (Dynamic Driving Task, DDT) within its operational design domain (Operational Design Domain, ODD). That is, the machine system is allowed to fully take over the vehicle under the prescribed, appropriate driving scenario conditions: when the vehicle satisfies the ODD conditions the system is activated and replaces the human driver as the driving subject of the vehicle. The dynamic driving task DDT refers to the continuous lateral (left and right steering) and longitudinal (acceleration, deceleration, constant speed) motion control of the vehicle and the detection of and response to targets and events in the vehicle's driving environment. The operational design domain ODD refers to the conditions under which the automated driving system can operate safely; these conditions may include geographical location, road type, speed range, weather, time, and national and local traffic laws and regulations.
Accurately predicting the states of targets in the perception scene of the ADS provides complete and accurate environment information to downstream functional modules such as PNC and AEB, and improves the user experience.
Therefore, in order to accurately predict the state of the target, the present application provides the following technical solutions:
referring to fig. 1, fig. 1 is a schematic flow chart of main steps of a method for detecting a target state according to an embodiment of the present application. As shown in fig. 1, the method for detecting a target state in the embodiment of the present application mainly includes the following steps 101 to 104.
Step 101, extracting features of each current target and each first historical target to obtain first feature information of the current target and historical feature information of the first historical target;
in one implementation, perception data of multiple time frames in the current scene, such as images and point clouds, may be acquired using cameras, radars and the like. The acquired multi-time-frame perception data are input into a pre-trained target detection model for detection, yielding the targets of the environment in each time frame, from which the current targets in the current time frame and the first historical targets in several historical time frames are obtained. The multi-time-frame perception data comprise the perception data of the current time frame and of each historical time frame.
Specifically, after obtaining the target detection results output by the target detection model for each time frame, a preset sliding time window can slide over and intercept the model's output to obtain the current targets of the current time frame and the first historical targets of several historical time frames, so that all observations over a past period are gathered for comprehensive feature extraction, ensuring the accuracy of the subsequent state prediction. The temporal length of the sliding window can be set according to actual requirements. In theory, the larger the temporal length T, the more historical-frame information the model references and the smoother and more stable the output. However, the benefit of increasing T diminishes at the margin: frames too far in the past contribute little while the computational cost grows rapidly, so the temporal length may be set to, but is not limited to, T=10 to balance prediction accuracy against computation.
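A minimal sketch of such a sliding window over the detector's per-frame outputs, assuming T=10; the class and method names here are hypothetical, chosen only for illustration.

```python
from collections import deque

T = 10  # temporal window length chosen in the description

class DetectionWindow:
    """Keeps the detections of the latest T frames; the oldest frame
    is discarded automatically when a new frame arrives."""
    def __init__(self, length=T):
        self.frames = deque(maxlen=length)

    def push(self, detections):
        # `detections` is whatever the upstream detector emits for one
        # frame, e.g. a list of boxes with class/size/position attributes.
        self.frames.append(detections)

    def current_and_history(self):
        # The newest frame holds the current targets; the rest are the
        # first historical targets.
        *history, current = self.frames
        return current, history
```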
In a specific implementation, based on the network structure shown in Fig. 2, an encoder may be used to extract features from each current target in the current time frame and each first historical target in the historical time frames to obtain the first feature information of each current target and the historical feature information of each first historical target; that is, the current targets and the first historical targets may be encoded to obtain the first feature information and the historical feature information.
Fig. 2 is a schematic diagram of a network structure for detecting a target state in the present application. As shown in Fig. 2, the network structure may include K encoders, J decoders, a first mask attention information network (Asso Head), a second mask attention information network, and a state prediction network (State Head). The input consists of the current targets of the current time frame and the first historical targets of the historical time frames from step 101. A multi-layer perceptron (MLP) maps the input data to vectors; under the second mask attention information (ST Attention Mask), a softmax activation, Add & LN layers, FFN layers and the like output the feature information Memory of the target Keys of all time frames, which comprises the first feature information of the current targets and the historical feature information of the first historical targets. Tgt is the first feature information of the current-frame target Queries and can be taken as the last M feature vectors of Memory. The Asso Head outputs the association matching result of the current target Queries (i.e., the association between each Query and all Keys) as the first mask attention information (Association Mask), which indicates the first historical targets matched to each current target. The Association Mask thus guides the Decoder to attend only to the matched first historical targets, i.e., only to a target's own historical state information; the MLP, softmax activation, Add & LN and FFN layers in the Decoder then produce the second feature information of each current target, which is input to the State Head to output the state information of each current target.
The network structure of this embodiment takes as input the targets detected by the target detection model; that is, the network performs only state prediction. Target detection and state prediction are thereby separated and each model stays lightweight, which suits real product development and is easier to iterate: problems are localized to modules, two teams can optimize the models independently, and the models are combined at deployment. Even when several data modalities exist simultaneously, their detection results are fed into the network as a single temporal sequence for subsequent state prediction, and no per-modality feature-extraction branches are needed, reducing the complexity of the network structure. In Fig. 2, M denotes the maximum number of targets in one time frame, T the total number of time frames (the current frame plus T−1 historical frames), N = T × M the maximum number of targets across all time frames, and C the feature dimension of a current target after decoding.
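To make the separation concrete, the following PyTorch sketch mirrors the structure of Fig. 2 under stated assumptions: the feature dimension, head count, layer counts, the linear Asso Head and State Head, the sigmoid threshold and the forward signature are all illustrative choices of this sketch, not specified by the patent.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One attention block of Fig. 2: MHA + Add & LN + FFN."""
    def __init__(self, c=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads)
        self.ln1, self.ln2 = nn.LayerNorm(c), nn.LayerNorm(c)
        self.ffn = nn.Sequential(nn.Linear(c, 4 * c), nn.ReLU(), nn.Linear(4 * c, c))

    def forward(self, q, kv, mask=None):
        # A boolean mask entry of True forbids attention, which lines up
        # with the 1 = "no match" convention of the masks described here.
        h, _ = self.attn(q, kv, kv, attn_mask=mask)
        q = self.ln1(q + h)
        return self.ln2(q + self.ffn(q))

class StatePredictor(nn.Module):
    def __init__(self, in_dim=16, c=128, k=3, j=3, state_dim=4):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, c), nn.ReLU(), nn.Linear(c, c))  # MLP
        self.encoders = nn.ModuleList(Block(c) for _ in range(k))  # K encoders
        self.decoders = nn.ModuleList(Block(c) for _ in range(j))  # J decoders
        self.asso_head = nn.Linear(c, c)           # Asso Head (simplified)
        self.state_head = nn.Linear(c, state_dim)  # State Head: speed/orientation

    def forward(self, targets, st_mask, m):
        # targets: (N, in_dim) detections of all T frames, N = T * M;
        # the last m rows are the current-frame targets (Queries).
        memory = self.embed(targets)
        for enc in self.encoders:
            memory = enc(memory, memory, st_mask)  # ST Attention Mask
        tgt = memory[-m:]                          # Tgt: current-frame Queries
        logits = tgt @ self.asso_head(memory).t()  # (m, N) association scores
        asso_mask = torch.sigmoid(logits) < 0.5    # True = no match
        idx = torch.arange(m)
        asso_mask[idx, memory.shape[0] - m + idx] = False  # keep self-association
        for dec in self.decoders:
            tgt = dec(tgt, memory, asso_mask)      # Association Mask guides decoding
        return self.state_head(tgt)                # state info per current target
```

Forcing the self-association entries to False mirrors the constraint, stated below, that each Query must at least match itself, and also keeps every attention row from being fully masked.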
In a specific implementation, when predicting the matching relation, if the similarity of feature information such as spatio-temporal position, category and size between two targets fails to satisfy a preset condition, the two targets cannot form a matching relation, and this pair can be excluded from network training and inference. Specifically, the preset condition may be: if part of the feature information of two targets is similar (e.g., similar type and size) but the time interval between their detections exceeds a preset duration, it can be determined that they cannot match. Likewise, if part of the feature information is similar but the two targets appear in different spaces, they cannot match. And if part of the feature information (e.g., the spatio-temporal features) is similar but the two targets have entirely different categories, or their size difference exceeds a preset threshold, they cannot match either.
On this basis, a matrix encoding the matching possibilities can be computed as the preset second mask attention information, which indicates the first historical targets that have no possibility of matching the current targets. Feature extraction can then be performed on the current targets and the first historical targets under this second mask attention information to obtain the first feature information and the historical feature information, so that first historical targets with no possibility of matching a current target are excluded as far as possible before features are extracted.
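One plausible way to build such a matrix is with pairwise thresholds on time gap, distance, size ratio and category; all threshold values below are assumptions of this sketch.

```python
import torch

def build_second_mask(times, classes, sizes, positions,
                      max_dt=2.0, max_dist=30.0, max_size_ratio=2.0):
    """Boolean (N, N) mask; True marks pairs that cannot possibly match
    and are therefore excluded from attention. times: (N,) timestamps,
    classes: (N,) int labels, sizes: (N,) floats, positions: (N, D)."""
    dt = (times[:, None] - times[None, :]).abs()
    dist = torch.cdist(positions, positions)
    ratio = sizes[:, None] / sizes[None, :]
    ratio = torch.maximum(ratio, 1.0 / ratio)       # symmetric size ratio
    diff_class = classes[:, None] != classes[None, :]
    impossible = ((dt > max_dt) | (dist > max_dist)
                  | (ratio > max_size_ratio) | diff_class)
    impossible.fill_diagonal_(False)  # a target can always attend to itself
    return impossible
```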
In one specific implementation, spatio-temporal feature embedding may be implemented by a spatio-temporal position embedding (STPE) module, through which a Transformer network can distinguish elements of different positions and time frames and better handle dependencies within a sequence in the attention mechanism. The STPE may be fixed or learnable: a fixed STPE is generated by mathematical functions (such as sine and cosine functions), while a learnable STPE is updated by back-propagation during training. This embodiment may employ fixed sine and cosine functions.
Therefore, spatio-temporal position embedding and encoding can be performed on each current target in the current time frame and each first historical target in the historical time frames to obtain the first feature information of each current target and the historical feature information of each first historical target. The spatio-temporal position embedding enhances the feature representations of the current targets and of the first historical targets.
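A minimal sketch of a fixed sine/cosine STPE follows; combining the time and coordinate codes by summation is an assumed wiring, as the patent does not fix it.

```python
import torch

def sinusoidal_embedding(values, dim=64, base=10000.0):
    """Fixed sine/cosine embedding; `values` can be frame indices (time)
    or coordinates (space), shape (N,). `dim` must be even."""
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = values[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, dim)

def stpe(frame_idx, x, y, dim=64):
    # Sum the time and position codes and add the result to the target
    # features before the encoder (an assumed combination scheme).
    return (sinusoidal_embedding(frame_idx, dim)
            + sinusoidal_embedding(x, dim)
            + sinusoidal_embedding(y, dim))
```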
In this embodiment, the required feature information can be extracted according to actual requirements, and the richer feature information thus obtained improves the accuracy of the Association Mask prediction and, in turn, of the state prediction.
Step 102, determining first mask attention information of the current time frame based on the first feature information and the historical feature information of the first historical target;
in a specific implementation, the state prediction of a target is generally related only to that target's own historical state information. Therefore, association matching can be performed for each current target based on its first feature information and the historical feature information of each first historical target, and the association matching results serve as the first mask attention information of the current time frame. The Association Mask may be, but is not limited to, a 0/1 matrix of shape (M, N) learned by the Asso Head, where each element represents the match between a current target Query and one of the Keys (1 denotes no match, 0 denotes a match). A current target Query has at most one associated target per time frame, and at the current time the mask must associate the Query with itself, so each row contains at most T zeros and at least one zero: there are T zeros when a matching historical target exists in every historical frame, and a single zero when only the self-association exists.
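The row constraint described above can be checked directly; a small sketch (the function name is hypothetical):

```python
import torch

def check_association_mask(mask, t):
    """mask: (M, N) 0/1 tensor with the convention above
    (1 = no match, 0 = match). Each row must contain between 1 and T
    zeros: the self-association plus at most one match per historical
    frame."""
    zeros_per_row = (mask == 0).sum(dim=1)
    assert torch.all(zeros_per_row >= 1), "each Query must at least match itself"
    assert torch.all(zeros_per_row <= t), "at most T matches (one per time frame)"
```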
Specifically, under the preset second mask attention information, the historical feature information of the first historical targets may be filtered based on the first feature information of each current target and the historical feature information of each first historical target, yielding the historical feature information of the second historical targets; this aids the training convergence and inference performance of the encoder. The association matching results of the current targets are then obtained from the first feature information of the current targets and the historical feature information of the second historical targets, where a second historical target is a historical target that has a possibility of matching a current target.
Step 103, obtaining second feature information of the current target based on the first mask attention information, the first feature information and the history feature information of the first history target;
in a specific implementation, after obtaining the Association Mask, the Asso Head may input it to the Decoder to guide the Decoder to attend only to the matched first historical targets. Under the Association Mask, the Decoder decodes the first feature information of each current target and the historical feature information of each first historical target into the second feature information of each current target, which is input to the State Head to output the state information of each current target.
Step 104, carrying out state prediction on the current target based on the second feature information to obtain the state information of the current target.
In a specific implementation, the state information of a current target may include orientation information; in that case the second feature information may be input to an orientation head in the state prediction network to obtain the orientation information of the current target. The orientation information may include the direction angle, yaw angle, pitch angle, roll angle and the like of the current target. The direction angle is the angle swept, centered on the target's position, from due north or due south to the target's direction line, which may point along the target's direction of motion. The pitch angle is the angle between the target's direction of motion and the horizontal plane; the yaw angle is the angle between the projection of the direction of motion onto the horizontal plane and a predetermined direction in that plane, which may be set to the road direction; and the roll angle represents the lateral inclination.
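A common parameterization for such an orientation head, sketched below, predicts the heading as (sin, cos) and recovers the angle with atan2; this exact parameterization is an assumption of the sketch, as the patent only names the output quantities.

```python
import torch
import torch.nn as nn

class OrientationHead(nn.Module):
    """Maps the second feature information of a current target to a yaw
    angle via a (sin, cos) prediction (assumed parameterization)."""
    def __init__(self, c=128):
        super().__init__()
        self.fc = nn.Linear(c, 2)

    def forward(self, feat):
        sc = self.fc(feat)
        sin, cos = sc[..., 0], sc[..., 1]
        return torch.atan2(sin, cos)  # yaw in radians, range (-pi, pi]
```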
According to the target state detection method above, feature extraction is performed on each current target in the current time frame and each first historical target in the historical time frames to obtain the first feature information of the current targets and the historical feature information of the first historical targets; the first mask attention information of the current time frame is determined from the first feature information and the historical feature information; the second feature information of each current target is obtained from the first mask attention information, the first feature information and the historical feature information; and state prediction is performed on each current target from its second feature information to obtain its state information. Target detection and state prediction are thereby separated, each model stays lightweight, data of any modality can serve as input without per-modality adaptation of the feature extraction, and the complexity of the network structure is reduced; at each detection the state of the current targets is detected independently under the first mask attention information, avoiding the influence of upstream detection results and yielding more accurate state information.
Fig. 3 is a schematic diagram of an actual scenario using a conventional target state detection method. As shown in Fig. 3, the right side shows the actual scene of the ego vehicle, in which a stationary electric bicycle stands ahead and to the left. Because the conventional method mispredicts a speed for this electric bicycle, it appears to be entering the lane and falsely triggers AEB, which degrades the user's experience of autonomous driving and, in more severe cases, may cause personal injury.
Fig. 4 is a schematic diagram of an actual scenario using the target state detection method of the present application. As shown in Fig. 4, the input of the method is the current targets in the current time frame detected by the target detection model together with the first historical targets in the historical time frames (not shown), and the output is the speed information and/or orientation information of the current targets. In this scenario, a moving target A travels in the ego vehicle's lane r1, oriented in the ego vehicle's direction of travel, while a moving target B travels in the opposite lane r2, oriented against it; a first stationary target C and a second stationary target D stand at the roadside. The method makes the correct predictions: the speeds of moving targets A and B are marked with arrows in Fig. 4, and their directions of motion are opposite.
Fig. 5 is a schematic diagram of another actual scenario using the target state detection method of the present application. As shown in Fig. 5, the right side shows the actual scene of the ego vehicle: a stationary electric bicycle stands ahead to the right, and several moving vehicles are ahead to the left. The left side of Fig. 5 shows the state information of the targets detected by the ego vehicle in this scene. From Fig. 5 it can be determined that a single stationary vulnerable road user (Vulnerable Road User, VRU) is observed ahead to the right without any abnormal speed intruding into the ego lane (the detection box without an arrow), so AEB is not triggered, while the vehicles ahead and to the left have speeds (detection boxes with arrows), providing complete and accurate environment information to the PNC.
Fig. 6 is a schematic diagram of still another actual scenario using the target state detection method of the present application. As shown in Fig. 6, the right side shows the actual scene of the ego vehicle: some stationary electric bicycles and pedestrians stand ahead to the right, and several moving vehicles are ahead to the left. The left side of Fig. 6 shows the state information of the targets detected by the ego vehicle in this scene. From Fig. 6 it can be determined that a group of VRUs is observed ahead to the right without any abnormal speed intruding into the ego lane, so AEB is not triggered, while the moving vehicles ahead of the ego vehicle have speeds, providing complete and accurate environment information to the PNC.
Fig. 7 is a schematic diagram of still another actual scenario using the target state detection method of the present application. As shown in Fig. 7, the right side shows the actual scene of the ego vehicle, in which an electric bicycle with a speed is ahead to the right. The left side of Fig. 7 shows the state information of the targets detected by the ego vehicle in this scene. From Fig. 7 it can be determined that a single VRU is observed ahead to the right with an abnormal speed intruding into the ego lane, so AEB is triggered.
Fig. 8 is a schematic diagram of still another actual scenario using the target state detection method of the present application. As shown in Fig. 8, the right side shows the actual scene of the ego vehicle, in which several moving vehicles travel at multiple angles ahead of the ego vehicle. The left side of Fig. 8 shows the state information of the targets detected by the ego vehicle in this scene. From Fig. 8 it can be determined that the vehicles at multiple angles ahead have speeds, providing complete and accurate environment information to the PNC.
It should be noted that, although the foregoing embodiments describe the steps in a specific sequential order, it should be understood by those skilled in the art that, in order to achieve the effects of the present application, different steps need not be performed in such an order, and may be performed simultaneously (in parallel) or in other orders, and these variations are within the scope of protection of the present application.
It will be appreciated by those skilled in the art that all or part of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a computer readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable storage medium may include any entity or device capable of carrying the computer program code: a USB disk, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory, random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer readable storage medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, computer readable storage media do not include electrical carrier signals and telecommunications signals.
Further, the application also provides a detection device for the target state.
Referring to fig. 9, fig. 9 is a main block diagram of a detection apparatus of a target state according to an embodiment of the present application. As shown in fig. 9, the apparatus for detecting a target state in the embodiment of the present application may include a processor 91 and a storage 92.
The storage device 92 may be configured to store a program for executing the target state detection method of the above method embodiment, and the processor 91 may be configured to execute programs in the storage device 92, including but not limited to that program. For ease of description, only the parts relevant to the embodiments of the present application are shown; for specific technical details not disclosed, refer to the method part of the embodiments. The target state detection device may be a control device comprising various electronic devices.
In one implementation, there may be a plurality of storage devices 92 and processors 91, and the program executing the target state detection method of the above method embodiment may be split into a plurality of subprograms, each of which may be loaded and executed by a processor 91 to perform different steps of the method. Specifically, the subprograms may be stored in different storage devices 92, and each processor 91 may be configured to execute the programs in one or more storage devices 92, so that the processors jointly implement the target state detection method of the above method embodiment, with each processor 91 executing different steps of the method.
The plurality of processors 91 may be processors disposed on the same device, for example, the device may be a high-performance device composed of a plurality of processors, and the plurality of processors 91 may be processors configured on the high-performance device. The plurality of processors 91 may be processors disposed on different devices, for example, the devices may be a server cluster, and the plurality of processors 91 may be processors on different servers in the server cluster.
Further, the application also provides an intelligent device comprising the target state detection device of the above embodiment. The intelligent device may include a driving device, an autonomous vehicle, an intelligent car, a robot, an unmanned aerial vehicle, and the like.
In some embodiments of the present application, the smart device further comprises at least one sensor for sensing information. The sensor is communicatively coupled to any of the types of processors referred to herein. Optionally, the intelligent device further comprises an automatic driving system, and the automatic driving system is used for guiding the intelligent device to drive by itself or assist driving. The processor communicates with the sensors and/or the autopilot system for performing the method of any one of the embodiments described above.
Further, the present application also provides a computer-readable storage medium. In one embodiment of a computer-readable storage medium according to the present application, the computer-readable storage medium may be configured to store a program for performing the method of detecting a target state of the above-described method embodiment, which may be loaded and executed by a processor to implement the method of detecting a target state as described above. For convenience of explanation, only those portions relevant to the embodiments of the present application are shown, and specific technical details are not disclosed, refer to the method portions of the embodiments of the present application. The computer readable storage medium may be a storage device including various electronic devices, and optionally, in embodiments of the present application, the computer readable storage medium is a non-transitory computer readable storage medium.
Further, it should be understood that, since the modules are merely intended to illustrate the functional units of the device of the present application, the physical entities corresponding to these modules may be the processor itself, or a part of the software, a part of the hardware, or a combination of both within the processor. The number of modules in the figure is therefore merely illustrative.
Those skilled in the art will appreciate that the various modules in the apparatus may be adaptively split or combined. Such splitting or combining of specific modules does not lead to a deviation of the technical solution from the principles of the present application, and therefore, the technical solution after splitting or combining will fall within the protection scope of the present application.
It should be noted that any personal user information involved in the embodiments of the present application is processed strictly in accordance with laws and regulations, following the principles of legality, legitimacy and necessity, based on reasonable business purposes, and limited to personal information actively provided by the user or generated while using the product/service, with the user's authorization.
Depending on the specific product/service scenario, the personal information processed by the application may differ and may involve the user's account information, device information, driving information, vehicle information or other related information. The application treats the user's personal information and its processing with high diligence.
The application attaches great importance to the security of personal user information and adopts reasonable and feasible security measures that meet industry standards to protect user information and prevent unauthorized access, disclosure, use, modification, damage or loss of personal information.
Thus far, the technical solutions of the present application have been described with reference to the embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present application is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present application, and such modifications and substitutions will be within the scope of the present application.

Claims (10)

1. A method for detecting a target state, comprising:
extracting features of each current target and each first historical target to obtain first feature information of the current target and historical feature information of the first historical target; wherein the current target is a target in a current time frame; the first historical targets are targets in each historical time frame;
determining first mask attention information of a current time frame based on the first feature information and historical feature information of the first historical target; wherein the first mask attention information is used to indicate a first historical target that matches the current target;
obtaining second characteristic information of the current target based on the first mask attention information, the first characteristic information and the history characteristic information of the first history target;
based on the second characteristic information, carrying out state prediction on the current target to obtain state information of the current target; wherein the status information comprises speed information and/or orientation information.
2. The method of detecting a target state according to claim 1, wherein determining first mask attention information of a current time frame based on the first feature information and history feature information of the first history target comprises:
and carrying out association matching on the current target based on the first characteristic information and the historical characteristic information of the first historical target to obtain an association matching result of the current target as the first mask attention information.
3. The method for detecting a target state according to claim 2, wherein performing association matching on the current target based on the first feature information and the history feature information of the first history target to obtain an association matching result of the current target, includes:
filtering the first historical targets based on the first characteristic information and the characteristic information of the first historical targets under the preset second mask attention information to obtain the characteristic information of a second historical target; wherein the second mask attention information is used to indicate that there is no match possibility with the current target; the second historical target is a historical target with matching possibility with the current target;
and obtaining the association matching result based on the first characteristic information and the characteristic information of the second historical targets.
4. The method of detecting a target state according to claim 1, wherein extracting features of each current target and each first historical target to obtain the first feature information of the current target and the historical feature information of the first historical target comprises:
and encoding the current target and the first historical target to obtain the first characteristic information and the historical characteristic information of the first historical target.
5. The method of detecting a target state according to claim 1, wherein extracting features of each current target and each first historical target to obtain the first feature information of the current target and the historical feature information of the first historical target comprises:
performing space-time position embedding and encoding on the current target and the first historical target to obtain the first characteristic information and the historical characteristic information of the first historical target;
wherein the spatiotemporal location embeds a feature representation for enhancing the current target and a feature representation of the first historical target.
6. The method of detecting a target state according to claim 1, wherein extracting features of each current target and each first historical target to obtain the first feature information of the current target and the historical feature information of the first historical target comprises:
extracting features of the current target and the first historical target under the preset second mask attention information to obtain the first feature information and the historical feature information of the first historical target;
wherein the second mask attention information is used to indicate that there is no match possibility with the current target.
7. The method according to any one of claims 1 to 6, characterized by further comprising, before feature extraction for each current target and each first historical target:
inputting multi-time frame sensing data into a target detection model for detection to obtain the current target and the first historical target; wherein the multi-time frame sensing data comprises sensing data of a current time frame and sensing data of each historical time frame.
8. A detection apparatus for a target state, characterized by comprising a processor and a storage device, the storage device being adapted to store program code adapted to be loaded and executed by the processor to perform the method of detecting a target state according to any one of claims 1 to 7.
9. An intelligent device comprising the target state detection device of claim 8.
10. A computer readable storage medium, characterized in that program code is stored therein, the program code being adapted to be loaded and executed by a processor to perform the method of detecting a target state according to any one of claims 1 to 7.
CN202311691390.7A 2023-12-08 2023-12-08 Target state detection method, device, intelligent device and medium Pending CN117668573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311691390.7A 2023-12-08 2023-12-08 Target state detection method, device, intelligent device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311691390.7A 2023-12-08 2023-12-08 Target state detection method, device, intelligent device and medium

Publications (1)

Publication Number Publication Date
CN117668573A 2024-03-08

Family

ID=90071115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311691390.7A (pending) 2023-12-08 2023-12-08 Target state detection method, device, intelligent device and medium

Country Status (1)

Country Link
CN (1) CN117668573A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination