CN111062249A - Vehicle information acquisition method and device, electronic equipment and storage medium - Google Patents

Vehicle information acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN111062249A
CN111062249A (application number CN201911094815.XA)
Authority
CN
China
Prior art keywords
features
candidate frame
vehicle
image
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911094815.XA
Other languages
Chinese (zh)
Inventor
蒋旻悦
谭啸
孙昊
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911094815.XA priority Critical patent/CN111062249A/en
Publication of CN111062249A publication Critical patent/CN111062249A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08 - Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a vehicle information acquisition method, a vehicle information acquisition apparatus, an electronic device and a storage medium, and relates to the field of computer vision. The method comprises: acquiring at least two layers of features of an image to be processed; fusing the at least two layers of features to obtain a fused feature; detecting at least one vehicle position rectangular frame from the image as a candidate frame and performing the following first operation: intercepting, from the fused feature, the feature corresponding to each candidate frame; if a predetermined output condition is met, outputting each candidate frame and its corresponding feature, otherwise correcting each candidate frame; and screening the corrected candidate frames, re-correcting the remaining candidate frames, and repeating the first operation for the re-corrected candidate frames. By applying this scheme, the accuracy of the processing result can be improved.

Description

Vehicle information acquisition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer applications, and in particular, to a method and an apparatus for acquiring vehicle information in the field of computer vision, an electronic device, and a storage medium.
Background
In vehicle search technology, a known target vehicle can be searched for among the vehicle images stored in a database. For example, if the police want to find a vehicle A that was involved in a hit-and-run, they can search for vehicle A among the intersection images captured by the cameras at each intersection and stored in the database, and thereby determine the trajectory of vehicle A.
For each intersection vehicle image, it is necessary to acquire each vehicle position rectangular frame (usually the minimum rectangular frame enclosing a vehicle) and its corresponding feature, so that the feature can be compared with the feature of vehicle A to determine whether the vehicle in the frame is vehicle A. In existing approaches, however, the detection of the vehicle position rectangular frame and the acquisition of its feature are relatively independent, and the detected frame is not further optimized, so the accuracy is poor.
Disclosure of Invention
In view of the above, the present application provides a vehicle information acquisition method, apparatus, electronic device, and storage medium.
A vehicle information acquisition apparatus comprising: the system comprises a feature extraction module, a feature fusion module, a candidate frame generation module and a vehicle re-identification module;
the feature extraction module is used for acquiring at least two layers of features of an image to be processed;
the feature fusion module is used for fusing the at least two layers of features to obtain fused features;
the candidate frame generation module is used for detecting at least one vehicle position rectangular frame from the image to serve as a candidate frame and executing the following first operation: respectively intercepting the features corresponding to each candidate frame from the fusion features, and sending each candidate frame and the corresponding features to the vehicle re-identification module; screening each corrected candidate frame returned by the vehicle re-identification module, re-correcting each remaining candidate frame, and repeatedly executing the first operation for each re-corrected candidate frame;
the vehicle re-identification module is used for executing the following second operation: if a predetermined output condition is met, outputting each received candidate frame and its corresponding feature; otherwise, correcting each received candidate frame and returning the corrected candidate frames to the candidate frame generation module.
According to a preferred embodiment of the present application, the feature extraction module extracts at least two layers of features of the image by using a convolutional neural network; the spatial resolution of each layer of features is sequentially reduced, but the number of channels is sequentially increased.
According to a preferred embodiment of the present application, the feature fusion module takes a layer of features with the lowest spatial resolution as features to be processed, and performs the following third operation: and adjusting the spatial resolution and the channel number of the features to be processed to be consistent with the spatial resolution and the channel number of the features of the previous layer, fusing the features to be processed and the features of the previous layer, taking the fusion result as the features to be processed, and repeatedly executing the third operation until all the features of each layer are fused.
According to a preferred embodiment of the present application, for any candidate frame, the candidate frame generation module intercepts, from the fusion feature, a feature corresponding to the candidate frame according to a position of the candidate frame in the image.
According to a preferred embodiment of the present application, the meeting the predetermined output condition includes: the number of counts reaches a predetermined threshold;
the vehicle re-identification module is further configured to add one to the count number after the received candidate frames are corrected and returned to the candidate frame generation module.
According to a preferred embodiment of the present application, the vehicle re-identification module is further configured to output the feature corresponding to each candidate frame after performing predetermined processing.
A vehicle information acquisition method comprising:
acquiring at least two layers of characteristics of an image to be processed;
fusing the at least two layers of characteristics to obtain fused characteristics;
at least one vehicle position rectangular frame is detected from the image as a candidate frame, and the following first operation is performed: intercepting, from the fused feature, the feature corresponding to each candidate frame; if a predetermined output condition is met, outputting each candidate frame and its corresponding feature, otherwise correcting each candidate frame; and screening the corrected candidate frames, re-correcting the remaining candidate frames, and repeating the first operation for the re-corrected candidate frames.
According to a preferred embodiment of the present application, the acquiring at least two layers of features of the image to be processed includes: extracting at least two layers of characteristics of the image by using a convolutional neural network; the spatial resolution of each layer of features is sequentially reduced, but the number of channels is sequentially increased.
According to a preferred embodiment of the present application, said fusing the at least two layers of features comprises: taking a layer of features with the lowest spatial resolution as features to be processed, and performing the following third operation: and adjusting the spatial resolution and the channel number of the features to be processed to be consistent with the spatial resolution and the channel number of the features of the previous layer, fusing the features to be processed and the features of the previous layer, taking the fusion result as the features to be processed, and repeatedly executing the third operation until all the features of each layer are fused.
According to a preferred embodiment of the present application, the extracting features corresponding to the candidate frames from the fused features respectively includes: and for any candidate frame, according to the position of the candidate frame in the image, intercepting the feature corresponding to the candidate frame from the fusion feature.
According to a preferred embodiment of the present application, the meeting the predetermined output condition includes: the number of counts reaches a predetermined threshold;
the method further comprises the following steps: after the correction is made for each candidate frame, the count number is incremented by one.
According to a preferred embodiment of the present application, the method further comprises: and outputting the characteristics corresponding to each candidate frame after predetermined processing.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
One embodiment of the above application has the following advantages or benefits. The detected vehicle position rectangular frame can be corrected/optimized multiple times, which improves the accuracy of the acquired vehicle position rectangular frame; the corresponding feature can be re-acquired after each correction, so that the detection of the frame and the acquisition of the feature influence each other, the feature is continuously optimized, and the accuracy of the acquired feature is improved. In addition, at least two layers of features of the image, such as multi-layer features extracted by a convolutional neural network whose spatial resolution decreases layer by layer while the number of channels increases, can be acquired and fused, so that deep semantic features and shallow detail features of the image are obtained simultaneously, the acquired features represent the image better, and the accuracy of the acquired features is further improved. Moreover, the feature corresponding to each candidate frame can be output after predetermined processing, so that the output feature is more discriminative. Other effects of the above alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic structural diagram illustrating a vehicle information acquisition apparatus 100 according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a vehicle information obtaining method according to the present application;
fig. 3 is a block diagram of an electronic device according to the method of an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean that A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Fig. 1 is a schematic structural diagram of a vehicle information acquisition apparatus 100 according to an embodiment of the present application. As shown in fig. 1, the apparatus includes: a feature extraction module 101, a feature fusion module 102, a candidate frame generation module 103, and a vehicle re-identification module 104.
The feature extraction module 101 is configured to obtain at least two layers of features of an image to be processed.
And the feature fusion module 102 is configured to fuse at least two layers of features to obtain a fusion feature.
A candidate frame generation module 103, configured to detect at least one vehicle position rectangular frame from the image as a candidate frame and perform the following first operation: intercepting, from the fused feature, the feature corresponding to each candidate frame, and sending each candidate frame and its corresponding feature to the vehicle re-identification module 104; screening the corrected candidate frames returned by the vehicle re-identification module 104, re-correcting the remaining candidate frames, and repeating the first operation for the re-corrected candidate frames.
A vehicle re-identification module 104, configured to perform the following second operation: if a predetermined output condition is met, outputting each received candidate frame and its corresponding feature; otherwise, correcting each received candidate frame and returning the corrected candidate frames to the candidate frame generation module 103.
In practical applications, the vehicle information acquisition apparatus 100 shown in fig. 1 may correspond to a vehicle information acquisition model trained in advance, and the model may be a deep learning model. Vehicle images each containing a plurality of vehicles may be acquired as training data, and the vehicle position rectangular frame of each vehicle, i.e., the ground truth, may be marked in the images. The vehicle information acquisition model can be obtained by training on this data, so that the model learns various capabilities, such as how to detect the vehicle position rectangular frames in an image and how to correct them.
For the image to be processed, at least two layers of features of the image may be obtained, and preferably, the feature extraction module 101 may extract at least two layers of features of the image by using a convolutional neural network, where spatial resolutions of the features of each layer are sequentially reduced, but the number of channels is sequentially increased.
For example, three layers of features can be extracted: F1, F2 and F3. The spatial resolution of F1, F2 and F3 decreases in turn, while the number of channels increases in turn to make up for the information lost in space. For example, F1 may have a spatial resolution of 1024 × 1024 and 128 channels, F2 may have a spatial resolution of 512 × 512 and 256 channels, and so on.
The features extracted by the convolutional neural network are global features in the form of a three-dimensional matrix, whose three dimensions are length, width and channel.
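For illustration only, the following minimal sketch shows one way such multi-layer features could be extracted, assuming a toy three-stage PyTorch backbone; the class name ThreeStageBackbone, the layer counts and the exact channel numbers are assumptions and are not taken from this application.

```python
# A minimal sketch, assuming a toy three-stage backbone in PyTorch; the actual
# backbone, layer counts and channel numbers are not specified by the application.
import torch
import torch.nn as nn

class ThreeStageBackbone(nn.Module):
    """Produces three feature maps F1, F2, F3 whose spatial resolution decreases
    stage by stage while the number of channels increases."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, image):
        f1 = self.stage1(image)   # highest resolution, fewest channels
        f2 = self.stage2(f1)      # half the resolution, more channels
        f3 = self.stage3(f2)      # lowest resolution, most channels
        return f1, f2, f3

# Each feature map is a global feature in (N, C, H, W) layout, matching the
# three-dimensional (length, width, channel) description above. With a
# 2048 x 2048 input, F1 would be 1024 x 1024 with 128 channels and F2 would be
# 512 x 512 with 256 channels, as in the example; a smaller input is used here.
f1, f2, f3 = ThreeStageBackbone()(torch.randn(1, 3, 512, 512))
```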
The feature fusion module 102 may fuse the extracted at least two layers of features to obtain a fusion feature. Because the spatial resolution and the number of channels of each layer feature are different, the fusion cannot be directly performed, and preferably, the following fusion mode can be adopted: taking a layer of features with the lowest spatial resolution as features to be processed, and performing the following third operation: and respectively adjusting the spatial resolution and the channel number of the features to be processed to be consistent with the spatial resolution and the channel number of the features of the previous layer, fusing the features to be processed and the features of the previous layer, taking the fusion result as the features to be processed, and repeatedly executing the third operation until the fusion of the features of all layers is completed.
Taking the three-layer features F1, F2 and F3 as an example, F3 and F2 may be fused first, where F3 is the deeper feature and F2 the shallower one. F3 may first be deconvolved to increase its spatial resolution, e.g., to that of F2, while a convolution layer reduces the number of channels of F3, e.g., to that of F2; F3 and F2 may then be fused by a convolution operation. The fusion result may subsequently be fused with F1 in a similar manner. How to increase the spatial resolution and reduce the number of channels is prior art.
Through this processing, the deep semantic features and shallow detail features of the image can be obtained at the same time; that is, both the semantic information and the detail information of the features are enriched, so that the obtained features represent the image better.
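As a non-authoritative illustration, the sketch below implements one possible version of this top-down fusion in PyTorch. The transposed-convolution upsampling, the 1 × 1 channel reduction and the 3 × 3 fusion convolution follow the description above, but the kernel sizes, the concatenation before the fusion convolution and the helper name fuse_pair are assumptions; in a trained model these layers would be created once and learned rather than instantiated per call.

```python
# A minimal sketch of the top-down fusion described above; kernel sizes and the
# concatenation-based merge are illustrative assumptions.
import torch
import torch.nn as nn

def fuse_pair(deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
    """Fuse a deeper (lower-resolution, more-channel) map into a shallower one."""
    c_deep, c_shallow = deep.shape[1], shallow.shape[1]
    upsample = nn.ConvTranspose2d(c_deep, c_deep, kernel_size=2, stride=2)  # raise spatial resolution
    reduce = nn.Conv2d(c_deep, c_shallow, kernel_size=1)                    # reduce channel number
    merge = nn.Conv2d(2 * c_shallow, c_shallow, kernel_size=3, padding=1)   # fuse by convolution
    aligned = reduce(upsample(deep))                  # now matches `shallow` in H, W and channels
    return merge(torch.cat([aligned, shallow], dim=1))

# F3 is fused into F2 first, and the result is then fused with F1.
f1 = torch.randn(1, 128, 256, 256)
f2 = torch.randn(1, 256, 128, 128)
f3 = torch.randn(1, 512, 64, 64)
to_process = fuse_pair(f3, f2)      # features to be processed after the first fusion
fused = fuse_pair(to_process, f1)   # final fused feature, at F1's resolution and channel count
```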
The candidate frame generation module 103 may detect at least one vehicle position rectangular frame from the image as a candidate frame. For example, a plurality of rectangular frames may be generated in advance for the image to be processed; these frames need to cover, as far as possible, the various possible positions and sizes of vehicles in the whole image. The confidence that each rectangular frame is a vehicle position rectangular frame may be determined based on its corresponding features, and the rectangular frames whose confidence meets the requirement (for example, is greater than a threshold) may be taken as the detected vehicle position rectangular frames, i.e., the candidate frames, of which there may be one or more.
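For illustration, the snippet below sketches the confidence-based selection step just described; the helper name select_candidates, the (x1, y1, x2, y2) box representation and the 0.5 threshold are assumptions, and the model that predicts the confidence scores is not shown.

```python
# A minimal sketch of keeping pre-generated rectangular frames whose confidence
# of being a vehicle position rectangular frame meets the requirement.
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image pixel coordinates

def select_candidates(boxes: List[Box], scores: List[float],
                      threshold: float = 0.5) -> List[Box]:
    """Keep the boxes whose confidence is greater than the threshold."""
    return [box for box, score in zip(boxes, scores) if score > threshold]

# Example: three pre-generated boxes covering different positions and sizes.
boxes = [(0, 0, 200, 150), (300, 120, 620, 360), (50, 400, 180, 520)]
scores = [0.12, 0.91, 0.64]
candidates = select_candidates(boxes, scores)   # -> the last two boxes
```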
For each candidate frame, the candidate frame generation module 103 may further intercept the corresponding feature from the fused feature. As described above, the features extracted by the convolutional neural network are global features in the form of a three-dimensional matrix whose dimensions are length, width and channel, and the position of any candidate frame in the image is known; therefore, the feature corresponding to the candidate frame, i.e., the feature at the position of the candidate frame, can be intercepted from the fused feature according to the position of the candidate frame in the image.
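The following sketch shows one simple way such position-based interception could look, assuming the fused feature is a (C, H, W) tensor and box coordinates are linearly rescaled from image space to feature space; the function name crop_box_feature is made up for illustration, and interpolation schemes such as RoI Align are omitted.

```python
# A minimal sketch of intercepting the feature of one candidate frame from the
# fused global feature, under the assumptions stated above.
import torch

def crop_box_feature(fused: torch.Tensor, box, image_size):
    """Cut out the region of the fused (C, H, W) feature covered by a candidate frame."""
    _, fh, fw = fused.shape
    img_w, img_h = image_size
    x1, y1, x2, y2 = box
    # Map image-space pixel coordinates to feature-space indices.
    fx1, fy1 = int(x1 * fw / img_w), int(y1 * fh / img_h)
    fx2, fy2 = max(int(x2 * fw / img_w), fx1 + 1), max(int(y2 * fh / img_h), fy1 + 1)
    return fused[:, fy1:fy2, fx1:fx2]

fused = torch.randn(128, 256, 256)   # fused feature of a 1024 x 1024 image (128 channels)
box_feature = crop_box_feature(fused, (300, 120, 620, 360), (1024, 1024))  # -> shape (128, 60, 80)
```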
Further, the candidate frame generation module 103 may send each candidate frame and its corresponding feature to the vehicle re-identification module 104. The vehicle re-identification module 104 may correct each received candidate frame according to the learned correction manner and return the corrected candidate frames to the candidate frame generation module 103. The candidate frame generation module 103 may then screen the corrected candidate frames returned by the vehicle re-identification module 104, for example by re-determining the confidence of each corrected candidate frame and filtering out those whose confidence does not meet the requirement, re-correct the remaining candidate frames according to the learned correction manner, intercept, for each re-corrected candidate frame, the corresponding feature from the fused feature, and send the candidate frames and their corresponding features to the vehicle re-identification module 104 again.
The above process may be repeated multiple times; accordingly, meeting the predetermined output condition may mean that a counting number reaches a predetermined threshold. The initial value of the counting number may be 0, and the specific value of the threshold may be set according to actual needs, such as 2. In that case, after the vehicle re-identification module 104 receives each candidate frame and its corresponding feature from the candidate frame generation module 103 for the first time, it may correct the received candidate frames, return them to the candidate frame generation module 103, and increment the counting number by one. The candidate frame generation module 103 may then screen the corrected candidate frames returned by the vehicle re-identification module 104, re-correct the remaining candidate frames, intercept the corresponding feature from the fused feature for each re-corrected candidate frame, and send the candidate frames and their features to the vehicle re-identification module 104. The vehicle re-identification module 104 may correct the received candidate frames once more, return them to the candidate frame generation module 103, and increment the counting number again. The candidate frame generation module 103 again screens and re-corrects the candidate frames, intercepts the corresponding features from the fused feature, and sends the candidate frames and their features to the vehicle re-identification module 104; since the counting number has now reached the threshold, the vehicle re-identification module 104 outputs the received candidate frames and their corresponding features.
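To make the control flow above easier to follow, here is a schematic sketch of the loop between the two modules, with the counting number and a threshold of 2 as the output condition. The functions crop_features, reid_correct, gen_screen and gen_correct are hypothetical stand-ins for the feature interception, the learned correction and the confidence-based screening; they are not interfaces defined by this application.

```python
# A schematic sketch of the iteration between the candidate frame generation
# module and the vehicle re-identification module; all helpers are hypothetical.
def acquire_vehicle_info(detected_boxes, fused_feature,
                         crop_features, reid_correct, gen_screen, gen_correct,
                         threshold=2):
    boxes, count = detected_boxes, 0                     # counting number starts at 0
    while True:
        features = crop_features(fused_feature, boxes)   # intercept the feature of each candidate frame
        if count >= threshold:                           # predetermined output condition met
            return boxes, features                       # output candidate frames and their features
        boxes = reid_correct(boxes, features)            # re-identification module corrects the frames
        count += 1                                       # ...and increments the counting number
        boxes = gen_correct(gen_screen(boxes))           # generation module screens, then re-corrects
```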
Preferably, the vehicle re-identification module 104 may perform predetermined processing on the feature corresponding to each candidate frame before outputting it. For any candidate frame, after processing by several convolution layers, the corresponding feature may be divided according to vehicle parts, which may include the vehicle front part, the vehicle middle part and the vehicle rear part (each corresponding to a rectangular region within the rectangular frame); the processed feature corresponding to each part may then be acquired, the processed features of the parts may be concatenated, and the concatenation result may be output. In this way, the output features become more discriminative.
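A rough sketch of this part-based post-processing is given below, assuming the feature of a candidate frame is split into three equal regions along one spatial axis and each part is average-pooled before concatenation; the equal split, the pooling choice and the omitted convolution layers are assumptions for illustration.

```python
# A minimal sketch of dividing a candidate frame's feature into front, middle
# and rear parts and concatenating the per-part features.
import torch

def part_based_feature(box_feature: torch.Tensor) -> torch.Tensor:
    """box_feature: (C, H, W) feature intercepted for one candidate frame."""
    thirds = torch.chunk(box_feature, 3, dim=1)                 # front / middle / rear regions
    part_vectors = [part.mean(dim=(1, 2)) for part in thirds]   # pool each part to a C-dim vector
    return torch.cat(part_vectors)                              # concatenated output feature, shape (3*C,)

feature = part_based_feature(torch.randn(128, 48, 24))          # -> tensor of shape (384,)
```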
It can be seen that, based on the above solution of the apparatus embodiment, the detected vehicle position rectangular frame can be corrected/optimized multiple times, which improves the accuracy of the acquired vehicle position rectangular frame; the corresponding feature can be re-acquired after each correction, so that the detection of the frame and the acquisition of the feature influence each other, the feature is continuously optimized, and the accuracy of the acquired feature is improved. In addition, at least two layers of features of the image, such as multi-layer features extracted by a convolutional neural network whose spatial resolution decreases layer by layer while the number of channels increases, can be acquired and fused, so that deep semantic features and shallow detail features of the image are obtained simultaneously, the acquired features represent the image better, and the accuracy of the acquired features is further improved. Moreover, the feature corresponding to each candidate frame can be output after predetermined processing, so that the output feature is more discriminative.
The application also discloses a vehicle information acquisition method. Fig. 2 is a flowchart of an embodiment of the vehicle information acquisition method according to the present application. As shown in fig. 2, the method includes the following steps.
In 201, at least two layers of features of an image to be processed are acquired.
At 202, at least two layers of features are fused to obtain a fused feature.
At 203, at least one vehicle position rectangular frame is detected from the image as a candidate frame, and the following first operation is performed: intercepting, from the fused feature, the feature corresponding to each candidate frame; if a predetermined output condition is met, outputting each candidate frame and its corresponding feature, otherwise correcting each candidate frame; and screening the corrected candidate frames, re-correcting the remaining candidate frames, and repeating the first operation for the re-corrected candidate frames.
Preferably, at least two layers of features of the image can be extracted by using a convolutional neural network; the spatial resolution of each layer of features is sequentially reduced, but the number of channels is sequentially increased.
For example, three layers of features can be extracted: F1, F2 and F3. The spatial resolution of F1, F2 and F3 decreases in turn, while the number of channels increases in turn to make up for the information lost in space. For example, F1 may have a spatial resolution of 1024 × 1024 and 128 channels, F2 may have a spatial resolution of 512 × 512 and 256 channels, and so on.
The extracted at least two layers of features may be fused to obtain a fused feature. Because the spatial resolution and the number of channels differ between layers, the features cannot be fused directly; preferably, the following fusion mode may be adopted: take the layer of features with the lowest spatial resolution as the features to be processed and perform the following third operation: adjust the spatial resolution and channel number of the features to be processed to be consistent with those of the previous layer of features, fuse the features to be processed with that layer, take the fusion result as the new features to be processed, and repeat the third operation until all layers have been fused. Taking the three-layer features F1, F2 and F3 as an example, F3 and F2 may be fused first, where F3 is the deeper feature and F2 the shallower one: F3 may first be deconvolved to increase its spatial resolution, e.g., to that of F2, while a convolution layer reduces the number of channels of F3, e.g., to that of F2; F3 and F2 may then be fused by a convolution operation, and the fusion result may subsequently be fused with F1 in a similar manner.
At least one vehicle position rectangular frame may be detected from the image as a candidate frame. For example, a plurality of rectangular frames may be generated in advance for the image to be processed; these frames need to cover, as far as possible, the various possible positions and sizes of vehicles in the whole image. The confidence that each rectangular frame is a vehicle position rectangular frame may be determined based on its corresponding features, and the rectangular frames whose confidence meets the requirement may be taken as the detected vehicle position rectangular frames, i.e., the candidate frames, of which there may be one or more.
For the acquired candidate frames, the following first operation may be performed: intercepting, from the fused feature, the feature corresponding to each candidate frame; if a predetermined output condition is met, outputting each candidate frame and its corresponding feature, otherwise correcting each candidate frame; and screening the corrected candidate frames, for example by re-determining the confidence of each corrected candidate frame and filtering out those whose confidence does not meet the requirement, re-correcting the remaining candidate frames, and repeating the first operation for the re-corrected candidate frames.
The features extracted by the convolutional neural network are global features in the form of a three-dimensional matrix whose dimensions are length, width and channel. For any candidate frame, its position in the image is known, so the feature corresponding to the candidate frame, i.e., the feature at the position of the candidate frame, can be intercepted from the fused feature according to that position.
The meeting of the predetermined output condition may mean that the count number reaches a predetermined threshold value, and in addition, after the correction is performed on each candidate frame, the count number may be incremented by one.
Preferably, the feature corresponding to each candidate frame may be output after predetermined processing. For any candidate frame, after processing by several convolution layers, the corresponding feature may be divided according to vehicle parts, which may include the vehicle front part, the vehicle middle part and the vehicle rear part (each corresponding to a rectangular region within the rectangular frame); the processed feature corresponding to each part may then be acquired, the processed features of the parts may be concatenated, and the concatenation result may be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In short, based on the scheme of the method embodiment, the detected vehicle position rectangular frame can be corrected/optimized multiple times, which improves the accuracy of the acquired vehicle position rectangular frame; the corresponding feature can be re-acquired after each correction, so that the detection of the frame and the acquisition of the feature influence each other, the feature is continuously optimized, and the accuracy of the acquired feature is improved. In addition, at least two layers of features of the image, such as multi-layer features extracted by a convolutional neural network whose spatial resolution decreases layer by layer while the number of channels increases, can be acquired and fused, so that deep semantic features and shallow detail features of the image are obtained simultaneously, the acquired features represent the image better, and the accuracy of the acquired features is further improved. Moreover, the feature corresponding to each candidate frame can be output after predetermined processing, so that the output feature is more discriminative.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 3 is a block diagram of an electronic device according to the method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 3, the electronic apparatus includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 3, a processor Y01 is taken as an example.
Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.
Memory Y02, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present application (e.g., xx module X01, xx module X02, and xx module X03 shown in fig. X). The processor Y01 executes various functional applications of the server and data processing, i.e., implements the method in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or in another manner, and the connection by the bus is exemplified in fig. 3.
The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer, one or more mouse buttons, track ball, joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A vehicle information acquisition apparatus characterized by comprising: the system comprises a feature extraction module, a feature fusion module, a candidate frame generation module and a vehicle re-identification module;
the feature extraction module is used for acquiring at least two layers of features of an image to be processed;
the feature fusion module is used for fusing the at least two layers of features to obtain fused features;
the candidate frame generation module is used for detecting at least one vehicle position rectangular frame from the image to serve as a candidate frame and executing the following first operation: respectively intercepting the features corresponding to each candidate frame from the fusion features, and sending each candidate frame and the corresponding features to the vehicle re-identification module; screening each corrected candidate frame returned by the vehicle re-identification module, re-correcting each remaining candidate frame, and repeatedly executing the first operation;
the vehicle re-identification module is used for executing the following second operation: and if the output conditions are met, outputting each received candidate frame and the corresponding characteristics, otherwise, correcting each received candidate frame and returning to the candidate frame generation module.
2. The apparatus of claim 1,
the feature extraction module extracts at least two layers of features of the image by using a convolutional neural network; the spatial resolution of each layer of features is sequentially reduced, but the number of channels is sequentially increased.
3. The apparatus of claim 2,
the feature fusion module takes a layer of features with the lowest spatial resolution as features to be processed, and executes the following third operation: and respectively adjusting the spatial resolution and the channel number of the features to be processed to be consistent with the spatial resolution and the channel number of the features of the previous layer, fusing the features to be processed and the features of the previous layer, taking the fusion result as the features to be processed, and repeatedly executing the third operation until all the features of each layer are fused.
4. The apparatus of claim 1,
and the candidate frame generation module is used for intercepting the feature corresponding to any candidate frame from the fusion feature according to the position of the candidate frame in the image.
5. The apparatus of claim 1,
the meeting of the predetermined output condition includes: the number of counts reaches a predetermined threshold;
the vehicle re-identification module is further configured to increment the count by one after the received candidate frames are modified.
6. The apparatus of claim 1,
and the vehicle re-identification module is further used for outputting the characteristics corresponding to the candidate frames after performing preset processing.
7. A vehicle information acquisition method characterized by comprising:
acquiring at least two layers of characteristics of an image to be processed;
fusing the at least two layers of characteristics to obtain fused characteristics;
at least one vehicle position rectangular frame is detected from the image as a candidate frame, and the following first operations are performed: respectively intercepting the features corresponding to each candidate frame from the fusion features; if the output conditions are met, outputting each candidate frame and the corresponding characteristics, otherwise, correcting each candidate frame; and screening each corrected candidate frame, correcting each residual candidate frame again, and repeatedly executing the first operation aiming at each corrected candidate frame.
8. The method of claim 7,
the acquiring of the at least two layers of features of the image to be processed comprises: extracting at least two layers of characteristics of the image by using a convolutional neural network; the spatial resolution of each layer of features is sequentially reduced, but the number of channels is sequentially increased.
9. The method of claim 8,
the fusing the at least two layers of features comprises: taking a layer of features with the lowest spatial resolution as features to be processed, and performing the following third operation: and respectively adjusting the spatial resolution and the channel number of the features to be processed to be consistent with the spatial resolution and the channel number of the features of the previous layer, fusing the features to be processed and the features of the previous layer, taking the fusion result as the features to be processed, and repeatedly executing the third operation until all the features of each layer are fused.
10. The method of claim 7,
the respectively intercepting the features corresponding to the candidate frames from the fusion features comprises: and for any candidate frame, according to the position of the candidate frame in the image, intercepting the feature corresponding to the candidate frame from the fusion feature.
11. The method of claim 7,
the meeting of the predetermined output condition includes: the number of counts reaches a predetermined threshold;
the method further comprises the following steps: after the correction is made for each candidate frame, the count number is incremented by one.
12. The method of claim 7,
the method further comprises the following steps: and outputting the characteristics corresponding to each candidate frame after predetermined processing.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 7-12.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 7-12.
CN201911094815.XA 2019-11-11 2019-11-11 Vehicle information acquisition method and device, electronic equipment and storage medium Pending CN111062249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911094815.XA CN111062249A (en) 2019-11-11 2019-11-11 Vehicle information acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911094815.XA CN111062249A (en) 2019-11-11 2019-11-11 Vehicle information acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111062249A true CN111062249A (en) 2020-04-24

Family

ID=70298215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911094815.XA Pending CN111062249A (en) 2019-11-11 2019-11-11 Vehicle information acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111062249A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680628A (en) * 2020-06-09 2020-09-18 北京百度网讯科技有限公司 Text box fusion method, device, equipment and storage medium
CN112580665A (en) * 2020-12-18 2021-03-30 深圳赛安特技术服务有限公司 Vehicle style identification method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330463A (en) * 2017-06-29 2017-11-07 南京信息工程大学 Model recognizing method based on CNN multiple features combinings and many nuclear sparse expressions
US20180211099A1 (en) * 2015-07-20 2018-07-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN108805016A (en) * 2018-04-27 2018-11-13 新智数字科技有限公司 A kind of head and shoulder method for detecting area and device
CN109063768A (en) * 2018-08-01 2018-12-21 北京旷视科技有限公司 Vehicle recognition methods, apparatus and system again
CN109377508A (en) * 2018-09-26 2019-02-22 北京字节跳动网络技术有限公司 Image processing method and device
CN109829909A (en) * 2019-01-31 2019-05-31 深兰科技(上海)有限公司 A kind of object detection method, device and storage medium
CN109886312A (en) * 2019-01-28 2019-06-14 同济大学 A kind of bridge wheel of vehicle detection method based on multilayer feature fused neural network model
CN110427915A (en) * 2019-08-14 2019-11-08 北京百度网讯科技有限公司 Method and apparatus for output information

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180211099A1 (en) * 2015-07-20 2018-07-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN107330463A (en) * 2017-06-29 2017-11-07 南京信息工程大学 Model recognizing method based on CNN multiple features combinings and many nuclear sparse expressions
CN108805016A (en) * 2018-04-27 2018-11-13 新智数字科技有限公司 A kind of head and shoulder method for detecting area and device
CN109063768A (en) * 2018-08-01 2018-12-21 北京旷视科技有限公司 Vehicle recognition methods, apparatus and system again
CN109377508A (en) * 2018-09-26 2019-02-22 北京字节跳动网络技术有限公司 Image processing method and device
CN109886312A (en) * 2019-01-28 2019-06-14 同济大学 A kind of bridge wheel of vehicle detection method based on multilayer feature fused neural network model
CN109829909A (en) * 2019-01-31 2019-05-31 深兰科技(上海)有限公司 A kind of object detection method, device and storage medium
CN110427915A (en) * 2019-08-14 2019-11-08 北京百度网讯科技有限公司 Method and apparatus for output information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAQUAN SHEN et al.: "Vehicle Detection in Aerial Images Based on Hyper Feature Map in Deep Convolutional Network", vol. 13, no. 13, pages 1989 - 2011 *
Y. TIAN et al.: "Selective Multi-Convolutional Region Feature Extraction based Iterative Discrimination CNN for Fine-Grained Vehicle Model Recognition", pages 3279 - 3284 *
SU Songzhi et al.: "Pedestrian Detection: Theory and Practice" (《行人检测：理论与实践》), vol. 1, Xiamen University Press, pages 31 - 32 *
HUANG Zesang: "Research on Object Detection Technology Based on Deep Learning" (基于深度学习的目标检测技术研究), pages 138 - 1136 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680628A (en) * 2020-06-09 2020-09-18 北京百度网讯科技有限公司 Text box fusion method, device, equipment and storage medium
CN111680628B (en) * 2020-06-09 2023-04-28 北京百度网讯科技有限公司 Text frame fusion method, device, equipment and storage medium
CN112580665A (en) * 2020-12-18 2021-03-30 深圳赛安特技术服务有限公司 Vehicle style identification method and device, electronic equipment and storage medium
CN112580665B (en) * 2020-12-18 2024-04-19 深圳赛安特技术服务有限公司 Vehicle style identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111722245B (en) Positioning method, positioning device and electronic equipment
US20220270289A1 (en) Method and apparatus for detecting vehicle pose
EP3989116A1 (en) Method and apparatus for detecting target object, electronic device and storage medium
CN110929639A (en) Method, apparatus, device and medium for determining position of obstacle in image
CN111507355B (en) Character recognition method, device, equipment and storage medium
CN111462174B (en) Multi-target tracking method and device and electronic equipment
EP3905122B1 (en) Video type detection method, apparatus, electronic device and storage medium
CN110717933B (en) Post-processing method, device, equipment and medium for moving object missed detection
US20210256725A1 (en) Target detection method, device, electronic apparatus and storage medium
CN111275190A (en) Neural network model compression method and device, image processing method and processor
US11423650B2 (en) Visual positioning method and apparatus, and computer-readable storage medium
CN111784757A (en) Training method of depth estimation model, depth estimation method, device and equipment
CN111062249A (en) Vehicle information acquisition method and device, electronic equipment and storage medium
CN112001265A (en) Video event identification method and device, electronic equipment and storage medium
CN111275827A (en) Edge-based augmented reality three-dimensional tracking registration method and device and electronic equipment
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111708477B (en) Key identification method, device, equipment and storage medium
CN111597987B (en) Method, apparatus, device and storage medium for generating information
CN111861991A (en) Method and device for calculating image definition
CN111191619A (en) Method, device and equipment for detecting virtual line segment of lane line and readable storage medium
CN114625297A (en) Interaction method, device, equipment and storage medium
CN112528931B (en) Method and device for generating position prediction information and automatic driving vehicle
CN112100530B (en) Webpage classification method and device, electronic equipment and storage medium
CN112396494A (en) Commodity guide method, commodity guide device, commodity guide equipment and storage medium
CN110798681A (en) Monitoring method and device of imaging equipment and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200424