WO2022152009A1 - Target detection method, apparatus, device, and storage medium

Target detection method, apparatus, device, and storage medium

Info

Publication number
WO2022152009A1
Authority
WO
WIPO (PCT)
Prior art keywords
directional
frequency band
sub-band
target
Application number
PCT/CN2022/070095
Other languages
English (en)
French (fr)
Inventor
徐东
林国飞
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022152009A1
Priority to US17/982,101 (published as US20230053911A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/70 Game security or game management aspects
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/847 Cooperative playing, e.g. requiring coordinated actions from several players to achieve a common goal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Definitions

  • the present application relates to the field of image recognition, and in particular, to a target detection method, apparatus, device, and storage medium.
  • Target detection is widely used, for example to identify moving target characters in surveillance videos or to identify moving game characters in games.
  • Embodiments of the present application provide a target detection method, apparatus, device, and storage medium, which can improve the accuracy of target detection.
  • the technical solution is as follows:
  • a target detection method executed by a computer device, the method comprising:
  • Multi-band filtering is performed on the first target area to obtain a plurality of frequency band submaps, and the first target area is the area to be detected in the first video frame;
  • Multi-directional filtering is performed on the multiple frequency band sub-maps to obtain multiple directional sub-maps
  • the directional band fusion feature is input into the target detection model, and prediction is performed based on the directional band fusion feature through the target detection model to obtain the prediction label of the first target area, where the prediction label is used to indicate whether the first target area includes the target object.
  • a target detection device comprising:
  • a multi-band filtering module configured to perform multi-band filtering on a first target area to obtain a plurality of frequency band submaps, where the first target area is an area to be detected in the first video frame;
  • a multi-directional filtering module configured to perform multi-directional filtering on the multiple frequency band sub-maps to obtain multiple directional sub-maps
  • a feature acquisition module configured to acquire the directional frequency band fusion feature of the first target area according to the plurality of directional submaps
  • an input module configured to input the directional frequency band fusion feature into the target detection model, perform prediction based on the directional frequency band fusion feature through the target detection model, and obtain the predicted label of the first target area, where the predicted label is used to indicate whether the first target area includes a target object.
  • a computer device comprising one or more processors and one or more memories, wherein the one or more memories store at least one computer program that is loaded and executed by the one or more processors to implement the target detection method.
  • a computer-readable storage medium is provided, and at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the target detection method.
  • a computer program product or computer program comprising program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes it, so that the computer device performs the above target detection method.
  • FIG. 1 is a schematic diagram of an implementation environment of a target detection method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of marking game characters in a game scene provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a target detection method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a target detection method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a frequency band filter provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a frequency band filter provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a directional filter provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a directional filter provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an interface provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of a target detection method provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a training method for a target detection model provided by an embodiment of the present application.
  • FIG. 12 is a flowchart of a target detection method provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the computer equipment uses the frame difference method to determine the position of the moving target, that is, a difference image is obtained by taking the difference between two or more consecutive frames. Since the pixel values of the background become small or zero after the subtraction while the pixel values of the moving object remain large, the computer device can detect the moving target by binarizing the difference image.
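  • For illustration, the following is a minimal NumPy sketch of the frame difference method described above; the threshold value is an illustrative assumption, not a value from the patent.

```python
# Frame-difference sketch: subtract consecutive frames and binarize the result.
import numpy as np

def detect_motion(frame_prev: np.ndarray, frame_next: np.ndarray, thresh: int = 30) -> np.ndarray:
    """Return a binary mask where pixel differences exceed `thresh`."""
    # Absolute difference: background pixels yield small values, moving objects large ones.
    diff = np.abs(frame_next.astype(np.int16) - frame_prev.astype(np.int16)).astype(np.uint8)
    # Binarize the difference image to expose the moving target.
    return (diff > thresh).astype(np.uint8) * 255
```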
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other techniques.
  • Normalization is a method of mapping sequences with different value ranges into the (0, 1) interval, which is convenient for data processing.
  • In some embodiments, the normalized values can be directly interpreted as probabilities.
  • Functions that can realize normalization include the soft maximization (Softmax) function and the sigmoid growth curve (Sigmoid), and certainly also other functions that can realize normalization, which are not limited in this embodiment of the present application.
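  • As an illustration, minimal NumPy versions of the two normalization functions named above (a sketch, not the patent's formulas):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

print(softmax([7.0, 8.0]))  # values in (0, 1) that sum to 1
print(sigmoid([0.0, 2.0]))  # values in (0, 1)
```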
  • FIG. 1 is a schematic diagram of an implementation environment of a target detection method provided by an embodiment of the present application.
  • the implementation environment may include a terminal 110 and a server 140 .
  • the terminal 110 is connected to the server 140 through a wireless network or a wired network.
  • the terminal 110 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, etc., but is not limited thereto.
  • the terminal 110 has an application program supporting image display installed and running.
  • the server is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
  • the terminal 110 generally refers to one of multiple terminals, and only the terminal 110 is used as an example in this embodiment of the present application.
  • the number of the above-mentioned terminals may be greater or smaller. For example, there may be only one terminal, or tens or hundreds of terminals, or more; in the latter case, the above implementation environment also includes other terminals.
  • the embodiments of the present application do not limit the number of terminals and device types.
  • the target detection method provided by the embodiment of the present application can be applied to the detection scene of a game target, that is, during the game, the method can realize the detection and tracking of the game target in the game scene.
  • In a card game, players can release game characters corresponding to different cards in the game scene, and the game characters can move or make attack actions in the game scene.
  • the terminal can detect and track the moving game character by using the target detection method provided by the embodiment of the present application, so as to obtain the coordinate information of the game character in real time and record the game character's motion trajectory,
  • which is used by technicians to analyze and test the game and find loopholes in the game in time.
  • FIG. 2 shows a callout box 201 marking the player's own game character.
  • In a MOBA (Multiplayer Online Battle Arena) game, players can control different game characters in the game scene, that is, control the game characters to move or to perform actions that release virtual skills in the game scene.
  • the terminal can detect and track the game characters by using the target detection method provided in the embodiment of the present application, so as to obtain the coordinate information of the game characters in real time, and record the movement trajectory of the game characters.
  • the movement trajectories of the game characters can not only allow the technicians to analyze and test the game and find the loopholes in the game in time, but also allow the technicians to generate a game record of a single game session,
  • that is, the trajectories of the game characters in that session.
  • Players can quickly review their performance in the game through this game record without viewing the complete game video, so the efficiency of human-computer interaction is high.
  • the terminal can detect and track the game characters controlled by the two players through the target detection method provided by the embodiment of the present application, so as to obtain the coordinate information of the game characters in real time, and record the movement trajectory of the game characters.
  • the movement trajectories of the game characters can be provided to the technicians to analyze and test the game and to eliminate some game anomalies, such as the abnormal disappearance of a game character or a game character failing to disappear after death, so that game abnormalities can be found quickly and repaired in time.
  • the target detection method provided in the embodiment of the present application can be applied to the scene of person detection and tracking, that is, when a technician needs to use a surveillance video to track a target person, the terminal can use
  • the target detection method provided by the embodiment of the present application to detect and track the target person in the surveillance video.
  • the terminal can highlight the target person in the surveillance video while playing it, which is convenient for technicians to observe and record.
  • the target detection method provided in the embodiment of the present application can be applied to the scene of vehicle detection and tracking, that is, when a technician needs to use an aerial video to track a target vehicle, the terminal can use
  • the target detection method provided by the embodiment of the present application to detect and track the target vehicle in the aerial video.
  • the terminal can highlight the target vehicle in the aerial video, which is convenient for the technician to track the vehicle.
  • the target detection method provided in this embodiment of the present application can also be applied to other target detection scenarios, such as animal detection or aircraft detection, which are not limited in this embodiment of the present application.
  • the technical solutions provided by the embodiments of the present application may be implemented with a server or a terminal as the execution body, or through the interaction between the terminal and the server, that is, the server acts as the backend,
  • sends the processed data to the terminal, and the terminal displays the processing result to the user, which is not limited in this embodiment of the present application.
  • the following will take the execution subject as the terminal as an example for description.
  • FIG. 3 is a flowchart of a target detection method provided by an embodiment of the present application. Referring to FIG. 3 , the method includes:
  • the terminal performs multi-band filtering on a first target area to obtain multiple frequency band submaps, where the first target area is an area to be detected in the first video frame.
  • the frequency band filtering refers to filtering the first target area with a specific frequency range, and
  • the multi-band filtering refers to filtering the first target area with multiple frequency ranges. For example, given a frequency range (500, 800), filtering the first target area with the frequency range (500, 800) means that the part of the first target area within the frequency range (500, 800) is reserved and the part outside the frequency range (500, 800) is deleted.
  • the terminal filters through multiple frequency bands, that is, decomposes the first target area into multiple frequency band submaps through multiple frequency ranges, where a frequency band submap is an image whose filtered frequencies lie in the corresponding frequency range.
  • when the terminal uses the frequency range (500, 800) to filter the first target area, the image obtained after filtering is a frequency band submap whose frequencies are within the frequency range (500, 800); when the terminal uses the frequency range (200, 300) to filter the first target area, the image obtained after filtering is a frequency band submap whose frequencies are within the frequency range (200, 300).
  • different frequency band submaps are used to record different image features.
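  • For illustration, the following is a minimal sketch of frequency-range filtering via FFT masks; the concrete radial frequency ranges are illustrative assumptions, not the patent's ranges.

```python
# Multi-band filtering sketch: keep only frequency components in a given range.
import numpy as np

def band_submap(region: np.ndarray, r_lo: float, r_hi: float) -> np.ndarray:
    """Keep only frequency components whose radial frequency lies in [r_lo, r_hi)."""
    F = np.fft.fftshift(np.fft.fft2(region))
    h, w = region.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.hypot(yy, xx)                      # distance from the zero frequency
    mask = (radius >= r_lo) & (radius < r_hi)      # the frequency range to reserve
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

region = np.random.rand(64, 64)
submaps = [band_submap(region, lo, hi) for lo, hi in [(0, 8), (8, 16), (16, 32)]]
```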
  • the terminal performs multi-directional filtering on the multiple frequency band sub-maps to obtain multiple directional sub-maps.
  • the directional filtering refers to filtering the frequency band submap in a specific direction. Taking four directions as an example, if due west is marked as 0°, then due north is 90°, due east is 180°, and due south is 270°; 0° to 90° is the first direction, 90° to 180° is the second direction, 180° to 270° is the third direction, and 270° to 360° (0°) is the fourth direction.
  • Multi-direction filtering refers to filtering the frequency band subgraph in multiple directions, that is, the process of decomposing the frequency band subgraph into multiple direction subgraphs.
  • each frequency band subgraph is decomposed into directional subgraphs in four directions, so N frequency band subgraphs are decomposed into 4N directional subgraphs, where N is a positive integer.
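  • For illustration, a sketch of splitting a frequency band submap into the four direction ranges defined above via angular masks in the frequency domain; this is an illustrative stand-in, not the patent's exact directional filters.

```python
# Multi-directional filtering sketch: one angular mask per direction range.
import numpy as np

def direction_submaps(band: np.ndarray):
    F = np.fft.fftshift(np.fft.fft2(band))
    h, w = band.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    angle = np.degrees(np.arctan2(yy, xx)) % 360   # orientation of each frequency sample
    subs = []
    for lo in (0, 90, 180, 270):                   # the four direction ranges
        mask = (angle >= lo) & (angle < lo + 90)
        subs.append(np.real(np.fft.ifft2(np.fft.ifftshift(F * mask))))
    return subs                                    # N band submaps -> 4N direction submaps
```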
  • the terminal acquires the directional frequency band fusion feature of the first target area according to the multiple directional submaps.
  • the multiple directional sub-maps are multiple directional sub-maps of multiple frequency band sub-maps
  • the directional-band fusion feature fused with the directional information and the frequency information can be obtained according to the multiple directional sub-maps.
  • the terminal inputs the directional frequency band fusion feature into the target detection model, performs prediction based on the directional frequency band fusion feature through the target detection model, and obtains a predicted label of the first target area, where the predicted label is used to indicate whether the first target area includes a target object.
  • the target detection model is a model trained based on sample video frames and labels corresponding to the sample video frames, and has the ability to identify target objects in the video frames.
  • the target detection model is Faster R-CNN (Faster Region-based Convolutional Neural Network), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), a decision tree model, or Adaboost (adaptive boosting), etc., which are not limited in this embodiment of the present application.
  • the terminal can perform band filtering and direction filtering on the first target area of the first video frame to obtain multiple frequency band submaps representing the frequency band information of the first target area and multiple directional submaps representing the direction information of the first target area.
  • the direction frequency band fusion feature obtained by the terminal fuses the frequency band information and the direction information, and can therefore represent the characteristics of the first target area more completely. Even if the target object moves slowly or rotates, the direction frequency band fusion feature can still accurately represent
  • the features of the first target area. When the terminal subsequently uses the target detection model to perform target detection based on the directional frequency band fusion feature, a more accurate detection effect can be obtained.
  • FIG. 4 is a flowchart of a target detection method provided by an embodiment of the present application. Referring to FIG. 4 , the method includes:
  • the terminal determines a first target area from a first video frame, where the first target area is an area to be detected in the first video frame.
  • the first video frame is a video frame in a video containing a target object, and the first target area is an area where a target object may exist, and the target object is also the target of the target detection method.
  • the target object is the game character; in the character detection and tracking scene, the target object is the target character; in the vehicle detection and tracking scene, the target object is the target vehicle.
  • the terminal determines a second target area, where the second target area is the area where the target object is located in a second video frame, and the second video frame is a video frame whose display time is before the first video frame.
  • the terminal offsets the second target area based on the first video frame and the second video frame to obtain a first target area, where the first target area is an area corresponding to the shifted second target area in the first video frame.
  • the first video frame and the second video frame are adjacent video frames in the same video, and the positions of the target object in adjacent video frames are often close to each other.
  • after determining that the target object is in the second target area in the second video frame, the terminal offsets the second target area and maps the shifted second target area to the first video frame, so that
  • a first target area is obtained, and the first target area is an area that may contain the target object.
  • the terminal can determine the second target area centered on the center coordinate (10, 15).
  • if the second target area is a square area with a side length of 4, the coordinates of the lower left corner of the second target area are (8, 13) and the coordinates of the upper right corner are (12, 17);
  • the terminal can uniquely determine the second target area through the coordinates of the lower left corner and the upper right corner.
  • the terminal offsets the second target area based on the first video frame and the second video frame, and obtains the center coordinates of the offset second target area, such as (12, 15).
  • the coordinates of the lower left corner of the second target area are (10, 13), and the coordinates of the upper right corner are (14, 17).
  • the terminal maps the shifted second target area to the first video frame, and obtains the first target area in the first video frame, that is, the first target area is determined in the first video frame.
  • In some embodiments, the number of first target areas is multiple, that is, the terminal can perform multiple offsets on the second target area to obtain multiple offset second target areas.
  • the terminal maps the multiple offset second target areas to the first video frame, thus obtaining multiple first target areas, and the terminal can subsequently perform target detection based on the multiple first target areas, that is, game character detection in the game scene. It should be noted that, in addition to inferring the second target area from the center coordinates of the game character in the second video frame, the terminal can directly determine the second target area in the second video frame: after determining the second target area where the game character is located, the terminal can store the coordinates of the second target area in the cache and subsequently read them directly from the cache. In this way the terminal does not need to perform a reverse calculation, which reduces the calculation amount of the terminal and improves its calculation efficiency.
  • Take the case where the display time of the second video frame is 0.5 s earlier than that of the first video frame as an example.
  • a second target area centered on the center coordinate (0, 2) can be determined.
  • the second target area is a rectangular area, and the length of the rectangular area is 4 and the width is 2, the coordinates of the lower left corner of the second target area are (-2, 1), and the coordinates of the upper right corner are (2, 3) .
  • the terminal can uniquely determine the second target area through the coordinates of the lower left corner and the upper right corner.
  • the terminal offsets the second target area based on the first video frame and the second video frame, and obtains the center coordinates of the second target area after the offset, such as (2, 2).
  • the coordinates of the lower left corner of the second target area are (0, 2), and the coordinates of the upper right corner are (4, 3).
  • the terminal maps the shifted second target area to the first video frame, and obtains the first target area in the first video frame, that is, the first target area is determined in the first video frame.
  • In some embodiments, the number of first target areas is multiple, that is, the terminal can perform multiple offsets on the second target area to obtain multiple offset second target areas, and the terminal maps
  • the multiple offset second target areas to the first video frame to obtain a plurality of first target areas; the terminal can subsequently perform target detection based on the plurality of first target areas, that is, target vehicle detection.
  • In addition to inferring the second target area from the center coordinates of the target vehicle in the second video frame, the terminal can directly determine the second target area in the second video frame:
  • the coordinates of the second target area can be stored in the cache, and the second target area of the target vehicle in the second video frame can subsequently be read directly from the cache, so that
  • the terminal does not need to perform a reverse calculation, which reduces the calculation amount of the terminal and improves its calculation efficiency.
  • the terminal determines, based on the display time difference between the first video frame and the second video frame, the distance by which to offset the second target area; the offset distance is positively correlated with the display time difference, which is the time difference between displaying the first video frame and displaying the second video frame when the video is played.
  • the greater the display time difference between the first video frame and the second video frame, the greater the position change of the target object in the video frame.
  • the display time difference between the first video frame and the second video frame is therefore used to determine the offset distance for the second target area, so as to improve the probability that the first target area contains the target object, thereby improving the efficiency of target detection.
  • In some embodiments, the terminal uses an inverse proportional function to determine, based on the display time difference, the distance to offset the second target area. For example, the terminal can determine the distance through formula (1):
  • y = K / x        (1)
  • where y is the distance to offset the second target area, x is the display time difference between the first video frame and the second video frame, and K is an inverse proportional constant set by the technician according to the actual situation, K ≠ 0.
  • In addition to determining the distance for offsetting the second target area through an inverse proportional function of the display time difference between the first video frame and the second video frame, the terminal can also use other functions to determine the offset distance, which is not limited in this embodiment of the present application. A sketch of the offset computation follows below.
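```python
# Sketch of offsetting the second target area by a distance derived from the
# display time difference, following formula (1); the value of K and the choice
# of offsetting along x are illustrative assumptions, not the patent's settings.
def offset_box(box, time_diff, K=1.0):
    """box = (x_left, y_bottom, x_right, y_top); returns the shifted box."""
    dist = K / time_diff            # formula (1): y = K / x, K != 0
    x0, y0, x1, y1 = box
    return (x0 + dist, y0, x1 + dist, y1)   # shift along x as an example

# Second target area with corners (8, 13) and (12, 17), shifted by 2 -> (10, 13), (14, 17)
print(offset_box((8, 13, 12, 17), time_diff=0.5, K=1.0))
```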
  • In the case where the first video frame is the first video frame in the video, the terminal can use the template image of the target object to perform template matching on the first video frame, and
  • the first target area is determined in the first video frame.
  • That is, the terminal can perform template matching on the first video frame using the template image of the target object to obtain the first target area to be detected. Since template matching is highly efficient, the efficiency of determining the first target area is also high.
  • In some embodiments, the terminal slides the template image of the target object over the first video frame, obtains the similarity between the template image and the multiple areas it covers on the first video frame, and determines the area with the highest similarity as the first target area to be detected, as sketched below.
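```python
# Sliding template-match sketch using OpenCV's matchTemplate; OpenCV is an
# assumed dependency here (the patent does not name a library), and the frame
# and template are synthetic stand-ins.
import cv2
import numpy as np

frame = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
templ = frame[50:82, 60:92].copy()          # template cut out of the frame

# Similarity between the template and every area it covers while sliding.
scores = cv2.matchTemplate(frame, templ, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

# The most similar covered area is taken as the first target area.
h, w = templ.shape
first_target_area = (max_loc[0], max_loc[1], max_loc[0] + w, max_loc[1] + h)
print(first_target_area, max_val)           # box around (60, 50), similarity near 1.0
```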
  • the terminal performs multi-band filtering on the first target region to be detected in the first video frame to obtain multiple frequency band submaps.
  • the terminal inputs the first target region into a band filter bank, and performs multi-band filtering on the first target region through multiple band filters in the band filter bank to obtain multiple band submaps .
  • In some embodiments, the band filters are non-subsampled pyramid filters.
  • Downsampling is the process of reducing the resolution of an image:
  • the process of downsampling a reference image is the process of reducing the resolution of the reference image, or the process of extracting some pixels from the reference image to obtain a new reference image.
  • For example, the terminal extracts the pixels in the odd rows and odd columns from a reference image with a resolution of 512 × 512 and recombines the extracted pixels to obtain a new reference image, so that
  • the resolution of the image becomes 256 × 256, as in the one-line slice below.
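```python
# The odd-row/odd-column extraction described above, as a NumPy slice.
import numpy as np

reference = np.random.rand(512, 512)
downsampled = reference[::2, ::2]     # keep every other row and column
print(downsampled.shape)              # (256, 256)
```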
  • Example 1 The terminal performs time-frequency transformation on the first video frame to obtain a first frequency domain image of the first video frame.
  • the terminal inputs the first target region in the first frequency domain image into a band filter bank, where the band filter bank includes a plurality of band filters, and each band filter corresponds to a different frequency range.
  • the terminal filters the first target region through the multiple band filters in the band filter bank and outputs multiple band submaps from the multiple band filters, each band submap corresponding to one band filter.
  • Since the filters correspond to different frequency ranges, each frequency band submap is a frequency domain image in a different frequency range.
  • Referring to FIG. 5, a schematic image of a band filter bank 501 is provided; the band filter bank 501 includes a band filter 5011, a band filter 5012, and a band filter 5013.
  • the three band filters correspond to different frequency ranges.
  • the terminal can decompose the first target area into frequency band subgraph A, frequency band subgraph B, and frequency band subgraph C through three frequency band filters, and frequency band subgraph A, frequency band subgraph B, and frequency band subgraph C respectively correspond to different frequency ranges.
  • the following will further describe the method for the terminal to perform multi-band filtering on the first target area through the band filter bank to obtain multiple frequency band sub-maps.
  • the terminal performs fast Fourier transform on the first video frame to obtain a first frequency domain image of the first video frame.
  • the terminal inputs the first target area in the first video frame into the band filter bank, that is, inputs the first target area of the first frequency domain image into the band filter bank.
  • the band filter bank includes three band filters: band filter 5011, band filter 5012, and band filter 5013, each with its own filter matrix (the specific matrices are shown in the figures of the original publication and are not reproduced here). The terminal performs edge filling on the first target area to obtain a filled image of the first target area.
  • the terminal applies the filter matrices of the three band filters to the filled image of the first target area to perform band filtering, obtaining three band submaps.
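  • For illustration, a sketch of this procedure with placeholder 3 × 3 filter matrices (the patent's actual matrices are given in its figures and are not reproduced here); SciPy is an assumed dependency.

```python
# Example 1 sketch: FFT of the frame, then edge-padded filtering of the target
# region with three band-filter matrices.
import numpy as np
from scipy.signal import convolve2d

first_frame = np.random.rand(128, 128)
freq_image = np.fft.fft2(first_frame)            # time-frequency transform
target_region = np.abs(freq_image[:32, :32])     # illustrative first target area

filters = [np.full((3, 3), 1 / 9.0),                                  # placeholder for 5011
           np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], float),    # placeholder for 5012
           np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)]     # placeholder for 5013

# 'symm' boundary handling plays the role of the edge filling described above.
band_submaps = [convolve2d(target_region, f, mode="same", boundary="symm") for f in filters]
```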
  • Example 2 The terminal performs time-frequency transformation on the first video frame to obtain a first frequency domain image of the first video frame.
  • the terminal inputs the first target region in the first frequency domain image into a band filter bank, where the band filter bank includes a plurality of band filters with different levels.
  • the terminal performs multi-stage filtering on the first target region through a plurality of frequency band filters with different levels to obtain a plurality of frequency band submaps.
  • Referring to FIG. 6, another structure of a band filter bank 600 is provided; the band filter bank includes a first low-pass filter 601 and a first high-pass filter 602.
  • a second low-pass filter 603 and a second high-pass filter 604 are cascaded after the first low-pass filter 601.
  • a third low-pass filter 605 and a third high-pass filter 606 are further cascaded after the second low-pass filter 603, and so on.
  • the first low-pass filter 601, the second low-pass filter 603 and the third low-pass filter 605 also constitute three levels of low-pass filters;
  • the first high-pass filter 602, the second high-pass filter 604 and the third high-pass filter 606 also constitute a three-level high-pass filter.
  • the terminal can decompose the first target region into the frequency band subgraph D, the frequency band subgraph E, the frequency band subgraph F, and the frequency band subgraph G through the multi-level frequency band filter as shown in FIG. 6 .
  • the following will further describe the method for the terminal to perform multi-band filtering on the first target area through the band filter bank to obtain multiple frequency band submaps.
  • the terminal performs fast Fourier transform on the first video frame to obtain a first frequency domain image of the first video frame.
  • the terminal inputs the first target area in the first frequency domain image into the band filter bank.
  • the band filter bank includes 6 band filters: a first low-pass filter 601, a second low-pass filter 603, a third low-pass filter 605, a first high-pass filter 602, a second high-pass filter 604, and a third high-pass filter 606.
  • the terminal performs edge filling on the first target area to obtain a filled image of the first target area.
  • the terminal applies the first low-pass filter 601 and the first high-pass filter 602 to the filled image of the first target area to perform band filtering, obtaining a first low-pass submap and a first high-pass submap.
  • the terminal performs edge filling on the first low-pass submap and inputs the filled image into the second low-pass filter 603 and the second high-pass filter 604 respectively to obtain a second low-pass submap and a second high-pass submap; the terminal then performs edge filling on the second low-pass submap to obtain a filled image of the second low-pass submap.
  • the terminal inputs the filled image of the second low-pass submap into the third low-pass filter 605 and the third high-pass filter 606 respectively to obtain a third low-pass submap and a third high-pass submap.
  • the third low-pass submap, the third high-pass submap, the second high-pass submap, and the first high-pass submap correspond to frequency band submap D, frequency band submap E, frequency band submap F, and frequency band submap G, respectively. Experiments show that the contribution of the low-pass submap to model classification is not obvious, so the terminal can ignore frequency band submap D, that is, the third low-pass submap.
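  • For illustration, a minimal sketch of such a cascaded, non-subsampled low-pass/high-pass decomposition; the Gaussian low-pass kernels are an assumption, not the patent's filters.

```python
# Example 2 sketch: each level splits the running low-pass output into a
# high-pass submap and a new low-pass output, with no subsampling.
import numpy as np
from scipy.ndimage import gaussian_filter

def ns_pyramid(region: np.ndarray, levels: int = 3):
    """Return per-level high-pass submaps (G, F, E) plus the final low-pass (D)."""
    submaps, low = [], region
    for _ in range(levels):
        smoothed = gaussian_filter(low, sigma=1.0)   # low-pass (resolution unchanged)
        submaps.append(low - smoothed)               # high-pass submap at this level
        low = smoothed                               # feed low-pass output to the next level
    submaps.append(low)                              # final low-pass submap (often ignored)
    return submaps

submaps = ns_pyramid(np.random.rand(64, 64))
```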
  • the terminal performs multi-directional filtering on the multiple frequency band sub-maps to obtain multiple directional sub-maps.
  • the terminal inputs the frequency band subgraph into the directional filter bank, and performs multi-directional filtering on the frequency band subgraph through the plurality of directional filters in the directional filter bank to obtain multiple directional features corresponding to the frequency band subgraph.
  • the terminal inputs each directional feature into the corresponding reconstruction filter in the directional filter bank.
  • the terminal generates a plurality of directional sub-maps of the frequency band sub-map based on the input directional feature through the reconstruction filter.
  • the directional filter and the reconstruction filter are both non-subsampling filters.
  • the terminal can further divide each frequency band submap according to direction to obtain multiple directional submaps of each frequency band submap, and subsequently the characteristics of the first target area can be fully expressed based on the multiple directional submaps corresponding to the multiple frequency band submaps.
  • Referring to FIG. 7, the directional filter bank 701 includes a directional filter 702 and a reconstruction filter 703.
  • Referring to FIG. 8, the directional filter 702 has first-stage wedge filters 801 and 802; in wedge filter 801 and wedge filter 802, the white region indicates the directions that the filter allows to pass, and the gray region indicates the directions that the filter does not allow to pass.
  • the terminal can decompose the first frequency band submap into directional submaps in four directions.
  • the reconstruction filter 703 has a symmetric structure with the direction filter 702, that is, the input of the reconstruction filter 703 is four direction subgraphs, and the output is a synthesized frequency band subgraph.
  • the terminal inputs the frequency band submap into the directional filter bank 701 and performs directional filtering on the frequency band submap through the directional filter 702 in the directional filter bank.
  • Specifically, the terminal inputs the frequency band submap into wedge filter 801 and performs directional filtering through the wedge filter matrix of wedge filter 801, that is, multiplies the wedge filter matrix with the values at the corresponding positions of the frequency band submap to obtain a first-level directional feature.
  • the terminal then inputs the first-level directional feature into a second-stage square filter and
  • performs directional filtering on the first-level directional feature, that is, multiplies the square filter matrix with the values at the corresponding positions of the first-level directional feature to obtain a second-level directional feature.
  • the terminal inputs the second-level directional feature into the reconstruction filter 703, and through the reconstruction filter generates a directional submap of the frequency band submap based on the input second-level directional feature.
  • the processing through wedge filter 802 and the processing through wedge filter 801 belong to the same inventive concept, as does the processing through square filters 8011 and 8012, and details are not repeated here.
  • In this way, the terminal decomposes one frequency band submap into four directional submaps. For multiple frequency band subgraphs, each frequency band subgraph is decomposed into 4 direction subgraphs; in the case of N frequency band subgraphs, 4N direction subgraphs can be obtained, where N is a positive integer.
  • the above description takes the terminal dividing each frequency band submap into 4 directional submaps as an example.
  • In other embodiments, the terminal can also divide each frequency band submap into 8 or more directional submaps by adding filters of different shapes after the square filter 8011, the square filter 8012, the square filter 8021, and the square filter 8022; the method of dividing into 8 directional submaps
  • and the above-mentioned method of dividing into four directional submaps belong to the same inventive concept, and details are not repeated here.
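  • For illustration, the following sketch shows the reconstruction property mentioned above: when the directional masks partition the frequency plane, the four direction submaps synthesize the band submap again. The angular masks are a simplified stand-in for the wedge/square filters of FIG. 7 and FIG. 8.

```python
# Decompose a band submap into four directional submaps and verify that their
# sum reconstructs the band submap (the role of reconstruction filter 703).
import numpy as np

band = np.random.rand(64, 64)
F = np.fft.fftshift(np.fft.fft2(band))
h, w = band.shape
yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
angle = np.degrees(np.arctan2(yy, xx)) % 360

direction_submaps = []
for lo in (0, 90, 180, 270):
    mask = (angle >= lo) & (angle < lo + 90)         # wedge-shaped pass region
    direction_submaps.append(np.real(np.fft.ifft2(np.fft.ifftshift(F * mask))))

print(np.allclose(sum(direction_submaps), band))     # True: masks partition the plane
```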
  • the terminal acquires the directional frequency band fusion feature of the first target area according to the multiple directional submaps.
  • the terminal acquires the energy of multiple first direction submaps corresponding to any one of the multiple frequency band submaps.
  • the terminal sorts the plurality of first direction sub-maps in order of energy from large to small to obtain a plurality of reference direction sub-maps, where the plurality of reference direction sub-maps are the sorted first direction sub-maps.
  • the terminal obtains the directional band fusion feature of the first target area based on the multiple reference direction submaps and the multiple second direction submaps, where the second direction submaps are
  • the direction submaps corresponding to the other frequency band submaps, that is, the frequency band submaps other than the selected frequency band submap among the multiple frequency band submaps.
  • the terminal only needs to obtain and sort the multiple directional submaps corresponding to any one of the multiple frequency band submaps to obtain the directional band fusion feature of the first target area; it is not necessary to process the directional
  • features of each frequency band submap one by one, which reduces the calculation amount of the terminal.
  • the above takes the case where the terminal uses the sum of squares of the values in a direction submap to represent the energy of that direction submap as an example. In other possible implementations, other forms can also be used to
  • represent the energy of the corresponding direction subgraph, for example the two-norm of the values in the direction subgraph (the square root of the sum of squares), which is not limited in this embodiment of the present application.
  • For example, the terminal determines any frequency band subgraph from the multiple frequency band subgraphs and obtains the four first-direction subgraphs corresponding to that frequency band subgraph. The terminal then obtains the energies of the four first-direction subgraphs; in some embodiments, the energy is the sum of the squares of the values in the subgraph. In this example, the energies of the four first-direction subgraphs are 1, 2, 7, and 8, respectively.
  • the following describes how the terminal sorts the plurality of first-direction submaps in descending order of energy to obtain the plurality of reference direction submaps.
  • the terminal sorts the four first-direction subgraphs by energy to obtain the four sorted first-direction subgraphs (with energies 8, 7, 2, and 1);
  • the four sorted first-direction subgraphs are the four reference direction subgraphs.
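  • For illustration, a sketch of the energy computation and descending sort, with synthetic submaps constructed so that their energies reproduce the example values 1, 2, 7, and 8.

```python
# Energy = sum of squares of the submap values; sort submaps by descending energy.
import numpy as np

def energy(submap: np.ndarray) -> float:
    return float(np.sum(submap ** 2))

first_dir_submaps = [np.full((2, 2), v) for v in (0.5, np.sqrt(0.5), np.sqrt(7) / 2, np.sqrt(2))]
energies = [energy(s) for s in first_dir_submaps]          # -> [1.0, 2.0, 7.0, 8.0]
order = np.argsort(energies)[::-1]                         # descending energy
reference_dir_submaps = [first_dir_submaps[i] for i in order]
print(sorted(energies, reverse=True))                      # [8.0, 7.0, 2.0, 1.0]
```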
  • the following describes a method for the terminal to obtain the directional frequency band fusion feature of the first target area based on the multiple reference direction submaps and the multiple second direction submaps.
  • the following describes the relationship between the frequency band submap, a plurality of frequency band submaps, and other frequency band submaps.
  • the frequency band subgraph is any frequency band subgraph in the plurality of frequency band subgraphs
  • the other frequency band subgraphs are the frequency band subgraphs other than the frequency band subgraph in the plurality of frequency band subgraphs.
  • the terminal performs weighted fusion of the multiple reference direction submaps based on the multiple fusion weights corresponding to the multiple reference direction submaps to obtain a first direction fusion map corresponding to the frequency band submap; each fusion weight is positively related to the energy of the corresponding reference direction submap, that is, positively related to the energy of the corresponding first direction submap.
  • the terminal fuses the multiple second-direction sub-maps respectively to obtain multiple second-direction fusion maps corresponding to the other frequency band sub-maps.
  • the terminal obtains the directional frequency band fusion feature based on the first directional fusion map and the plurality of second directional fusion maps.
  • the terminal can obtain the direction frequency band fusion feature based on the first direction fusion map and multiple second direction fusion maps through the following steps:
  • the terminal obtains a first integral graph corresponding to the first direction fusion graph; the terminal obtains a plurality of second integral graphs corresponding to the plurality of second direction fusion graphs; the terminal splices the first integral eigenvector with the plurality of second integral eigenvectors to obtain the directional band fusion feature of the first target area, where the first integral eigenvector is the eigenvector corresponding to the first integral graph and the second integral eigenvectors are
  • the eigenvectors corresponding to the second integral graphs.
  • As for the integral graph: since an image is composed of a series of discrete pixels, the integral of the image is actually a summation.
  • the value of each point in the integral graph is the sum of all pixel values in the upper left corner of the point in the original image.
  • the value of each point in the first integral graph is the sum of all the values in the upper left corner of that point in the first direction fusion graph, and
  • the value of each point in a second integral graph is the sum of all values in the upper left corner of that point in the corresponding second direction fusion graph.
  • the integral eigenvector is also the vector obtained by arranging the values in the integral graph in the order from left to right and from top to bottom.
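  • For illustration, a minimal sketch of the integral graph and its flattened integral eigenvector:

```python
# Integral graph via cumulative sums; eigenvector read left-to-right, top-to-bottom.
import numpy as np

def integral_graph(img: np.ndarray) -> np.ndarray:
    """Each point holds the sum of all values above and to the left of it (inclusive)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def integral_eigenvector(img: np.ndarray) -> np.ndarray:
    return integral_graph(img).ravel()

fusion_map = np.array([[1.0, 1.0], [1.0, 1.0]])
print(integral_eigenvector(fusion_map))   # [1. 2. 2. 4.]
```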
  • For example, the terminal determines, according to the four reference direction subgraphs of the frequency band subgraph, the fusion weights corresponding to the four reference direction subgraphs.
  • the energies of the four reference direction submaps are 8, 7, 2, and 1, respectively.
  • the terminal normalizes the energies (8, 7, 2, 1) corresponding to the four reference direction submaps to obtain the corresponding fusion weights (0.44, 0.39, 0.11, 0.06).
  • Based on the fusion weights (0.44, 0.39, 0.11, 0.06), the terminal performs weighted fusion on the four reference direction submaps to obtain the first direction fusion map corresponding to the frequency band submap.
  • In the case where the number of frequency band submaps is two, the terminal performs directional filtering on the other frequency band submap to obtain
  • its four second-direction subgraphs, and, based on the previously determined fusion weights (0.44, 0.39, 0.11, 0.06), performs weighted fusion on the four second-direction submaps to obtain the second direction fusion map.
  • the terminal obtains the first integral graph corresponding to the first direction fusion map, and
  • the terminal obtains the second integral graphs corresponding to the second direction fusion maps.
  • the terminal splices
  • the first integral eigenvector corresponding to the first integral graph
  • with the second integral eigenvectors corresponding to the second integral graphs to obtain the directional frequency band fusion feature of the first target area.
  • For example, the terminal splices the first integral eigenvector, such as (1, 2, 3, 4), with
  • the second integral eigenvector, such as (2, 3, 4, 5), to obtain the directional frequency band fusion feature (1, 2, 3, 4, 2, 3, 4, 5) of the first target area.
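  • For illustration, an end-to-end sketch of this step with random stand-in submaps: energy-proportional fusion weights, weighted fusion, integral graphs, and feature splicing.

```python
# Fuse direction submaps with energy-proportional weights, then splice the
# integral eigenvectors of the fusion maps into one feature vector.
import numpy as np

def fuse(direction_submaps, weights):
    return sum(w * s for w, s in zip(weights, direction_submaps))

energies = np.array([8.0, 7.0, 2.0, 1.0])
weights = energies / energies.sum()                        # about (0.44, 0.39, 0.11, 0.06)

ref_submaps = [np.random.rand(4, 4) for _ in range(4)]     # reference direction submaps
second_submaps = [np.random.rand(4, 4) for _ in range(4)]  # other band's direction submaps

img1 = fuse(ref_submaps, weights)                          # first direction fusion map
img2 = fuse(second_submaps, weights)                       # second direction fusion map

int1 = img1.cumsum(0).cumsum(1).ravel()                    # first integral eigenvector
int2 = img2.cumsum(0).cumsum(1).ravel()                    # second integral eigenvector
fusion_feature = np.concatenate([int1, int2])              # directional band fusion feature
```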
  • the terminal inputs the directional band fusion feature into the target detection model, and uses the target detection model to perform prediction based on the directional band fusion feature to obtain a predicted label of the first target area, where the predicted label is used to indicate whether the first target area includes a target object.
  • the target detection model includes multiple sub-models, and the multiple sub-models are independent of each other.
  • the terminal inputs the directional frequency band fusion feature into the target detection model, that is, inputting the directional frequency band fusion feature into multiple sub-models respectively.
  • the terminal performs prediction based on the directional frequency band fusion feature through the multiple sub-models, and outputs multiple prediction parameters corresponding to the multiple sub-models, where the prediction parameters are used to determine the corresponding prediction labels.
  • the terminal fuses multiple prediction parameters based on the confidence levels corresponding to the multiple sub-models to obtain the prediction label of the first target area, wherein the confidence level is positively correlated with the prediction accuracy of the corresponding sub-model during testing.
  • the object detection model is also referred to as an Adaboost model.
  • the terminal can predict independently through the multiple sub-models of the target detection model and fuse the prediction results of the multiple sub-models based on their confidences to obtain the final predicted label.
  • In this way, the prediction ability of multiple sub-models is used to avoid the problem that a prediction error of a single sub-model leads to an error in the overall prediction label, that is, overfitting of the model is avoided and the prediction ability of the target detection model is improved.
  • the target detection model includes 3 sub-models, each of which is an independently trained sub-model.
  • the terminal inputs the directional band fusion feature (1, 2, 3, 4, 2, 3, 4, 5) of the first target area into the 3 sub-models respectively, and each of the 3 sub-models applies a fully connected layer with its own weight matrix to the directional band fusion feature (1, 2, 3, 4, 2, 3, 4, 5), obtaining the 3 prediction parameters (7, 8), (8, 9), and (8, 12) corresponding to the 3 sub-models.
  • the terminal normalizes the three prediction parameters (7, 8), (8, 9), and (8, 12) through the three sub-models respectively, obtaining three probability vectors (0.46, 0.54), (0.47, 0.53), and (0.4, 0.6).
  • the terminal performs weighted fusion of the three probability vectors according to the three confidences 0.1, 0.2, and 0.7 corresponding to the three sub-models, obtaining the fused probability vector (0.42, 0.58), where 0.42 indicates that the probability that the first target area includes the target object is 42% and 0.58 indicates that the probability that it does not is 58%. If the terminal uses 0 as the predicted label indicating that the first target area includes the target object and 1 as the predicted label indicating that it does not, the terminal sets the predicted label of the first target area to 1.
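  • For illustration, a sketch of this confidence-weighted ensemble that reproduces the numbers in the example (sum normalization of the prediction parameters, confidences 0.1, 0.2, and 0.7):

```python
# Confidence-weighted fusion of per-sub-model probability vectors.
import numpy as np

pred_params = np.array([[7.0, 8.0], [8.0, 9.0], [8.0, 12.0]])   # one row per sub-model
probs = pred_params / pred_params.sum(axis=1, keepdims=True)    # sum normalization per row
confidences = np.array([0.1, 0.2, 0.7])

fused = confidences @ probs               # confidence-weighted fusion -> (0.42, 0.58)
label = int(np.argmax(fused))             # 0 = includes target, 1 = does not include
print(np.round(fused, 2), label)          # [0.42 0.58] 1
```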
  • the above description takes the Adaboost model as the target detection model as an example.
  • in other possible implementations, the target detection model may also be a model with another structure, such as a decision tree model or a convolutional network model.
  • the embodiments of this application do not limit this.
  • in response to the predicted label indicating that the first target area includes the target object, the terminal highlights the outline of the first target area in the first video frame.
  • the terminal can display the outline 902 of the first target area in the first video frame 901 , so that the technician can quickly determine the position of the target object in the first video frame 901 .
  • the terminal performs multi-band filtering on the first target region of the first video frame through NSP (Non-Subsampled Pyramid) to obtain multiple frequency band submaps.
  • the terminal performs multi-directional filtering on any frequency band submap s1 through NSDFB (Non-Subsampled Directional Filter Banks) to obtain the multiple directional submaps f1,1-f1,8 corresponding to s1.
  • the terminal sorts the 8 directional submaps f1,1-f1,8 according to their energies and obtains the 8 sorted directional submaps pic1,1-pic1,8, that is, 8 reference directional submaps.
  • the terminal obtains the 8 fusion weights a1-a8 corresponding to the 8 directional submaps according to the energies of f1,1-f1,8, and, based on the 8 fusion weights a1-a8, performs weighted fusion of the sorted directional submaps pic1,1-pic1,8 to obtain the first directional fusion map Img1.
  • the terminal integrates the 8 fusion weights a1-a8 into a fusion equivalent filter, and through the fusion equivalent filters, the multiple directional submaps of the other frequency band images can be quickly fused into the corresponding second directional fusion maps.
  • if the number of frequency band submaps is 3, the terminal can integrate the 8 fusion weights a1-a8 into a fusion equivalent filter hc2 and a fusion equivalent filter hc3.
  • the terminal processes the other two frequency band submaps through the fusion equivalent filters hc2 and hc3, obtaining the second directional fusion map Img2 and the third directional fusion map Img3 corresponding to the other two frequency band submaps.
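  The fusion equivalent filter relies on the linearity of convolution: weighting and summing eight directionally filtered outputs equals filtering once with the weighted sum of the eight kernels. A minimal Python/SciPy sketch, in which the 3×3 directional kernels and the input submap are hypothetical placeholders:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
band_submap = rng.standard_normal((16, 16))                    # one frequency band submap
dir_kernels = [rng.standard_normal((3, 3)) for _ in range(8)]  # 8 directional filters
weights = np.array([8, 7, 2, 1, 1, 1, 1, 1], dtype=float)
weights /= weights.sum()                                       # fusion weights a1..a8

# Direct route: filter in 8 directions, then weight and sum the outputs.
direct = sum(w * convolve2d(band_submap, k, mode="same")
             for w, k in zip(weights, dir_kernels))

# Equivalent route: collapse weights and kernels into one fused filter hc.
hc = sum(w * k for w, k in zip(weights, dir_kernels))
fused = convolve2d(band_submap, hc, mode="same")

print(np.allclose(direct, fused))  # -> True
```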
  • the terminal acquires the first integral map Int1 of the first directional fusion map Img1, the second integral map Int2 of the second directional fusion map Img2, and the third integral map Int3 of the third directional fusion map Img3.
  • the terminal splices the first integral feature vector of the first integral map Int1, the second integral feature vector of the second integral map Int2, and the third integral feature vector of the third integral map Int3 to obtain the directional band fusion feature X of the first target area.
  • the terminal can input the directional frequency band fusion feature X into the target detection model, and output the predicted label of the first target area through the target detection model. When the predicted label indicates that the first target area includes the target object, the terminal highlights the outline of the first target area in the first video frame.
  • the technical solutions provided by the embodiments of this application perform well in the recognition and trajectory tracking of game characters.
  • even while a game character rotates or is occluded, the technical solutions can correctly identify its position, which greatly improves the efficiency of numerical analysis of game characters.
  • the recognition success rate is 99% in card games and more than 95% in MOBA games.
  • the terminal can perform band filtering and directional filtering on the first target area of the first video frame, obtaining multiple frequency band submaps representing the band information of the first target area and multiple directional submaps representing its direction information.
  • the directional frequency band fusion feature obtained by fusing the band information and the direction information can represent the characteristics of the first target area more completely; even if the target object moves slowly or rotates, the directional frequency band fusion feature still accurately represents the characteristics of the first target area.
  • when the terminal subsequently performs target detection through the target detection model based on the directional frequency band fusion feature, a more accurate detection result can be obtained.
  • the terminal uses a non-subsampled pyramid filter when filtering the first target area. Because the non-subsampled pyramid filter removes the downsampling step, each output frequency band submap has the same size as the first target area, which eliminates the post-downsampling image enlargement and image registration steps and improves the accuracy of subsequent target detection.
  • non-subsampled filters are also used for directional filtering, which avoids image mismatch caused by scale transformation during directional filtering and likewise improves the accuracy of subsequent target detection.
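  The size-preserving property can be seen in a small sketch: "same"-mode convolution has no downsampling step, so every band submap keeps the size of the input region. The low-pass/high-pass pair below is a hypothetical stand-in for the pyramid's actual kernels.

```python
import numpy as np
from scipy.signal import convolve2d

region = np.random.default_rng(1).standard_normal((64, 64))  # first target area

lowpass = np.full((3, 3), 1 / 9.0)        # hypothetical low-pass kernel
impulse = np.zeros((3, 3)); impulse[1, 1] = 1.0
highpass = impulse - lowpass              # complementary high-pass kernel

low = convolve2d(region, lowpass, mode="same", boundary="symm")
high = convolve2d(region, highpass, mode="same", boundary="symm")

# No downsampling: both band submaps keep the input size, so no later image
# enlargement or registration step is needed before fusing features.
print(low.shape, high.shape)            # -> (64, 64) (64, 64)
print(np.allclose(low + high, region))  # the two bands reconstruct the input
```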
  • when obtaining the directional band fusion feature of the first target area, the terminal only needs to obtain and sort the multiple directional submaps corresponding to any one of the multiple frequency band submaps; it can then obtain the directional band fusion feature of the first target area without processing each band feature one by one, which reduces the computation load of the terminal.
  • when the terminal determines that the first target area in the first video frame includes the target object, it can highlight the outline of the first target area, which helps technicians find the position of the target object in time and improves the efficiency of human-computer interaction.
  • the target detection model can be trained by the terminal or by the server. Taking training by the server as an example, and referring to Figure 11, the method includes:
  • the server performs multi-band filtering on the sample region to be detected in the sample video frame to obtain multiple sample frequency band submaps.
  • the server can capture video images from different videos; technicians screen the captured video images and use those containing the target object as sample video frames.
  • technicians can also add a label to each video image, the label indicating whether the corresponding area in the sample video frame contains the target object.
  • the server can then use the labels as supervision to train the target detection model.
  • the terminal inputs the sample region into a band filter bank and performs multi-band filtering on the sample region through the multiple band filters in the bank, obtaining multiple sample frequency band submaps; the band filters are non-subsampled pyramid filters.
  • the server performs multi-directional filtering on the multiple sample frequency band submaps to obtain multiple sample orientation submaps.
  • the terminal inputs each sample frequency band submap into a directional filter bank and performs multi-directional filtering on it through the multiple directional filters in the bank, obtaining multiple directional features corresponding to the sample frequency band submap.
  • the terminal inputs each directional feature into the corresponding reconstruction filter in the directional filter bank and, through the reconstruction filter, generates multiple directional submaps of the sample frequency band submap based on the input directional feature; both the directional filters and the reconstruction filters are non-subsampled filters.
  • the server acquires the sample direction frequency band fusion feature of the sample region according to the multiple sample direction submaps.
  • the terminal acquires the energy of the multiple sample direction submaps corresponding to any one of the multiple sample frequency band submaps.
  • the terminal sorts the multiple sample direction submaps corresponding to that sample frequency band submap in descending order of energy.
  • the terminal obtains the sample direction frequency band fusion feature of the sample area based on the sorted sample direction submaps of that sample frequency band submap and the sample direction submaps corresponding to the other sample frequency band submaps.
  • the server trains the target detection model based on the sample direction frequency band fusion feature and the label of the sample area, where the label is used to indicate whether the sample area includes a sample object.
  • the server inputs the sample direction frequency band fusion feature into the target detection model, performs prediction based on the sample direction frequency band fusion feature through the target detection model, and outputs the predicted label of the sample area.
  • the server updates the model parameters of the target detection model based on the difference information between the predicted label and the label of the sample area.
  • the following description takes the target detection model as the Adaboost model as an example.
  • the Adaboost model includes multiple submodels Classifier t , where t is the number of submodels.
  • the server inputs the sample direction fusion feature t1 into the first sub-model Classifier 1; the first sub-model Classifier 1 fully connects and normalizes the sample direction fusion feature t1 and outputs the predicted sample label of the sample area, the predicted sample label indicating whether the sample area includes a sample object.
  • the server constructs a loss function representing the difference between the predicted sample label and the label of the sample region to update the model parameters of the first sub-model Classifier 1.
  • if the predicted sample label that the first sub-model Classifier 1 predicts based on the sample direction fusion feature t1 is the same as the label of the sample area, the prediction difficulty of t1 is relatively low, and the server can reduce the training weight w1 of the sample direction band fusion feature t1; if the predicted sample label is different from the label of the sample area, the prediction difficulty of t1 is relatively high, and the server can increase the training weight w1. The training weight determines how strongly the sample direction fusion feature updates the model parameters: the larger the training weight, the larger the parameter update amplitude.
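  A minimal sketch of this training-weight schedule follows; the halving and doubling factors are hypothetical, since the text only requires that a correctly predicted feature get a lighter weight and a misclassified one a heavier weight (classical AdaBoost derives the factors from the sub-model's weighted error instead).

```python
def update_training_weight(weight: float, predicted_label: int, true_label: int,
                           down: float = 0.5, up: float = 2.0) -> float:
    # Reduce the weight of easy (correctly predicted) samples and increase
    # the weight of hard (misclassified) samples for the next sub-model.
    return weight * (down if predicted_label == true_label else up)

w1 = 1.0
w1 = update_training_weight(w1, predicted_label=0, true_label=0)  # easy -> 0.5
w1 = update_training_weight(w1, predicted_label=1, true_label=0)  # hard -> 1.0
print(w1)
```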
  • after setting the training weight for this sample direction fusion feature, the server can obtain another sample video frame, perform the above steps 1101-1103 on it, and obtain its sample direction fusion feature t2.
  • the server initializes the second sub-model Classifier 2 and trains it based on the sample direction fusion feature t1, its corresponding training weight w1, and the sample direction fusion feature t2, and so on.
  • in this process, the training samples used by each subsequently trained sub-model are the sample direction fusion features used by the previously trained sub-model with their corresponding training weights added, which improves the target detection model's ability to recognize hard-to-distinguish areas.
  • after the t sub-models of the target detection model are trained, the server can test them on the same test set and record the accuracy of each of the t sub-models during testing.
  • the server sets a corresponding confidence level for each of the t sub-models according to its accuracy during testing.
  • in subsequent prediction with the target detection model, labels can be predicted based on the confidence levels corresponding to the t sub-models.
  • the above description takes training by the server as an example.
  • in other possible implementations, the above target detection model can also be trained by the terminal, or through interaction between the terminal and the server.
  • for example, the terminal collects training images for the target detection model and sends them to the server, and the server trains the target detection model.
  • the server collects a plurality of sample regions on the sample video frame, and the plurality of sample regions includes a positive sample region (region including the sample object) and a negative sample region (background region).
  • the server performs multi-band filtering on the sample area of the sample video frame through NSP (Non-Subsampled Pyramid) to obtain multiple sample frequency band submaps.
  • the server performs multi-directional filtering on any sample frequency band submap k1 through NSDFB (Non-Subsampled Directional Filter Banks) to obtain the multiple sample direction submaps m1,1-m1,8 corresponding to k1.
  • the server sorts the 8 sample direction submaps m1,1-m1,8 according to their energies and obtains the 8 sorted sample direction submaps pic1,1-pic1,8.
  • the server obtains the 8 fusion weights a1-a8 corresponding to the 8 sample direction submaps according to the energies of m1,1-m1,8, and, based on the 8 fusion weights a1-a8, performs weighted fusion of the sorted sample direction submaps pic1,1-pic1,8 to obtain the first sample direction fusion map SImg1.
  • the server integrates the 8 fusion weights a1-a8 into a fusion equivalent filter, and through the fusion equivalent filters, the multiple sample direction submaps of the other frequency band images can be quickly fused into the corresponding second sample direction fusion maps.
  • if the number of sample frequency band submaps is 3, the server can integrate the 8 fusion weights a1-a8 into a fusion equivalent filter hc2 and a fusion equivalent filter hc3.
  • the server processes the other two sample frequency band submaps through the fusion equivalent filters hc2 and hc3, obtaining the second sample direction fusion map SImg2 and the third sample direction fusion map SImg3 corresponding to the other two sample frequency band submaps.
  • the server obtains the first sample integral map SInt1 of the first sample direction fusion map SImg1, the second sample integral map SInt2 of the second sample direction fusion map SImg2, and the third sample integral map SInt3 of the third sample direction fusion map SImg3.
  • the server splices the first sample integral feature vector of the first sample integral map SInt1, the second sample integral feature vector of the second sample integral map SInt2, and the third sample integral feature vector of the third sample integral map SInt3, obtaining the sample direction band fusion feature Y of the sample area.
  • the server can input the sample direction band fusion feature Y into the target detection model, and output the predicted label of the sample area through the target detection model.
  • the server trains the target detection model based on the difference information between the predicted label and the label of the sample area. For the training process, refer to the description of step 1104, which is not repeated here.
  • the lower half of FIG. 12 is a flowchart of target detection using the target detection model, corresponding to steps 401 to 406.
  • the old position in FIG. 12 refers to the position of the target object in the second video frame, and the new position refers to the position of the target object in the first video frame.
  • the apparatus includes: a multi-band filtering module 1301 , a multi-directional filtering module 1302 , a feature acquisition module 1303 , and an input module 1304 .
  • the multi-band filtering module 1301 is configured to perform multi-band filtering on the first target area, where the first target area is the area to be detected in the first video frame, to obtain multiple frequency band submaps.
  • the multi-directional filtering module 1302 is configured to perform multi-directional filtering on multiple frequency band sub-maps to obtain multiple directional sub-maps.
  • the feature acquisition module 1303 is configured to acquire the directional frequency band fusion feature of the first target area according to the plurality of directional sub-maps.
  • the input module 1304 is used to input the directional frequency band fusion feature into the target detection model, and through the target detection model, predict based on the directional frequency band fusion feature to obtain a predicted label of the first target area, and the predicted label is used to indicate whether the first target area includes a target object.
  • the multi-band filtering module 1301 is configured to input the first target region into a band filter bank and perform multi-band filtering on the first target region through the multiple band filters in the bank to obtain the multiple frequency band submaps.
  • the multi-directional filtering module 1302 is configured to, for any frequency band submap among the multiple frequency band submaps, input the frequency band submap into a directional filter bank, perform multi-directional filtering on it through the multiple directional filters in the bank to obtain multiple directional features corresponding to the frequency band submap, input each directional feature into the corresponding reconstruction filter in the directional filter bank, and generate, through the reconstruction filter and based on the input directional features, multiple directional submaps of the frequency band submap.
  • the feature acquisition module 1303 is configured to acquire the energy of the multiple first directional submaps corresponding to any one of the multiple frequency band submaps, sort the multiple first directional submaps in descending order of energy to obtain multiple reference directional submaps, and obtain the directional band fusion feature of the first target region based on the multiple reference directional submaps and the multiple second directional submaps, the second directional submaps being the directional submaps corresponding to the other frequency band submaps among the multiple frequency band submaps.
  • the feature acquisition module 1303 is configured to perform weighted fusion of the multiple reference directional submaps based on their corresponding fusion weights to obtain the first directional fusion map corresponding to the frequency band submap, each fusion weight being positively correlated with the energy of the corresponding first directional submap; to fuse the multiple second directional submaps respectively based on the multiple fusion weights to obtain the multiple second directional fusion maps corresponding to the other frequency band submaps; and to obtain the directional frequency band fusion feature based on the first directional fusion map and the multiple second directional fusion maps.
  • the feature acquisition module 1303 is configured to acquire the first integral map corresponding to the first directional fusion map, acquire the multiple second integral maps corresponding to the multiple second directional fusion maps, and splice the first integral feature vector with the multiple second integral feature vectors to obtain the directional band fusion feature of the first target area, the first integral feature vector being the integral feature vector corresponding to the first integral map and each second integral feature vector being the feature vector corresponding to the respective second integral map.
  • the device further includes:
  • a first target area determination module, configured to determine a second target area, the second target area being the area in which the target object is located in a second video frame and the second video frame being a video frame whose display time is before the first video frame; and to offset the second target area based on the first video frame and the second video frame to obtain the first target area, the first target area being the area in the first video frame corresponding to the offset second target area.
  • the device further includes:
  • a display module configured to highlight the outline of the first target area in the first video frame in response to the predicted label indicating that the first target area includes a target object.
  • the training device of the target detection model is configured to perform multi-band filtering on the sample region to be detected in the sample video frame to obtain multiple sample frequency band submaps. Multi-directional filtering is performed on multiple sample band submaps to obtain multiple sample orientation submaps. According to a plurality of sample direction sub-maps, the sample direction frequency band fusion features of the sample area are obtained. The target detection model is trained based on the sample direction band fusion feature and the label of the sample area, and the label is used to indicate whether the sample area includes the sample object.
  • the training device for the target detection model is configured to input the sample direction frequency band fusion feature into the target detection model, and through the target detection model, perform prediction based on the sample direction frequency band fusion feature, and output the predicted label of the sample area. Based on the difference information between the predicted label and the label of the sample region, the model parameters of the object detection model are updated.
  • the terminal can perform band filtering and directional filtering on the first target area of the first video frame, obtaining multiple frequency band submaps representing the band information of the first target area and multiple directional submaps representing its direction information.
  • the directional frequency band fusion feature obtained by fusing the band information and the direction information can represent the characteristics of the first target area more completely; even if the target object moves slowly or rotates, the directional frequency band fusion feature still accurately represents the characteristics of the first target area.
  • when the terminal subsequently performs target detection through the target detection model based on the directional frequency band fusion feature, a more accurate detection result can be obtained.
  • An embodiment of the present application provides a computer device for executing the above method.
  • the computer device can be implemented as a terminal or a server.
  • the structure of the terminal is first introduced below:
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal 1400 can be a smartphone, a tablet computer, a notebook computer, or a desktop computer. The terminal 1400 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
  • the terminal 1400 includes: one or more processors 1401 and one or more memories 1402 .
  • the processor 1401 may include one or more processing cores, for example a 4-core or 8-core processor.
  • the processor 1401 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1401 may also include a main processor and a coprocessor: the main processor processes data in the wake-up state and is also called a CPU (Central Processing Unit); the coprocessor is a low-power processor that processes data in the standby state.
  • the processor 1401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen.
  • the processor 1401 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.
  • Memory 1402 may include one or more computer-readable storage media, which may be non-transitory. Memory 1402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices, flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 is used to store at least one computer program, and the at least one computer program is used to be executed by the processor 1401 to implement the methods provided by the method embodiments in this application. object detection method.
  • the structure shown in FIG. 14 does not constitute a limitation on the terminal 1400, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • the above computer equipment can also be implemented as a server, and the structure of the server is introduced below:
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1500 may vary greatly depending on configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1501 and one or more memories 1502, where the one or more memories 1502 store at least one computer program that is loaded and executed by the one or more processors 1501 to implement the methods provided by the foregoing method embodiments.
  • the server 1500 may also have components such as wired or wireless network interfaces, keyboards, and input/output interfaces for input and output, and the server 1500 may also include other components for implementing device functions, which will not be repeated here.
  • a computer-readable storage medium, for example a memory including a computer program, is also provided; the computer program can be executed by a processor to complete the target detection method in the foregoing embodiments.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • a computer program product or computer program is also provided, including program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes it, causing the computer device to perform the target detection method provided in the foregoing optional implementations.


Abstract

A target detection method is provided, the method including: (301) performing multi-band filtering on a first target region to obtain multiple frequency band submaps, the first target region being a region to be detected in a first video frame; (302) performing multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps; (303) obtaining a directional frequency band fusion feature of the first target region according to the multiple directional submaps; and (304) inputting the directional frequency band fusion feature into a target detection model and performing, through the target detection model, prediction based on the directional frequency band fusion feature to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.

Description

Target detection method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 202110033937.9, entitled "Target detection method, apparatus, device, and storage medium" and filed on January 12, 2021, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of image recognition, and in particular to a target detection method, apparatus, device, and storage medium.
Background
With the development of computer technology, more and more scenarios require target detection, for example, recognizing a moving target person in a surveillance video, or recognizing a moving game character in a game.
Summary
Embodiments of this application provide a target detection method, apparatus, device, and storage medium, which can improve the accuracy of target detection. The technical solutions are as follows:
In one aspect, a target detection method is provided, executed by a computer device, the method including:
performing multi-band filtering on a first target region to obtain multiple frequency band submaps, the first target region being a region to be detected in a first video frame;
performing multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps;
obtaining a directional frequency band fusion feature of the first target region according to the multiple directional submaps; and
inputting the directional frequency band fusion feature into a target detection model, and performing, through the target detection model, prediction based on the directional frequency band fusion feature to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
In one aspect, a target detection apparatus is provided, the apparatus including:
a multi-band filtering module, configured to perform multi-band filtering on a first target region to obtain multiple frequency band submaps, the first target region being a region to be detected in a first video frame;
a multi-directional filtering module, configured to perform multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps;
a feature acquisition module, configured to obtain a directional frequency band fusion feature of the first target region according to the multiple directional submaps; and
an input module, configured to input the directional frequency band fusion feature into a target detection model and perform, through the target detection model, prediction based on the directional frequency band fusion feature to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
In one aspect, a computer device is provided, including one or more processors and one or more memories, the one or more memories storing at least one computer program that is loaded and executed by the one or more processors to implement the target detection method.
In one aspect, a computer-readable storage medium is provided, storing at least one computer program that is loaded and executed by a processor to implement the target detection method.
In one aspect, a computer program product or computer program is provided, including program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes it, causing the computer device to perform the target detection method.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a target detection method according to an embodiment of this application;
FIG. 2 is a schematic diagram of annotating a game character in a game scene according to an embodiment of this application;
FIG. 3 is a flowchart of a target detection method according to an embodiment of this application;
FIG. 4 is a flowchart of a target detection method according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a band filter according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a band filter according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a directional filter according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a directional filter according to an embodiment of this application;
FIG. 9 is a schematic diagram of an interface according to an embodiment of this application;
FIG. 10 is a flowchart of a target detection method according to an embodiment of this application;
FIG. 11 is a flowchart of a training method for a target detection model according to an embodiment of this application;
FIG. 12 is a flowchart of a target detection method according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a target detection apparatus according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a terminal according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of a server according to an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
In this application, the terms "first", "second", and the like are used to distinguish identical or similar items whose roles and functions are basically the same. It should be understood that "first", "second", and "nth" have no logical or temporal dependency between them and do not limit quantity or execution order.
In the related art, a computer device uses a frame difference method to determine the position of a moving target, that is, it computes the difference between two or more consecutive frames to obtain a difference image. Because the difference of the background pixel values is small or zero while the difference of the moving target's pixel values is large, the computer device binarizes the difference image to detect the moving target. However, when the moving target moves slowly or rotates, missed detections easily occur; overlapping parts are also hard to detect, and holes appear, so the detection accuracy is not high.
In this application, "at least one" means one or more, and "multiple" means two or more.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, AI is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.
AI technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge submodels to keep improving their performance. ML is the core of AI and the fundamental way to make computers intelligent, and its applications cover all fields of AI. ML and deep learning usually include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Normalization: a method of mapping sequences with different value ranges onto the interval (0, 1), which facilitates data processing. In some cases, normalized values can be directly treated as probabilities. Functions that can implement normalization include the softmax function and the sigmoid function, as well as other functions capable of normalization, which is not limited in the embodiments of this application.
FIG. 1 is a schematic diagram of an implementation environment of a target detection method according to an embodiment of this application. Referring to FIG. 1, the implementation environment may include a terminal 110 and a server 140.
The terminal 110 is connected to the server 140 through a wireless or wired network. Optionally, the terminal 110 is a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart watch, or the like, but is not limited thereto. The terminal 110 installs and runs an application supporting image display.
Optionally, the server is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and AI platforms.
Optionally, the terminal 110 generally refers to one of multiple terminals; the embodiments of this application use only the terminal 110 as an example.
A person skilled in the art may know that there may be more or fewer terminals. For example, there may be only one terminal, or dozens or hundreds of terminals, or more, in which case the implementation environment also includes other terminals. The embodiments of this application do not limit the number or device types of the terminals.
After introducing the implementation environment of the target detection method provided in the embodiments of this application, the following introduces its application scenarios.
1. The target detection method provided in the embodiments of this application can be applied to game target detection; that is, during a game, game targets in the game scene can be detected and tracked through this method.
Taking a card game as an example, a player can release game characters corresponding to different cards in the game scene, and these characters can move or perform attack actions in the scene. While a game character moves in the game scene, the terminal can detect and track it through the target detection method provided in the embodiments of this application, acquire its coordinate information in real time, and record its motion trajectory, so that technicians can analyze and test the game and discover bugs in time. For an example of the terminal highlighting a game character, see FIG. 2, which shows an annotation box 201 around one of our game characters.
Taking a MOBA (Multiplayer Online Battle Arena) game as an example, players control different game characters in the game scene, that is, control the characters to move or release virtual skills. During the game, the terminal can detect and track the game characters through the target detection method provided in the embodiments of this application, acquire their coordinate information in real time, and record their motion trajectories. The trajectories not only allow technicians to analyze and test the game and discover bugs in time, but also allow them to generate a game record of a single match, which stores at least the motion trajectories of the game characters in that match. In this case, a player can quickly review his or her performance in the match through the game record without watching the complete recorded video, so the efficiency of human-computer interaction is high.
Taking a turn-based game as an example, the two players control game characters in the game scene to attack each other in turns; that is, the character controlled by one player can launch an attack only after the character controlled by the other player finishes attacking. During the game, the terminal can detect and track the characters controlled by both players through the target detection method provided in the embodiments of this application, acquire their coordinate information in real time, and record their motion trajectories. The trajectories allow technicians to analyze and test the game and eliminate anomalies, such as a character disappearing abnormally or failing to disappear after dying; based on the trajectories, technicians can quickly find such anomalies and fix them in time.
2. The target detection method provided in the embodiments of this application can be applied to person detection and tracking; that is, when technicians need to track a target person using a surveillance video, the terminal can apply the method to the surveillance video to detect and track the target person in it. In some embodiments, after applying the method, the terminal can highlight the target person while playing the surveillance video, which is convenient for technicians to observe and record.
3. The target detection method provided in the embodiments of this application can be applied to vehicle detection and tracking; that is, when technicians need to track a target vehicle using an aerial video, the terminal can apply the method to the aerial video to detect and track the target vehicle in it. In some embodiments, after applying the method, the terminal can highlight the target vehicle in the aerial video, which is convenient for vehicle tracking.
It should be noted that the above scenario descriptions are merely for ease of understanding. In other possible implementations, the target detection method provided in the embodiments of this application can also be applied to other target detection scenarios, such as animal detection or aircraft detection, which is not limited in the embodiments of this application.
In the embodiments of this application, the technical solutions may be implemented by a server or a terminal as the execution subject, or through interaction between a terminal and a server; that is, the server processes the data in the background and sends the processed data to the terminal, and the terminal presents the processing result to the user, which is not limited in the embodiments of this application. The following takes the terminal as the execution subject as an example.
FIG. 3 is a flowchart of a target detection method according to an embodiment of this application. Referring to FIG. 3, the method includes:
301. The terminal performs multi-band filtering on a first target region to obtain multiple frequency band submaps, the first target region being a region to be detected in a first video frame.
Band filtering means filtering the first target region with a specific frequency range; correspondingly, multi-band filtering means filtering the first target region with multiple frequency ranges. For example, given the frequency range (500, 800), filtering the first target region with this range keeps the part of the first target region whose frequencies lie within (500, 800) and removes the part outside it. Through multiple band filters, that is, multiple frequency ranges, the terminal decomposes the first target region into multiple frequency band submaps, each frequency band submap being the filtered image whose frequencies lie within the corresponding range. When the terminal filters the first target region with the frequency range (500, 800), the filtered image is a frequency band submap whose frequencies lie within (500, 800); when it filters with the range (200, 300), the filtered image is a frequency band submap whose frequencies lie within (200, 300). In some embodiments, different frequency band submaps record different image features.
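As a rough illustration of band filtering, the sketch below decomposes a region in the frequency domain into two frequency band submaps. The radial-frequency band mask is an illustrative assumption; the band filters actually used in the embodiments are the convolution kernels introduced in step 402.

```python
import numpy as np

def band_filter(region: np.ndarray, f_lo: float, f_hi: float) -> np.ndarray:
    # Keep only components whose radial frequency lies within (f_lo, f_hi).
    spectrum = np.fft.fftshift(np.fft.fft2(region))
    h, w = region.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)   # distance from the DC component
    mask = (radius > f_lo) & (radius < f_hi)    # pass band of this filter
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

region = np.random.default_rng(2).standard_normal((128, 128))
sub_a = band_filter(region, 0, 16)    # one frequency band submap
sub_b = band_filter(region, 16, 48)   # another frequency band submap
print(sub_a.shape, sub_b.shape)       # both keep the size of the input region
```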
302. The terminal performs multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps.
Directional filtering means filtering a frequency band submap along specific directions. Taking four directions as an example, if due west is marked as 0°, then due north is 90°, due east is 180°, and due south is 270°; 0° to 90° is the first direction, 90° to 180° the second, 180° to 270° the third, and 270° to 360° (0°) the fourth. Multi-directional filtering means filtering a frequency band submap along multiple directions, that is, decomposing the frequency band submap into multiple directional submaps. Taking one frequency band submap as an example, with the above four directions, after the terminal performs directional filtering, the submap is decomposed into directional submaps of the four directions; correspondingly, N frequency band submaps are decomposed into 4N directional submaps, where N is a positive integer.
303. The terminal obtains a directional frequency band fusion feature of the first target region according to the multiple directional submaps.
Because the multiple directional submaps are the directional submaps of the multiple frequency band submaps, a directional frequency band fusion feature fusing direction information and frequency information can be obtained from them.
304. The terminal inputs the directional frequency band fusion feature into a target detection model and performs, through the target detection model, prediction based on the directional frequency band fusion feature to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
The target detection model is a model trained based on sample video frames and their corresponding labels, and has the ability to recognize the target object in video frames. In some embodiments, the target detection model is Faster R-CNN (Faster Region-based Convolutional Neural Networks), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), a decision tree model, Adaboost, or the like, which is not limited in the embodiments of this application.
Through the technical solutions provided in the embodiments of this application, the terminal can perform band filtering and directional filtering on the first target region of the first video frame, obtaining multiple frequency band submaps representing the band information of the first target region and multiple directional submaps representing its direction information. The directional frequency band fusion feature obtained by fusing the band information and the direction information can represent the characteristics of the first target region more completely; even if the target object moves slowly or rotates, the directional frequency band fusion feature still accurately represents the characteristics of the first target region. When the terminal subsequently performs target detection through the target detection model based on the directional frequency band fusion feature, a more accurate detection result can be obtained.
FIG. 4 is a flowchart of a target detection method according to an embodiment of this application. Referring to FIG. 4, the method includes:
401. The terminal determines a first target region from a first video frame, the first target region being a region to be detected in the first video frame.
The first video frame is a video frame in a video containing the target object, and the first target region is a region in which the target object may exist; the target object is the target of the target detection method. In a game scene, the target object is a game character; in person detection and tracking, it is a target person; in vehicle detection and tracking, it is a target vehicle.
In a possible implementation, the terminal determines a second target region, the second target region being the region in which the target object is located in a second video frame, and the second video frame being a video frame whose display time is before the first video frame. Based on the first video frame and the second video frame, the terminal offsets the second target region to obtain the first target region, the first target region being the region in the first video frame corresponding to the offset second target region.
In this implementation, because video frames are continuous and the first and second video frames are adjacent frames of the same video, the positions of the target object in adjacent frames are usually close. After determining the second target region of the target object in the second video frame, the terminal offsets the second target region and maps the offset region to the first video frame, obtaining the first target region in the first video frame, that is, a region that may contain the target object.
The above implementation is described below through two examples.
1. Take the target object being a game character and the display time of the second video frame being 1 s earlier than that of the first video frame as an example. When the terminal determines that the center coordinates of the game character in the second video frame are (10, 15), it can determine the second target region centered at (10, 15). When the second target region is a square with side length 4, its lower-left corner at (8, 13) and upper-right corner at (12, 17), the terminal can uniquely determine the second target region through these two corner coordinates. Based on the first and second video frames, the terminal offsets the second target region, obtaining, for example, the offset center coordinates (12, 15); correspondingly, the lower-left corner of the offset region is (10, 13) and the upper-right corner is (14, 17). The terminal maps the offset second target region to the first video frame, obtaining the first target region in the first video frame. In some embodiments, there are multiple first target regions; that is, the terminal can offset the second target region multiple times, obtain multiple offset second target regions, and map them to the first video frame to obtain multiple first target regions, based on which it subsequently performs target detection, that is, game character detection in the game scene. It should be noted that, besides inferring the second target region from the game character's center coordinates in the second video frame, the terminal can also determine the second target region directly in the second video frame; that is, after determining the second target region in which the game character is located, the terminal can store the region's coordinates in a cache and later read them directly from the cache, without back-calculation, which reduces the terminal's computation and improves its computing efficiency.
2. Take the target object being a target vehicle and the display time of the second video frame being 0.5 s earlier than that of the first video frame as an example. When the terminal determines that the center coordinates of the target vehicle in the second video frame are (0, 2), it can determine the second target region centered at (0, 2). When the second target region is a rectangle of length 4 and width 2, its lower-left corner is (-2, 1) and its upper-right corner is (2, 3), and the terminal can uniquely determine the region through these two corners. Based on the first and second video frames, the terminal offsets the second target region, obtaining, for example, the offset center coordinates (2, 2); correspondingly, the lower-left corner of the offset region is (0, 1) and the upper-right corner is (4, 3). The terminal maps the offset second target region to the first video frame, obtaining the first target region in the first video frame. In some embodiments, there are multiple first target regions; that is, the terminal can offset the second target region multiple times, obtain multiple offset second target regions, and map them to the first video frame to obtain multiple first target regions, based on which it subsequently performs target detection, that is, detection of the target vehicle. It should be noted that, besides inferring the second target region from the target vehicle's center coordinates in the second video frame, the terminal can also determine the second target region directly in the second video frame; that is, after determining the second target region in which the target vehicle is located, the terminal can store the region's coordinates in a cache and later read them directly from the cache, without back-calculation, which reduces the terminal's computation and improves its computing efficiency.
Optionally, on the basis of the foregoing implementation, the method by which the terminal offsets the second target region is described below.
In a possible implementation, the terminal determines the offset distance for the second target region based on the display time difference between the first video frame and the second video frame, the offset distance being proportional to the display time difference; the display time difference is the time difference between displaying the first video frame and displaying the second video frame when the video is played.
In this implementation, the larger the display time difference between the two frames, the larger the possible change of the target object's position in the frames; the terminal can therefore determine the offset distance for the second target region according to the display time difference, increasing the probability that the first target region contains the target object and thus improving the efficiency of target detection.
For example, the terminal uses an inverse proportional function to determine the offset distance from the display time difference; for instance, the terminal can determine the offset distance for the second target region through formula (1):
y = K / x    (1)
where y is the offset distance for the second target region, x is the display time difference between the first video frame and the second video frame, x > 0, and K is an inverse-proportionality constant set by technicians according to the actual situation, K ≠ 0.
It should be noted that, besides the inverse proportional function, the terminal can also determine the offset distance for the second target region from the display time difference through other functions, which is not limited in the embodiments of this application.
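For instance, with a hypothetical constant K = 2, formula (1) gives an offset distance of 4 for a display time difference of 0.5 s and an offset distance of 2 for a difference of 1 s; a one-function sketch:

```python
def offset_distance(time_diff: float, K: float = 2.0) -> float:
    # Offset distance y = K / x from formula (1); K is a hypothetical constant
    # that technicians would set according to the actual situation.
    assert time_diff > 0 and K != 0
    return K / time_diff

print(offset_distance(0.5), offset_distance(1.0))  # -> 4.0 2.0
```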
Optionally, on the basis of the foregoing implementation, when the first video frame is the first frame of the video, the terminal can perform template matching on the first video frame using a template image of the target object to determine the first target region from the first video frame.
In this implementation, because the first video frame is the first frame of the video, there is no second video frame before it; the terminal can use the template image of the target object to perform template matching on the first video frame and obtain the first target region to be detected. Because template matching is efficient, determining the first target region is also efficient.
For example, the terminal slides the template image of the target object over the first video frame, obtains the similarity between the template image and multiple covered regions of the first video frame, and determines the region with the highest similarity as the first target region to be detected.
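A minimal sketch of this sliding template matching with OpenCV follows; the frame and template arrays are placeholders. cv2.matchTemplate slides the template over the frame and scores each covered region, and cv2.minMaxLoc picks the most similar one.

```python
import cv2
import numpy as np

frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # first video frame
template = frame[100:140, 200:260].copy()                      # target object template

scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

h, w = template.shape
x, y = max_loc                                 # top-left corner of the best match
first_target_region = (x, y, x + w, y + h)     # region with the highest similarity
print(first_target_region, max_val)
```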
402. The terminal performs multi-band filtering on the first target region to be detected in the first video frame to obtain multiple frequency band submaps.
In a possible implementation, the terminal inputs the first target region into a band filter bank and performs multi-band filtering on the first target region through the multiple band filters in the bank, obtaining multiple frequency band submaps. In some embodiments, the band filters are non-subsampled pyramid filters.
In this implementation, because non-subsampled pyramid filters are used when filtering the first target region and such filters have no downsampling step, the output frequency band submaps have the same size as the first target region. In this way, the post-downsampling image enlargement and image registration steps are removed, which improves the accuracy of subsequent target detection. Downsampling is the process of reducing image resolution. For example, given a reference image with resolution 512×512, downsampling it means reducing its resolution, that is, extracting some of its pixels to obtain a new reference image. In some embodiments, the terminal extracts the pixels of the odd rows and odd columns of the 512×512 reference image and recombines them, obtaining a new reference image whose resolution becomes 256×256.
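The 512×512 to 256×256 example corresponds to simple index slicing; a NumPy sketch:

```python
import numpy as np

reference = np.zeros((512, 512))     # reference image with resolution 512x512
downsampled = reference[::2, ::2]    # extract every second row and column
print(downsampled.shape)             # -> (256, 256)
```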
The above implementation is described below through two examples.
Example 1. The terminal performs time-frequency transformation on the first video frame to obtain the first frequency-domain image of the first video frame. The terminal inputs the first target region of the first frequency-domain image into a band filter bank, which includes multiple band filters, each corresponding to a different frequency range. The terminal filters the first target region through the multiple band filters in the bank, and the filters output multiple frequency band submaps, each corresponding to one band filter; because different band filters correspond to different frequency ranges, each frequency band submap is a frequency-domain image within a different frequency range. Referring to FIG. 5, a schematic of a band filter bank 501 is provided; the bank includes band filters 5011, 5012, and 5013, which correspond to different frequency ranges. Through the three band filters, the terminal can decompose the first target region into frequency band submaps A, B, and C, each corresponding to a different frequency range.
On the basis of Example 1, the method by which the terminal performs multi-band filtering on the first target region through the band filter bank to obtain multiple frequency band submaps is further described below.
Continuing with FIG. 5, the terminal performs a fast Fourier transform on the first video frame to obtain the first frequency-domain image of the first video frame. The terminal inputs the first target region of the first video frame [example matrix in image PCTCN2022070095-appb-000001, not reproduced] into the band filter bank, that is, inputs the first target region of the first frequency-domain image into the band filter bank. The band filter bank includes the three band filters 5011, 5012, and 5013, whose filter matrices are given in the images appb-000002, appb-000003, and appb-000004 (not reproduced). The terminal pads the edges of the first target region [appb-000005] to obtain the padded image of the first target region [appb-000006]. The terminal then applies the filter matrices of the three band filters [appb-000007, appb-000008, and appb-000009] to the padded image of the first target region for band filtering, obtaining three frequency band submaps [appb-000010, appb-000011].
Example 2. The terminal performs time-frequency transformation on the first video frame to obtain the first frequency-domain image of the first video frame. The terminal inputs the first target region of the first frequency-domain image into a band filter bank that includes multiple band filters at different levels and performs multi-level filtering on the first target region through them, obtaining multiple frequency band submaps. Referring to FIG. 6, another band filter bank 600 is provided; it includes a first low-pass filter 601 and a first high-pass filter 602. A second low-pass filter 603 and a second high-pass filter 604 are connected after the first low-pass filter 601. In some embodiments, a third low-pass filter 605 and a third high-pass filter 606 are connected after the second low-pass filter 603, and so on. The first, second, and third low-pass filters 601, 603, and 605 form three levels of low-pass filters, and the first, second, and third high-pass filters 602, 604, and 606 form three levels of high-pass filters. Through the multi-level band filters shown in FIG. 6, the terminal can decompose the first target region into frequency band submaps D, E, F, and G.
On the basis of Example 2, the method by which the terminal performs multi-band filtering on the first target region through the band filter bank to obtain multiple frequency band submaps is further described below.
For example, the terminal performs a fast Fourier transform on the first video frame to obtain the first frequency-domain image, and inputs the first target region of the first frequency-domain image [matrix image appb-000012, not reproduced] into the band filter bank. Referring to FIG. 6, the bank includes six band filters: the first low-pass filter 601, the second low-pass filter 603, the third low-pass filter 605, the first high-pass filter 602, the second high-pass filter 604, and the third high-pass filter 606. The terminal pads the edges of the first target region [appb-000013] to obtain its padded image [appb-000014]. The terminal applies the first low-pass filter [appb-000015] and the first high-pass filter [appb-000016] to the padded image for band filtering, obtaining the first low-pass submap [appb-000017] and the first high-pass submap [appb-000018]. The terminal pads the edges of the first low-pass submap [appb-000019] to obtain its padded image [appb-000020], and inputs this padded image [appb-000021] into the second low-pass filter [appb-000022] and the second high-pass filter [appb-000023], obtaining the second low-pass submap [appb-000024] and the second high-pass submap [appb-000025]. The terminal pads the edges of the second low-pass submap [appb-000026] to obtain its padded image [appb-000027], and inputs it into the third low-pass filter [appb-000028] and the third high-pass filter [appb-000029], obtaining the third low-pass submap [appb-000030] and the third high-pass submap [appb-000031].
The third low-pass submap, the third high-pass submap, the second high-pass submap, and the first high-pass submap correspond to frequency band submaps D, E, F, and G, respectively. Experiments show that the low-pass submap contributes little to model classification, so the terminal can ignore frequency band submap D, that is, the third low-pass submap, in subsequent processing to reduce computation.
403. The terminal performs multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps.
In a possible implementation, for any one of the multiple frequency band submaps, the terminal inputs the frequency band submap into a directional filter bank and performs multi-directional filtering on it through the multiple directional filters in the bank, obtaining multiple directional features corresponding to the frequency band submap. The terminal inputs each directional feature into the corresponding reconstruction filter in the directional filter bank and, through the reconstruction filter, generates multiple directional submaps of the frequency band submap based on the input directional features. In some embodiments, both the directional filters and the reconstruction filters are non-subsampled filters.
In this implementation, after dividing the first target region into multiple frequency band submaps, the terminal can further divide each frequency band submap by direction, obtaining multiple directional submaps of each frequency band submap; the multiple directional submaps corresponding to the multiple frequency band submaps can then fully express the characteristics of the first target region.
The above implementation is described below by taking the terminal's four-directional filtering of one frequency band submap as an example.
For clearer description, the structure of the directional filter bank is introduced first. Referring to FIG. 7, a schematic structural diagram of a directional filter bank 701 is provided; the bank includes a directional filter 702 and a reconstruction filter 703. For the directional filter, performing four-directional filtering on a frequency band submap requires a two-level filtering structure in the directional filter 702. Referring to FIG. 8, the directional filter 702 contains the first-level wedge filters 801 and 802; in the wedge filters, the white direction is the direction the filter allows to pass and the gray direction is the direction it blocks. After the first-level wedge filters 801 and 802, two second-level square filters are connected to each: square filters 8011 and 8012 after wedge filter 801, and square filters 8021 and 8022 after wedge filter 802; in the square filters, the white direction is allowed to pass and the gray direction is blocked. Through the cooperation of the wedge filters and the square filters, the terminal can decompose the first frequency band submap into directional submaps of four directions. The reconstruction filter 703 has a structure symmetric to the directional filter 702; that is, its input is the four directional submaps and its output is one synthesized frequency band submap.
After introducing the structure of the directional filter bank, the method by which the terminal decomposes one frequency band submap into four directional submaps is described below with reference to that structure.
Given a frequency band submap [matrix image appb-000032, not reproduced], the terminal inputs the frequency band submap [appb-000033] into the directional filter bank 701 and performs directional filtering on it [appb-000034] through the directional filter 702 in the bank. Referring to FIG. 8, the terminal inputs the frequency band submap [appb-000035] into the wedge filter 801 and performs directional filtering through the wedge filter matrix of wedge filter 801 [appb-000036], that is, multiplies the wedge filter matrix element-wise with the values at the corresponding positions of the frequency band submap [appb-000037], obtaining a first-level directional feature [appb-000038]. The terminal inputs this first-level directional feature [appb-000039] into the square filter 8011 that follows wedge filter 801 and performs directional filtering through the square filter matrix of square filter 8011 [appb-000040], that is, multiplies the square filter matrix element-wise with the values at the corresponding positions of the first-level directional feature [appb-000041], obtaining a second-level directional feature [appb-000042]. The terminal inputs the second-level directional feature into the reconstruction filter 703 and, through the reconstruction filter, generates one directional submap of the frequency band submap based on the input second-level directional feature [appb-000043], for example [appb-000044].
The terminal also inputs the first-level directional feature [appb-000045] into the square filter 8012 that follows wedge filter 801 and performs directional filtering through the square filter matrix of square filter 8012 [appb-000046], that is, multiplies the square filter matrix element-wise with the values at the corresponding positions of the first-level directional feature [appb-000047], obtaining another second-level directional feature [appb-000048]. The terminal inputs this second-level directional feature [appb-000049] into the reconstruction filter 703 and, through the reconstruction filter, generates another directional submap of the frequency band submap based on the input second-level directional feature [appb-000050], for example [appb-000051].
In addition, the processing through wedge filter 802 belongs to the same inventive concept as the processing through wedge filter 801, and the processing through square filters 8021 and 8022 belongs to the same inventive concept as that through square filters 8011 and 8012, so details are not repeated here. Through the above processing, the terminal decomposes one frequency band submap into four directional submaps. For multiple frequency band submaps, each is decomposed into 4 directional submaps; with N frequency band submaps, 4N directional submaps are obtained, where N is a positive integer.
It should be noted that the above example describes the terminal dividing each frequency band submap into 4 directional submaps. In other possible implementations, the terminal can also divide each frequency band submap into 8 or more directional submaps, simply by adding filters of different shapes after the square filters 8011, 8012, 8021, and 8022; the method of dividing into 8 directional submaps belongs to the same inventive concept as the above method of dividing into 4, and is not repeated here.
404. The terminal obtains the directional frequency band fusion feature of the first target region according to the multiple directional submaps.
In a possible implementation, the terminal acquires the energy of the multiple first directional submaps corresponding to any one of the multiple frequency band submaps. The terminal sorts the multiple first directional submaps in descending order of energy, obtaining multiple reference directional submaps, which are the sorted first directional submaps. Based on the multiple reference directional submaps and the multiple second directional submaps, the terminal obtains the directional frequency band fusion feature of the first target region; the second directional submaps are the directional submaps corresponding to the frequency band submaps other than this frequency band submap among the multiple frequency band submaps.
In this implementation, the terminal only needs to obtain and sort the multiple directional submaps corresponding to any one of the frequency band submaps to obtain the directional frequency band fusion feature of the first target region, without processing each band feature one by one, which reduces the terminal's computation.
For clearer description, the above implementation is described below in several parts.
Part one describes the method by which the terminal acquires the energy of the multiple first directional submaps corresponding to any frequency band submap. It should be noted that the following description takes the energy of a directional submap as the sum of squares of its values as an example; in other possible implementations, the energy can also be expressed in other forms, for example using the 2-norm of the values of the corresponding directional submap, which is not limited in the embodiments of this application.
Taking one frequency band submap corresponding to four directional submaps as an example, the terminal determines any frequency band submap from the multiple frequency band submaps, for example [matrix image appb-000052, not reproduced]. The terminal acquires the four first directional submaps corresponding to this frequency band submap [appb-000053, appb-000054, appb-000055] and then acquires the energies of the four first directional submaps [appb-000056, appb-000057]. In some embodiments, the energy is the sum of squares of the values in each first directional submap; in this case, the energy of the first directional submap [appb-000058] is 1, the energy of [appb-000059] is 2, the energy of [appb-000060] is 7, and the energy of [appb-000061] is 8.
Part two describes the method by which the terminal sorts the multiple first directional submaps in descending order of energy to obtain the multiple reference directional submaps.
Taking one frequency band submap corresponding to four directional submaps as an example, the terminal sorts the four first directional submaps [matrix images appb-000062, appb-000063, appb-000064, not reproduced], obtaining the four sorted first directional submaps [appb-000065, appb-000066]. The four sorted first directional submaps are the four reference directional submaps.
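Parts one and two can be sketched together in a few lines of Python/NumPy. The four 1×2 directional submaps below are hypothetical values chosen so that their energies are 1, 2, 7, and 8 as in the example; the final weight normalization anticipates part three.

```python
import numpy as np

def energy(submap: np.ndarray) -> float:
    # Energy of a directional submap: the sum of squares of its values.
    return float(np.sum(submap ** 2))

submaps = [np.array([[1.0, 0.0]]), np.array([[1.0, 1.0]]),
           np.array([[2.0, np.sqrt(3.0)]]), np.array([[2.0, 2.0]])]

energies = np.array([energy(s) for s in submaps])    # [1. 2. 7. 8.]
order = np.argsort(energies)[::-1]                   # descending order of energy
reference_submaps = [submaps[i] for i in order]      # the reference directional submaps
weights = energies[order] / energies.sum()           # normalized fusion weights
print(np.round(weights, 2))                          # -> [0.44 0.39 0.11 0.06]
```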
Part three describes the method by which the terminal obtains the directional frequency band fusion feature of the first target region based on the multiple reference directional submaps and the multiple second directional submaps. For clarity, the relationship among this frequency band submap, the multiple frequency band submaps, and the other frequency band submaps is explained first: this frequency band submap is any one of the multiple frequency band submaps, and the other frequency band submaps are the frequency band submaps other than this one.
In a possible implementation, based on the multiple fusion weights corresponding to the multiple reference directional submaps, the terminal performs weighted fusion of the multiple reference directional submaps to obtain the first directional fusion map corresponding to this frequency band submap; each fusion weight is positively correlated with the energy of the corresponding reference directional submap, that is, with the energy of the corresponding first directional submap. Based on the multiple fusion weights, the terminal fuses the multiple second directional submaps respectively, obtaining the multiple second directional fusion maps corresponding to the other frequency band submaps. Based on the first directional fusion map and the multiple second directional fusion maps, the terminal obtains the directional frequency band fusion feature.
The terminal can obtain the directional frequency band fusion feature based on the first directional fusion map and the multiple second directional fusion maps through the following steps:
In a possible implementation, the terminal acquires the first integral map corresponding to the first directional fusion map; the terminal acquires the multiple second integral maps corresponding to the multiple second directional fusion maps; the terminal splices the first integral feature vector with the multiple second integral feature vectors to obtain the directional frequency band fusion feature of the first target region, the first integral feature vector being the integral feature vector corresponding to the first integral map and each second integral feature vector being the feature vector corresponding to the respective second integral map.
The meanings of the integral map and the integral feature vector involved above are explained below. Because an image consists of a series of discrete pixels, integrating an image is in fact summation: the value of each point in an integral map is the sum of all pixel values above and to the left of that point in the original image. In the embodiments of this application, the value of each point in the first integral map is the sum of all values above and to the left of that point in the first directional fusion map, and the value of each point in a second integral map is the sum of all values above and to the left of that point in the corresponding second directional fusion map. An integral feature vector is the vector obtained by arranging the values of an integral map from left to right and top to bottom; the first integral feature vector is obtained by arranging the values of the first integral map in this order, and a second integral feature vector by arranging the values of the corresponding second integral map in this order.
Taking one frequency band submap corresponding to four directional submaps as an example, the terminal determines, according to the energies of the four reference directional submaps of the frequency band submap [matrix images appb-000067, appb-000068, appb-000069, not reproduced], the fusion weights corresponding to the four reference directional submaps [appb-000070, appb-000071]. In one embodiment, the energy of the first reference directional submap [appb-000072] is 8, the energy of the second [appb-000073] is 7, the energy of the third [appb-000074] is 2, and the energy of the fourth [appb-000075] is 1. The terminal normalizes the energies (8, 7, 2, 1) of the four reference directional submaps [appb-000076, appb-000077, appb-000078], obtaining the fusion weights (0.44, 0.39, 0.11, 0.06) corresponding to the four reference directional submaps [appb-000079, appb-000080]. Based on the fusion weights (0.44, 0.39, 0.11, 0.06), the terminal performs weighted fusion of the four reference directional submaps [appb-000081, appb-000082], obtaining the first directional fusion map corresponding to this frequency band submap [appb-000083].
When the number of frequency band submaps is two and the other frequency band submap is [appb-000084], the terminal can perform directional filtering on the other frequency band submap [appb-000085] to obtain its four second directional submaps [appb-000086, appb-000087, appb-000088, appb-000089]. Based on the previously determined fusion weights (0.44, 0.39, 0.11, 0.06), the terminal performs weighted fusion of the four second directional submaps [appb-000090, appb-000091, appb-000092], obtaining the second directional fusion map [appb-000093].
The terminal acquires the first integral map [appb-000095] corresponding to the first directional fusion map [appb-000094]. The terminal acquires the multiple second integral maps corresponding to the multiple second directional fusion maps; in this example, that is, the second integral map [appb-000097] corresponding to the second directional fusion map [appb-000096]. The terminal splices the first integral feature vector corresponding to the first integral map [appb-000098] with the multiple second integral feature vectors corresponding to the multiple second integral maps to obtain the directional frequency band fusion feature of the first target region; in this example, the first integral feature vector corresponding to the first integral map [appb-000099], for example (1, 2, 3, 4), is spliced with the second integral feature vector corresponding to the second integral map [appb-000100], for example (2, 3, 4, 5), obtaining the directional frequency band fusion feature (1, 2, 3, 4, 2, 3, 4, 5) of the first target region.
405. The terminal inputs the directional frequency band fusion feature into the target detection model and performs, through the target detection model, prediction based on the directional frequency band fusion feature to obtain the predicted label of the first target region, the predicted label indicating whether the first target region includes the target object.
For the training method of the target detection model, see the description of FIG. 11 below.
In a possible implementation, the target detection model includes multiple sub-models that are independent of each other. Inputting the directional frequency band fusion feature into the target detection model means inputting it into the multiple sub-models respectively. The terminal performs prediction through the multiple sub-models based on the directional frequency band fusion feature, and the sub-models output multiple corresponding prediction parameters, which are used to determine the corresponding predicted labels. Based on the confidence levels corresponding to the multiple sub-models, the terminal fuses the multiple prediction parameters to obtain the predicted label of the first target region; the confidence level is positively correlated with the prediction accuracy of the corresponding sub-model during testing. In some embodiments, the target detection model is also called an Adaboost model.
In this implementation, the terminal can predict independently through the multiple sub-models of the target detection model and fuse their prediction results based on their confidence levels to obtain the final predicted label. This exploits the prediction ability of multiple sub-models and avoids the problem that a prediction error of one sub-model corrupts the overall predicted label; that is, it avoids overfitting the model and improves the prediction ability of the target detection model.
For example, the target detection model includes 3 independently trained sub-models. The terminal inputs the directional frequency band fusion feature (1, 2, 3, 4, 2, 3, 4, 5) of the first target region into the 3 sub-models respectively, and the 3 sub-models fully connect the feature through their weight matrices [matrix images appb-000101 and appb-000102, not reproduced], obtaining the 3 prediction parameters (7, 8), (8, 9), and (8, 12) corresponding to the 3 sub-models. The terminal normalizes the 3 prediction parameters (7, 8), (8, 9), and (8, 12) through the 3 sub-models respectively, obtaining the 3 probability vectors (0.46, 0.54), (0.47, 0.53), and (0.4, 0.6) corresponding to the 3 sub-models. According to the 3 confidence levels 0.1, 0.2, and 0.7 corresponding to the 3 sub-models, the terminal performs weighted fusion of the 3 probability vectors, obtaining the fusion probability vector (0.42, 0.58), where 0.42 means the probability that the first target region includes the target object is 42% and 0.58 means the probability that it does not is 58%. If the terminal uses 0 as the predicted label indicating that the first target region includes the target object and 1 as the predicted label indicating that it does not, the terminal can set the predicted label of the first target region to 1.
It should be noted that the above description takes the Adaboost model as the target detection model as an example. In other possible implementations, the target detection model may also be a model with another structure, such as a decision tree model or a convolutional network model, which is not limited in the embodiments of this application.
406. In response to the predicted label indicating that the first target region includes the target object, the terminal highlights the outline of the first target region in the first video frame.
Referring to FIG. 9, the terminal can display the outline 902 of the first target region in the first video frame 901, which helps technicians quickly determine the position of the target object in the first video frame 901.
All the above optional technical solutions can be combined in any manner to form optional embodiments of this application, and details are not repeated here.
The technical solutions provided in the embodiments of this application are further described below with reference to FIG. 10 and the possible implementations in steps 401-406.
Referring to FIG. 10, the terminal performs multi-band filtering on the first target region of the first video frame through NSP (Non-Subsampled Pyramid) to obtain multiple frequency band submaps. The terminal performs multi-directional filtering on any frequency band submap s1 through NSDFB (Non-Subsampled Directional Filter Banks) to obtain the multiple directional submaps f1,j corresponding to s1, where 1 denotes the frequency band submap s1 and j is the number of directional submaps; in some embodiments, j = 8. The terminal sorts the 8 directional submaps f1,1-f1,8 according to their energies, obtaining the 8 sorted directional submaps pic1,1-pic1,8, that is, the 8 reference directional submaps. According to the energies of f1,1-f1,8, the terminal obtains the 8 fusion weights a1-a8 corresponding to the 8 directional submaps and, based on a1-a8, performs weighted fusion of the 8 sorted directional submaps pic1,1-pic1,8 to obtain the first directional fusion map Img1. The terminal integrates the 8 fusion weights a1-a8 into fusion equivalent filters, through which the multiple directional submaps of the other frequency band images can be quickly fused into the corresponding second directional fusion maps. In some embodiments, if the number of frequency band submaps is 3, the terminal can integrate the 8 fusion weights a1-a8 into the fusion equivalent filters hc2 and hc3, process the other two frequency band submaps through hc2 and hc3, and obtain the second directional fusion map Img2 and the third directional fusion map Img3 corresponding to the other two frequency band submaps. The terminal acquires the first integral map Int1 of the first directional fusion map Img1, the second integral map Int2 of the second directional fusion map Img2, and the third integral map Int3 of the third directional fusion map Img3. The terminal splices the first integral feature vector of Int1, the second integral feature vector of Int2, and the third integral feature vector of Int3, obtaining the directional frequency band fusion feature X of the first target region. The terminal can input X into the target detection model and output the predicted label of the first target region through the model. When the predicted label indicates that the first target region includes the target object, the terminal highlights the outline of the first target region in the first video frame.
The technical solutions provided in the embodiments of this application perform well in game character recognition and trajectory tracking. While a game character rotates or is occluded, the technical solutions can correctly identify its position, which greatly improves the efficiency of numerical analysis of game characters. The recognition success rate is 99% in card games and over 95% in MOBA games.
Through the technical solutions provided in the embodiments of this application, the terminal can perform band filtering and directional filtering on the first target region of the first video frame, obtaining multiple frequency band submaps representing the band information of the first target region and multiple directional submaps representing its direction information. The directional frequency band fusion feature obtained by fusing the band information and the direction information can represent the characteristics of the first target region more completely; even if the target object moves slowly or rotates, the directional frequency band fusion feature still accurately represents the characteristics of the first target region. When the terminal subsequently performs target detection through the target detection model based on the directional frequency band fusion feature, a more accurate detection result can be obtained.
In addition, the terminal uses non-subsampled pyramid filters when filtering the first target region; because the downsampling step is removed, the output frequency band submaps have the same size as the first target region, which eliminates the post-downsampling image enlargement and registration steps and improves the accuracy of subsequent target detection.
Moreover, the terminal also uses non-subsampled filters when performing multi-directional filtering on the frequency band submaps, which avoids image mismatch caused by scale transformation during directional filtering and improves the accuracy of subsequent target detection.
Besides, when obtaining the directional frequency band fusion feature of the first target region, the terminal only needs to obtain and sort the multiple directional submaps corresponding to any one of the frequency band submaps; it does not need to process each band feature one by one, which reduces the terminal's computation.
More importantly, when the terminal determines that the first target region in the first video frame includes the target object, it can highlight the outline of the first target region, which helps technicians find the position of the target object in time and improves the efficiency of human-computer interaction.
After introducing the target detection method provided in the embodiments of this application, the training method of the target detection model is described below through steps 1101-1104. In the embodiments of this application, the target detection model can be trained by the terminal or by the server. Taking training by the server as an example, and referring to FIG. 11, the method includes:
1101. The server performs multi-band filtering on a sample region to be detected in a sample video frame to obtain multiple sample frequency band submaps.
The server can capture video images from different videos; technicians screen the captured video images and use those containing the target object as sample video frames. During screening, technicians can also add labels to the video images, a label indicating whether the corresponding region in the sample video frame contains the target object. The server can then use the labels as supervision to train the target detection model.
In a possible implementation, the terminal inputs the sample region into a band filter bank and performs multi-band filtering on the sample region through the multiple band filters in the bank, obtaining multiple frequency band submaps; the band filters are non-subsampled pyramid filters.
This implementation belongs to the same inventive concept as the corresponding implementation in step 402; for the implementation, see the relevant description of step 402, which is not repeated here.
1102. The server performs multi-directional filtering on the multiple sample frequency band submaps to obtain multiple sample directional submaps.
In a possible implementation, for each sample frequency band submap, the terminal inputs the sample frequency band submap into a directional filter bank and performs multi-directional filtering on it through the multiple directional filters in the bank, obtaining multiple directional features corresponding to the sample frequency band submap. The terminal inputs each directional feature into the corresponding reconstruction filter in the directional filter bank and, through the reconstruction filter, generates multiple directional submaps of the sample frequency band submap based on the input directional features; both the directional filters and the reconstruction filters are non-subsampled filters.
This implementation belongs to the same inventive concept as the corresponding implementation in step 403; for the implementation, see the relevant description of step 403, which is not repeated here.
1103. The server obtains the sample directional frequency band fusion feature of the sample region according to the multiple sample directional submaps.
In a possible implementation, the terminal acquires the energy of the multiple sample directional submaps corresponding to any one of the multiple sample frequency band submaps. The terminal sorts the multiple sample directional submaps corresponding to that sample frequency band submap in descending order of energy. Based on the sorted sample directional submaps of that sample frequency band submap and the sample directional submaps corresponding to the other sample frequency band submaps, the terminal obtains the sample directional frequency band fusion feature of the sample region.
This implementation belongs to the same inventive concept as the corresponding implementation in step 404; for the implementation, see the relevant description of step 404, which is not repeated here.
1104. The server trains the target detection model based on the sample directional frequency band fusion feature and the label of the sample region, the label indicating whether the sample region includes a sample object.
In a possible implementation, the server inputs the sample directional frequency band fusion feature into the target detection model, performs prediction through the model based on the feature, and outputs the predicted label of the sample region. Based on the difference information between the predicted label and the label of the sample region, the server updates the model parameters of the target detection model.
The following description takes the Adaboost model as the target detection model as an example.
The Adaboost model includes multiple sub-models Classifier t, where t is the number of sub-models. For the first sub-model Classifier 1, the server inputs the sample directional fusion feature t1 into Classifier 1; Classifier 1 fully connects and normalizes t1 and outputs the predicted sample label of the sample region, the predicted sample label indicating whether the sample region includes a sample object. The server constructs a loss function representing the difference between the predicted sample label and the label of the sample region to update the model parameters of Classifier 1.
In addition, if the predicted sample label that Classifier 1 predicts based on the sample directional fusion feature t1 is the same as the label of the sample region, the prediction difficulty of t1 is relatively low, and the server can reduce the training weight w1 of the sample directional frequency band fusion feature t1; if the predicted sample label is different from the label of the sample region, the prediction difficulty of t1 is relatively high, and the server can increase the training weight w1. The training weight determines how strongly the sample directional fusion feature updates the model parameters: the larger the training weight, the larger the parameter update amplitude.
After setting the training weight for this sample directional fusion feature, the server can obtain another sample video frame, perform the above steps 1101-1103 on it, and obtain its sample directional fusion feature t2. The server initializes the second sub-model Classifier 2 and trains it based on the sample directional fusion feature t1, its corresponding training weight w1, and the sample directional fusion feature t2, and so on. In this process, the training samples used by each subsequently trained sub-model are the sample directional fusion features used by the previously trained sub-model with their corresponding training weights added, which improves the target detection model's ability to recognize hard-to-distinguish regions.
After the t sub-models of the target detection model are trained, the server can test them on the same test set and record the accuracy of each of the t sub-models during testing. According to these accuracies, the server sets a corresponding confidence level for each of the t sub-models; in subsequent prediction with the target detection model, labels can be predicted based on the confidence levels corresponding to the t sub-models.
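One simple way to realize a confidence level that is positively correlated with test accuracy is to normalize the recorded accuracies; this particular mapping is a hypothetical choice, since the embodiments do not fix a formula.

```python
import numpy as np

test_accuracy = np.array([0.70, 0.80, 0.95])     # hypothetical accuracies of t = 3 sub-models
confidence = test_accuracy / test_accuracy.sum()
print(np.round(confidence, 2))                    # higher accuracy -> higher confidence
```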
It should be noted that the above description takes training by the server as an example. In other possible implementations, the target detection model can also be trained by the terminal, or through interaction between the terminal and the server; for example, the terminal collects training images for the target detection model and sends them to the server, and the server trains the model.
The technical solutions provided in the embodiments of this application are described below with reference to FIG. 12 and the possible implementations in steps 401-406 and 1101-1104.
Referring to FIG. 12, the upper half is the training flowchart of the target detection model, corresponding to steps 1101-1104. The server collects multiple sample regions on the sample video frame, including positive sample regions (regions including the sample object) and negative sample regions (background regions). Taking one sample region of the sample video frame as an example, the server performs multi-band filtering on the sample region through NSP (Non-Subsampled Pyramid) to obtain multiple sample frequency band submaps. The server performs multi-directional filtering on any sample frequency band submap k1 through NSDFB (Non-Subsampled Directional Filter Banks) to obtain the multiple sample directional submaps m1,j corresponding to k1, where 1 denotes the sample frequency band submap k1 and j is the number of sample directional submaps; in some embodiments, j = 8. The server sorts the 8 sample directional submaps m1,1-m1,8 according to their energies, obtaining the 8 sorted sample directional submaps pic1,1-pic1,8. According to the energies of m1,1-m1,8, the server obtains the 8 fusion weights a1-a8 corresponding to the 8 sample directional submaps and, based on a1-a8, performs weighted fusion of the 8 sorted sample directional submaps pic1,1-pic1,8 to obtain the first sample directional fusion map SImg1. The server integrates the 8 fusion weights a1-a8 into fusion equivalent filters, through which the multiple sample directional submaps of the other frequency band images can be quickly fused into the corresponding second sample directional fusion maps. In some embodiments, if the number of sample frequency band submaps is 3, the server can integrate the 8 fusion weights a1-a8 into the fusion equivalent filters hc2 and hc3, process the other two sample frequency band submaps through hc2 and hc3, and obtain the second sample directional fusion map SImg2 and the third sample directional fusion map SImg3 corresponding to the other two sample frequency band submaps. The server acquires the first sample integral map SInt1 of SImg1, the second sample integral map SInt2 of SImg2, and the third sample integral map SInt3 of SImg3. The server splices the first sample integral feature vector of SInt1, the second sample integral feature vector of SInt2, and the third sample integral feature vector of SInt3, obtaining the sample directional frequency band fusion feature Y of the sample region. The server can input Y into the target detection model and output the predicted label of the sample region through the model. Based on the difference information between the predicted label and the label of the sample region, the server trains the target detection model; for the training process, see the description of step 1104, which is not repeated here.
The lower half of FIG. 12 is the flowchart of target detection using the target detection model, corresponding to steps 401-406; for its description, see the above description of FIG. 10, which is not repeated here. The old position in FIG. 12 refers to the position of the target object in the second video frame, and the new position refers to the position of the target object in the first video frame.
FIG. 13 is a schematic structural diagram of a target detection apparatus according to an embodiment of this application. Referring to FIG. 13, the apparatus includes: a multi-band filtering module 1301, a multi-directional filtering module 1302, a feature acquisition module 1303, and an input module 1304.
The multi-band filtering module 1301 is configured to perform multi-band filtering on a first target region to obtain multiple frequency band submaps, the first target region being a region to be detected in a first video frame.
The multi-directional filtering module 1302 is configured to perform multi-directional filtering on the multiple frequency band submaps to obtain multiple directional submaps.
The feature acquisition module 1303 is configured to obtain a directional frequency band fusion feature of the first target region according to the multiple directional submaps.
The input module 1304 is configured to input the directional frequency band fusion feature into a target detection model and perform, through the target detection model, prediction based on the directional frequency band fusion feature to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
In a possible implementation, the multi-band filtering module 1301 is configured to input the first target region into a band filter bank and perform multi-band filtering on the first target region through the multiple band filters in the bank to obtain the multiple frequency band submaps.
In a possible implementation, the multi-directional filtering module 1302 is configured to, for any one of the multiple frequency band submaps, input the frequency band submap into a directional filter bank, perform multi-directional filtering on it through the multiple directional filters in the bank to obtain multiple directional features corresponding to the frequency band submap, input each directional feature into the corresponding reconstruction filter in the directional filter bank, and generate, through the reconstruction filter and based on the input directional features, multiple directional submaps of the frequency band submap.
In a possible implementation, the feature acquisition module 1303 is configured to acquire the energy of the multiple first directional submaps corresponding to any one of the multiple frequency band submaps, sort the multiple first directional submaps in descending order of energy to obtain multiple reference directional submaps, and obtain the directional frequency band fusion feature of the first target region based on the multiple reference directional submaps and the multiple second directional submaps, the second directional submaps being the directional submaps corresponding to the other frequency band submaps.
In a possible implementation, the feature acquisition module 1303 is configured to perform weighted fusion of the multiple reference directional submaps based on their corresponding fusion weights to obtain the first directional fusion map corresponding to the frequency band submap, each fusion weight being positively correlated with the energy of the corresponding first directional submap; fuse the multiple second directional submaps respectively based on the multiple fusion weights to obtain the multiple second directional fusion maps corresponding to the other frequency band submaps; and obtain the directional frequency band fusion feature based on the first directional fusion map and the multiple second directional fusion maps.
In a possible implementation, the feature acquisition module 1303 is configured to acquire the first integral map corresponding to the first directional fusion map, acquire the multiple second integral maps corresponding to the multiple second directional fusion maps, and splice the first integral feature vector with the multiple second integral feature vectors to obtain the directional frequency band fusion feature of the first target region, the first integral feature vector being the integral feature vector corresponding to the first integral map and the second integral feature vectors being the feature vectors corresponding to the second integral maps.
In a possible implementation, the apparatus further includes:
a first target region determination module, configured to determine a second target region, the second target region being the region in which the target object is located in a second video frame and the second video frame being a video frame whose display time is before the first video frame; and to offset the second target region based on the first video frame and the second video frame to obtain the first target region, the first target region being the region in the first video frame corresponding to the offset second target region.
In a possible implementation, the apparatus further includes:
a display module, configured to highlight, in response to the predicted label indicating that the first target region includes the target object, the outline of the first target region in the first video frame.
In a possible implementation, a training apparatus for the target detection model is configured to perform multi-band filtering on a sample region to be detected in a sample video frame to obtain multiple sample frequency band submaps; perform multi-directional filtering on the multiple sample frequency band submaps to obtain multiple sample directional submaps; obtain the sample directional frequency band fusion feature of the sample region according to the multiple sample directional submaps; and train the target detection model based on the sample directional frequency band fusion feature and the label of the sample region, the label indicating whether the sample region includes a sample object.
In a possible implementation, the training apparatus for the target detection model is configured to input the sample directional frequency band fusion feature into the target detection model, perform prediction through the model based on the feature, and output the predicted label of the sample region; and update the model parameters of the target detection model based on the difference information between the predicted label and the label of the sample region.
Through the technical solutions provided in the embodiments of this application, the terminal can perform band filtering and directional filtering on the first target region of the first video frame, obtaining multiple frequency band submaps representing the band information of the first target region and multiple directional submaps representing its direction information. The directional frequency band fusion feature obtained by fusing the band information and the direction information can represent the characteristics of the first target region more completely; even if the target object moves slowly or rotates, the directional frequency band fusion feature still accurately represents the characteristics of the first target region. When the terminal subsequently performs target detection through the target detection model based on the directional frequency band fusion feature, a more accurate detection result can be obtained.
An embodiment of this application provides a computer device for performing the above method; the computer device can be implemented as a terminal or a server. The structure of the terminal is introduced first:
FIG. 14 is a schematic structural diagram of a terminal according to an embodiment of this application. The terminal 1400 can be a smartphone, a tablet computer, a notebook computer, or a desktop computer. The terminal 1400 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
Generally, the terminal 1400 includes one or more processors 1401 and one or more memories 1402.
The processor 1401 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 1401 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 1401 may also include a main processor and a coprocessor: the main processor processes data in the wake-up state and is also called a CPU (Central Processing Unit); the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 1401 may further include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.
The memory 1402 may include one or more computer-readable storage media, which may be non-transitory. The memory 1402 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 stores at least one computer program, which is executed by the processor 1401 to implement the target detection method provided by the method embodiments of this application.
A person skilled in the art can understand that the structure shown in FIG. 14 does not constitute a limitation on the terminal 1400, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
上述计算机设备还可以实现为服务器,下面对服务器的结构进行介绍:
图15是本申请实施例提供的一种服务器的结构示意图,该服务器1500可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(Central Processing Units,CPU)1501和一个或多个的存储器1502,其中,所述一个或多个存储器1502中存储有至少一条计算机程序,所述至少一条计算机程序由所述一个或多个处理器1501加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器1500还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1500还可以包括其他用于实现设备功能的部件,在此不做赘述。
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including a computer program, where the computer program can be executed by a processor to complete the target detection method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product or computer program is also provided, including program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the computer-readable storage medium and executes it, causing the computer device to perform the target detection method provided in the above various optional implementations.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (15)

  1. A target detection method, executed by a computer device, the method comprising:
    performing multiband filtering on a first target region to obtain a plurality of band sub-images, the first target region being a region to be detected in a first video frame;
    performing multidirectional filtering on the plurality of band sub-images to obtain a plurality of directional sub-images;
    obtaining a direction-band fusion feature of the first target region according to the plurality of directional sub-images; and
    inputting the direction-band fusion feature into a target detection model, and performing prediction based on the direction-band fusion feature through the target detection model to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
  2. The method according to claim 1, wherein the performing multiband filtering on the first target region to be detected in the first video frame to obtain a plurality of band sub-images comprises:
    inputting the first target region into a band filter bank, and performing multiband filtering on the first target region through a plurality of band filters in the band filter bank to obtain the plurality of band sub-images.
  3. The method according to claim 1, wherein the performing multidirectional filtering on the plurality of band sub-images to obtain a plurality of directional sub-images comprises:
    for any band sub-image among the plurality of band sub-images, inputting the band sub-image into a directional filter bank, and performing multidirectional filtering on the band sub-image through a plurality of directional filters in the directional filter bank to obtain a plurality of directional features corresponding to the band sub-image; and
    inputting each of the directional features into a corresponding reconstruction filter in the directional filter bank, and generating, through the reconstruction filter, a plurality of directional sub-images of the band sub-image based on the input directional features.
  4. The method according to claim 1, wherein the obtaining a direction-band fusion feature of the first target region according to the plurality of directional sub-images comprises:
    obtaining energies of a plurality of first directional sub-images corresponding to any band sub-image among the plurality of band sub-images;
    sorting the plurality of first directional sub-images in descending order of energy to obtain a plurality of reference directional sub-images; and
    obtaining the direction-band fusion feature of the first target region based on the plurality of reference directional sub-images and a plurality of second directional sub-images, the second directional sub-images being directional sub-images corresponding to band sub-images, among the plurality of band sub-images, other than the band sub-image.
  5. The method according to claim 4, wherein the obtaining the direction-band fusion feature of the first target region based on the plurality of reference directional sub-images and a plurality of second directional sub-images comprises:
    performing weighted fusion on the plurality of reference directional sub-images based on a plurality of fusion weights corresponding to the plurality of reference directional sub-images to obtain a first directional fusion map corresponding to the band sub-image, the fusion weights being positively correlated with the energies of the corresponding first directional sub-images;
    fusing the plurality of second directional sub-images respectively based on the plurality of fusion weights to obtain a plurality of second directional fusion maps corresponding to the other band sub-images; and
    obtaining the direction-band fusion feature based on the first directional fusion map and the plurality of second directional fusion maps.
  6. The method according to claim 5, wherein the obtaining the direction-band fusion feature based on the first directional fusion map and the plurality of second directional fusion maps comprises:
    obtaining a first integral map corresponding to the first directional fusion map;
    obtaining a plurality of second integral maps corresponding to the plurality of second directional fusion maps; and
    concatenating a first integral feature vector with a plurality of second integral feature vectors to obtain the direction-band fusion feature of the first target region, the first integral feature vector being the integral feature vector corresponding to the first integral map, and the second integral feature vector being the feature vector corresponding to the second integral map.
  7. The method according to claim 1, wherein before the performing multiband filtering on the first target region to obtain a plurality of band sub-images, the method further comprises:
    determining a second target region, the second target region being a region where the target object is located in a second video frame, the second video frame being a video frame whose display time precedes that of the first video frame; and
    offsetting the second target region based on the first video frame and the second video frame to obtain the first target region, the first target region being a region in the first video frame corresponding to the offset second target region.
  8. The method according to claim 1, wherein after the obtaining the predicted label of the first target region, the method further comprises:
    in response to the predicted label indicating that the first target region includes the target object, highlighting a contour of the first target region in the first video frame.
  9. The method according to claim 1, wherein a training method of the target detection model comprises:
    performing multiband filtering on a sample region to be detected in a sample video frame to obtain a plurality of sample band sub-images;
    performing multidirectional filtering on the plurality of sample band sub-images to obtain a plurality of sample directional sub-images;
    obtaining a sample direction-band fusion feature of the sample region according to the plurality of sample directional sub-images; and
    training the target detection model based on the sample direction-band fusion feature and a label of the sample region, the label indicating whether the sample region includes a sample object.
  10. The method according to claim 9, wherein the training the target detection model based on the sample direction-band fusion feature and the label of the sample region comprises:
    inputting the sample direction-band fusion feature into the target detection model, performing prediction based on the sample direction-band fusion feature through the target detection model, and outputting a predicted label of the sample region; and
    updating model parameters of the target detection model based on difference information between the predicted label and the label of the sample region.
  11. A target detection apparatus, the apparatus comprising:
    a multiband filtering module, configured to perform multiband filtering on a first target region to obtain a plurality of band sub-images, the first target region being a region to be detected in a first video frame;
    a multidirectional filtering module, configured to perform multidirectional filtering on the plurality of band sub-images to obtain a plurality of directional sub-images;
    a feature acquisition module, configured to obtain a direction-band fusion feature of the first target region according to the plurality of directional sub-images; and
    an input module, configured to input the direction-band fusion feature into a target detection model, and perform prediction based on the direction-band fusion feature through the target detection model to obtain a predicted label of the first target region, the predicted label indicating whether the first target region includes a target object.
  12. The apparatus according to claim 11, wherein the multiband filtering module is configured to input the first target region into a band filter bank, and perform multiband filtering on the first target region through a plurality of band filters in the band filter bank to obtain the plurality of band sub-images.
  13. The apparatus according to claim 11, wherein the multidirectional filtering module is configured to: for any band sub-image among the plurality of band sub-images, input the band sub-image into a directional filter bank, and perform multidirectional filtering on the band sub-image through a plurality of directional filters in the directional filter bank to obtain a plurality of directional features corresponding to the band sub-image; and input each of the directional features into a corresponding reconstruction filter in the directional filter bank, and generate, through the reconstruction filter, a plurality of directional sub-images of the band sub-image based on the input directional features.
  14. A computer device, comprising one or more processors and one or more memories, wherein at least one computer program is stored in the one or more memories, the computer program being loaded and executed by the one or more processors to implement the target detection method according to any one of claims 1 to 10.
  15. A computer-readable storage medium, wherein at least one computer program is stored in the computer-readable storage medium, the computer program being loaded and executed by a processor to implement the target detection method according to any one of claims 1 to 10.
PCT/CN2022/070095 2021-01-12 2022-01-04 Target detection method, apparatus, device, and storage medium WO2022152009A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/982,101 US20230053911A1 (en) 2021-01-12 2022-11-07 Detecting an object in an image using multiband and multidirectional filtering

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110033937.9A CN112699832B (zh) Target detection method, apparatus, device, and storage medium
CN202110033937.9 2021-01-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/982,101 Continuation US20230053911A1 (en) 2021-01-12 2022-11-07 Detecting an object in an image using multiband and multidirectional filtering

Publications (1)

Publication Number Publication Date
WO2022152009A1 true WO2022152009A1 (zh) 2022-07-21

Family

ID=75514046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070095 WO2022152009A1 (zh) Target detection method, apparatus, device, and storage medium

Country Status (3)

Country Link
US (1) US20230053911A1 (zh)
CN (1) CN112699832B (zh)
WO (1) WO2022152009A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699832B (zh) * 2021-01-12 2023-07-04 Tencent Technology (Shenzhen) Co., Ltd. Target detection method, apparatus, device, and storage medium
CN113344198B (zh) * 2021-06-09 2022-08-26 Beijing Sankuai Online Technology Co., Ltd. Model training method and apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120213438A1 (en) * 2011-02-23 2012-08-23 Rovi Technologies Corporation Method and apparatus for identifying video program material or content via filter banks
CN105975911B (zh) * 2016-04-28 2019-04-19 Dalian Minzu University Filter-based energy-aware moving salient object detection method
CN108154147A (zh) 2018-01-15 2018-06-12 Army Academy of Armored Forces, PLA Region-of-interest detection method based on a visual attention model
CN108921809B (zh) * 2018-06-11 2022-02-18 Shanghai Ocean University Multispectral and panchromatic image fusion method based on spatial frequency under a holistic principle
CN110738149A (zh) 2019-09-29 2020-01-31 Shenzhen Ubtech Technology Co., Ltd. Target tracking method, terminal, and storage medium
CN111489378B (zh) * 2020-06-28 2020-10-16 Tencent Technology (Shenzhen) Co., Ltd. Video frame feature extraction method and apparatus, computer device, and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588445A (zh) * 2004-07-22 2005-03-02 上海交通大学 基于方向滤波器组的图像融合方法
CN102005037A (zh) * 2010-11-12 2011-04-06 湖南大学 结合多尺度双边滤波与方向滤波的多模图像融合方法
CN107783096A (zh) * 2016-08-25 2018-03-09 中国科学院声学研究所 一种用于方位历程图显示的二维背景均衡方法
CN107832798A (zh) * 2017-11-20 2018-03-23 西安电子科技大学 基于nsct阶梯网模型的极化sar图像目标检测方法
US20200183002A1 (en) * 2018-12-11 2020-06-11 Hyundai Motor Company System and method for fusing surrounding v2v signal and sensing signal of ego vehicle
CN111666854A (zh) * 2020-05-29 2020-09-15 武汉大学 融合统计显著性的高分辨率sar影像车辆目标检测方法
CN112699832A (zh) * 2021-01-12 2021-04-23 腾讯科技(深圳)有限公司 目标检测方法、装置、设备以及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597313A (zh) * 2023-05-25 2023-08-15 清华大学深圳国际研究生院 一种基于改进YOLOv7的舰船光学图像尾迹检测方法
CN116597313B (zh) * 2023-05-25 2024-05-14 清华大学深圳国际研究生院 一种基于改进YOLOv7的舰船光学图像尾迹检测方法

Also Published As

Publication number Publication date
CN112699832B (zh) 2023-07-04
US20230053911A1 (en) 2023-02-23
CN112699832A (zh) 2021-04-23

Similar Documents

Publication Publication Date Title
Lyu et al. Multi-oriented scene text detection via corner localization and region segmentation
WO2022152009A1 (zh) Target detection method, apparatus, device, and storage medium
CN111754596B (zh) Editing model generation and face image editing method, apparatus, device, and medium
CN111444828B (zh) Model training method, target detection method, apparatus, and storage medium
Devries et al. Multi-task learning of facial landmarks and expression
CN108304835A (zh) Text detection method and apparatus
CN111709409A (zh) Face liveness detection method, apparatus, device, and medium
Zhang et al. Multi-scale adversarial network for vehicle detection in UAV imagery
US20220222918A1 (en) Image retrieval method and apparatus, storage medium, and device
CN112598643A (zh) Deepfake image detection and model training method, apparatus, device, and medium
US20220254134A1 (en) Region recognition method, apparatus and device, and readable storage medium
CN108229344A (zh) Image processing method and apparatus, electronic device, computer program, and storage medium
CN110163111A (zh) Face recognition-based number calling method and apparatus, electronic device, and storage medium
CN105303163B (zh) Target detection method and detection apparatus
CN110222718A (zh) Image processing method and apparatus
Wu et al. A deep residual convolutional neural network for facial keypoint detection with missing labels
CN114332473A (zh) Target detection method and apparatus, computer device, storage medium, and program product
CN113011387A (zh) Network training and face liveness detection method, apparatus, device, and storage medium
CN113569607A (zh) Action recognition method, apparatus, device, and storage medium
Zhang et al. Crowd counting based on attention-guided multi-scale fusion networks
CN111694954A (zh) Image classification method and apparatus, and electronic device
Zhang et al. Efficient feature fusion network based on center and scale prediction for pedestrian detection
CN113096080A (zh) Image analysis method and system
CN113822134A (zh) Video-based instance tracking method, apparatus, device, and storage medium
CN111429414B (zh) Artificial intelligence-based lesion image sample determination method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22738890

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.11.2023)