WO2019184520A1 - Video feature extraction method and apparatus - Google Patents

Video feature extraction method and apparatus

Info

Publication number
WO2019184520A1
WO2019184520A1 (PCT/CN2018/125496)
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
video
video feature
image
pooling
Prior art date
Application number
PCT/CN2018/125496
Other languages
English (en)
French (fr)
Inventor
何轶
李磊
杨成
李根
李亦锬
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Priority to US16/971,760 (US11455802B2)
Priority to JP2020545849A (JP6982194B2)
Priority to SG11202008272RA
Publication of WO2019184520A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/28: Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/50: Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present disclosure relates to the field of video processing technologies, and in particular, to a method and an apparatus for extracting video features.
  • The problem videos mainly include: videos that duplicate existing videos in the platform's video database, videos that duplicate videos in the copyright database (for example, videos for which royalties must be paid), and videos whose display is inappropriate or prohibited. Therefore, the large volume of videos uploaded by users needs to be compared and filtered quickly.
  • The core technique for improving the speed and accuracy of video comparison is reasonable extraction of the features of video frames and judgment of their similarity.
  • The video feature extraction method includes the following steps: performing frame extraction on a video object to obtain one or more frame images; performing multiple types of pooling on each of the frame images step by step to obtain an image feature of the frame image, wherein the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and determining a video feature based on the image features of the one or more frame images.
  • the object of the present disclosure can also be further achieved by the following technical measures.
  • Performing the multiple types of pooling on each of the frame images step by step comprises: performing the multiple types of pooling step by step based on multiple color channels of the frame image.
  • In the foregoing video feature extraction method, performing multiple types of pooling on each of the frame images step by step to obtain the image features of the frame image comprises: determining a matrix according to the frame image, and using the multiple types of pooling to generate smaller matrices step by step until the matrix is reduced to one containing only a single point, the image features being determined according to that single-point matrix.
  • In the foregoing video feature extraction method, performing multiple types of pooling on each of the frame images step by step to obtain the image features of the frame image comprises the following steps: (a) determining, according to one of the frame images, a first matrix having a first matrix dimension and a second matrix dimension, where a point in the first matrix corresponds to a pixel in the frame image, the value of a point in the first matrix is a first vector, and the first vector is a 3-dimensional vector representing the brightness of three color channels of the corresponding pixel; (b) setting a plurality of first blocks on the first matrix, each first block containing a plurality of the first vectors, where the number of first blocks along the first matrix dimension is less than the number of points the first matrix contains along the first matrix dimension and the number of first blocks along the second matrix dimension is less than the number of points the first matrix contains along the second matrix dimension, and, for each first block, computing the maximum, minimum and average of each dimension of the plurality of first vectors contained in the block to obtain a 9-dimensional second vector; (c) determining a second matrix according to the second vectors corresponding to the first blocks, where a point in the second matrix corresponds to a first block and takes the corresponding second vector as its value; (d) repeating steps (b) and (c) until the first matrix has been reduced to a single point whose value is a 3^N-dimensional vector, where N is a positive integer, and determining the 3^N-dimensional vector as the image feature of the frame image.
  • In the video feature extraction method, determining a video feature according to the image features of the one or more frame images comprises: binarizing the image features to obtain binarized image features; and determining the video feature according to the binarized image features of the one or more frame images.
  • In the foregoing video feature extraction method, binarizing an image feature to obtain a binarized image feature comprises the following steps: generating multiple groups according to the image feature, each group containing multiple elements of the image feature; summing the multiple elements in each group to obtain a summed value for each group; pairing the groups two by two to obtain multiple group pairs; for each group pair, comparing the magnitudes of the summed values of the two groups in the pair and generating one binarized image feature bit based on the comparison result; and determining the binarized image feature of the frame image according to the image feature bits of the multiple group pairs.
  • the object of the present disclosure is also achieved by the following technical solutions.
  • the video feature library construction method according to the present disclosure includes the following steps: extracting video features of a video object according to the video feature extraction method of any of the foregoing; storing the video features into a video feature library.
  • the object of the present disclosure is also achieved by the following technical solutions.
  • The video feature extraction apparatus includes: a frame extraction module configured to perform frame extraction on a video object to obtain one or more frame images; an image feature determination module configured to perform multiple types of pooling on each of the frame images step by step to obtain the image features of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and a video feature determination module configured to determine a video feature according to the image features of the one or more frame images.
  • the object of the present disclosure can also be further achieved by the following technical measures.
  • the aforementioned video feature extraction apparatus further includes means for performing the steps of any of the foregoing video feature extraction methods.
  • The video feature library construction apparatus includes: a video feature extraction module configured to extract the video feature of a video object according to any one of the foregoing video feature extraction methods; a video feature storage module configured to store the video feature into a video feature library; and the video feature library, configured to store the video feature.
  • A video feature extraction hardware device comprises: a memory for storing non-transitory computer readable instructions; and a processor for running the computer readable instructions such that, when executed by the processor, they implement any of the foregoing video feature extraction methods.
  • A computer readable storage medium is used for storing non-transitory computer readable instructions that, when executed by a computer, cause the computer to perform any of the aforementioned video feature extraction methods.
  • a terminal device comprising any of the foregoing video feature extraction devices.
  • FIG. 1 is a block flow diagram of a video feature extraction method in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a block diagram of a process for performing multi-type pooling processing step by step according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram of a process for binarizing image features by using a random projection method according to an embodiment of the present disclosure.
  • FIG. 4 is a flow diagram of one specific example of extracting image features of a frame image using the method of the present disclosure.
  • FIG. 5 is a block diagram of a video feature library construction method according to an embodiment of the present disclosure.
  • FIG. 6 is a structural block diagram of a video feature extraction apparatus according to an embodiment of the present disclosure.
  • FIG. 7 is a structural block diagram of a video feature library construction apparatus according to an embodiment of the present disclosure.
  • FIG. 8 is a hardware block diagram of a video feature extraction hardware device in accordance with an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a computer readable storage medium in accordance with an embodiment of the present disclosure.
  • FIG. 10 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic flow chart of an embodiment of a video feature extraction method according to the present disclosure.
  • a video feature extraction method of an example of the present disclosure mainly includes the following steps:
  • In step S11, frame extraction is performed on the video object to obtain one or more frame images.
  • the type of the video object is not limited, and may be a video signal or a video file. Thereafter, the process proceeds to step S12.
  • In step S12, multiple types of pooling are performed step by step on each frame image to obtain the image features of the frame image.
  • Pooling is a dimensionality reduction method in the field of convolutional neural networks, and so-called multiple types of pooling include maximum pooling, minimum pooling, and average pooling. Thereafter, the processing proceeds to step S13.
  • various types of pooling may be performed step by step based on a plurality of color channels of the frame image to obtain image features according to multiple color channels of the frame image.
  • In step S13, the video feature of the video object is determined according to the image features corresponding to the one or more frame images. Specifically, the image features may be combined in the chronological order of the frame images to obtain the video feature, as in the sketch below.
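  • As a rough sketch of step S13 (an illustration under the assumption that each image feature is a plain numeric vector; video_feature is a name introduced here, not taken from the patent), combining the image features in chronological order can be as simple as concatenation:

      # Minimal sketch of step S13: concatenate per-frame image features in
      # the chronological order of the frames (assumed representation, not
      # the patent's reference implementation).
      def video_feature(frame_features):
          """frame_features: list of per-frame feature vectors, in time order."""
          return [value for feature in frame_features for value in feature]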
  • The video feature extraction method proposed by the present disclosure, by performing multiple types of pooling step by step on the frame images obtained by frame extraction from the video to generate video features, can greatly improve the accuracy and efficiency of video feature extraction, and can improve the quality and robustness of the resulting video features.
  • In an embodiment of the present disclosure, performing multiple types of pooling on a frame image step by step includes: determining a matrix according to the frame image, and using the multiple types of pooling to generate progressively smaller matrices step by step until the matrix is reduced to one containing only a single point (the "points" of a matrix may also be called its "elements"); the image features of the frame image are then determined from that single-point matrix.
  • FIG. 2 is a schematic flow chart of a multi-type pooling process step by step according to an embodiment of the video feature extraction method of the present disclosure.
  • the step-by-step multi-type pooling process in step S12 provided by one embodiment of the video feature extraction method of the present disclosure specifically includes the following steps:
  • Step (a): a first matrix having a first matrix dimension and a second matrix dimension (in other words, having a length direction and a width direction) is determined according to a frame image. Suppose the frame image is x pixels long and y pixels wide, where x and y are positive integers.
  • A point in the first matrix (points in a matrix may also be called elements, but to distinguish them from the elements of a vector, the elements of a matrix are referred to as "points" below) corresponds to a pixel in the frame image, so the first matrix has length x in the first matrix dimension and length y in the second matrix dimension (i.e., it is an x*y matrix). Here, the length of a matrix in the first/second matrix dimension denotes the number of points the matrix contains along that dimension.
  • The value of each point in the first matrix is a 3-dimensional vector, defined as the first vector, which represents the brightness of the three color channels of the corresponding pixel in the frame image.
  • Note that when the color mode of the video object is the red-green-blue mode (RGB mode), the three channels red, green and blue may be used; however, these three channels are not mandatory. For example, the channels may be chosen according to the color mode used by the video object, and the number of selected color channels need not even be three; for instance, two of the three RGB channels may be selected. Thereafter, the process proceeds to step (b).
  • Step (b): a plurality of first blocks are set on the first matrix (each block is in effect a pooling window, so a first block may also be called a first pooling window); say x1*y1 first blocks are set, where x1 and y1 are positive integers, and each first block contains multiple points of the first matrix (that is, multiple first vectors).
  • The number of first blocks along the first matrix dimension is less than the length of the first matrix in the first matrix dimension (that is, less than the number of points the first matrix contains along that dimension), and the number of first blocks along the second matrix dimension is less than the length of the first matrix in the second matrix dimension (that is, less than the number of points the first matrix contains along that dimension); in other words, x1 < x and y1 < y.
  • For each first block, the maximum, minimum and average of each dimension of the multiple first vectors contained in the block are computed, yielding a 9-dimensional vector corresponding to the block; this 9-dimensional vector is defined as the second vector.
  • Note that the first blocks may partially overlap one another, that is, they may or may not contain the same points. Thereafter, the process proceeds to step (c).
  • Specifically, when setting the first blocks, the first matrix dimension of the first matrix may be divided uniformly into x1 segments of equal length, with adjacent segments containing some of the same points (partial overlap); in the same way, the second matrix dimension of the first matrix is divided into y1 segments, and combining the x1 segments with the y1 segments yields the x1*y1 first blocks of the first matrix.
  • Note that when every first block has the same size and the same spacing (two adjacent first blocks may overlap), the foregoing process of setting multiple first blocks on the first matrix and computing the second vector of each block is in fact equivalent to scanning (or sweeping) a single pooling window across the entire first matrix at a fixed stride and, at each scan position, computing the second vector of the region covered by the window.
  • Step (c): a second matrix is determined according to the x1*y1 first blocks and the second vector corresponding to each first block. A point in the second matrix corresponds to a first block; when x1*y1 first blocks are set, the second matrix has length x1 in the first matrix dimension and length y1 in the second matrix dimension (i.e., it is an x1*y1 matrix). The value of each point in the second matrix is the second vector of the corresponding first block. Thereafter, the process proceeds to step (d).
  • Note that when determining the second matrix, the points of the second matrix must correspond to the first blocks in a definite order; as a specific example, the points of the second matrix may be arranged according to the positions of the respective first blocks within the first matrix.
  • Step (d): steps (b) and (c) are repeated: from the second matrix, which contains x1*y1 points each valued as a 9-dimensional vector, a third matrix is obtained containing x2*y2 points each valued as a 27-dimensional vector (where x2 is a positive integer less than x1 and y2 is a positive integer less than y1); from that third matrix, a fourth matrix is obtained containing x3*y3 points each valued as an 81-dimensional vector (where x3 is a positive integer less than x2 and y3 is a positive integer less than y2); and so on, until the first matrix (that is, the frame image) has been reduced to a 1*1 N-th matrix (in effect, the matrix has been reduced to a single point), where N is a positive integer. The N-th matrix contains only one point, whose value is a 3^N-dimensional vector; this 3^N-dimensional vector is determined as the image feature of the frame image.
  • Note that in step (d), each time blocks are set, they should be set in a manner appropriate to the size of the current matrix, so as to accommodate the step-by-step reduction of the matrix's first and second matrix dimensions.
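  • The step-by-step pooling of steps (a) through (d) can be sketched in Python/NumPy as follows. This is a minimal illustration under assumed square pooling windows swept at a fixed stride (the patent only requires that each level set fewer blocks than the matrix has points per dimension); pool_level, image_feature and the schedule parameter are names introduced here for illustration:

      import numpy as np

      def pool_level(mat, win, stride):
          """One pooling level (steps (b)-(c)). mat has shape (h, w, d), each
          point holding a d-dimensional vector; returns (h', w', 3*d), where
          each output point holds the per-dimension maximum, minimum and
          average of the points covered by one window."""
          h, w, d = mat.shape
          h2 = (h - win) // stride + 1
          w2 = (w - win) // stride + 1
          out = np.empty((h2, w2, 3 * d))
          for i in range(h2):
              for j in range(w2):
                  block = mat[i * stride:i * stride + win,
                              j * stride:j * stride + win].reshape(-1, d)
                  out[i, j] = np.concatenate(
                      [block.max(axis=0), block.min(axis=0), block.mean(axis=0)])
          return out

      def image_feature(frame, schedule):
          """Steps (a)-(d). frame is an (x, y, 3) array of per-pixel channel
          brightness (the first matrix); schedule lists one (win, stride) pair
          per level, chosen so the final level leaves a 1*1 matrix."""
          mat = frame.astype(float)
          for win, stride in schedule:
              mat = pool_level(mat, win, stride)
          assert mat.shape[:2] == (1, 1), "schedule must reduce the matrix to one point"
          return mat[0, 0]  # 3**(len(schedule) + 1) dimensions (the patent's 3**N)

  • For example, a 9*9 frame with schedule [(3, 3), (3, 3)] is reduced 9 -> 3 -> 1, and the returned vector is 27-dimensional, matching the tripling of the point dimension at each level.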
  • the method may further include: performing binarization processing on the determined image feature to obtain a binarized image feature, wherein the binarized image feature is a bit string composed of 0/1;
  • the video features are determined based on the resulting binarized image features.
  • the image feature is binarized, which can compress the storage of video features and accelerate the process of similarity calculation of video comparison.
  • binarization processing is also beneficial to the index library recall process of video comparison.
  • FIG. 3 is a schematic block diagram of binarizing image features by using a random projection method according to an embodiment of the video feature extraction method of the present disclosure.
  • the process of binarizing image features by using a random projection method of the example of the present disclosure mainly includes the following steps:
  • In step S21, to generate a binarized image feature of length n, 2n groups are generated according to the image feature, each group containing multiple elements of the image feature (that is, each group contains the values of multiple dimensions of the image feature), where n is a positive integer. Thereafter, the process proceeds to step S22.
  • Note that which elements a group contains is arbitrary, and two different groups may contain some of the same elements. However, to facilitate video comparison, the specific elements contained in each group may be preset, or the groups may be generated in the same way for multiple video objects.
  • In this example, every group contains the same number of elements; it should be noted, however, that the groups may in fact contain different numbers of elements.
  • In step S22, the multiple elements contained in each group are summed to obtain a summed value for each group. Thereafter, the processing proceeds to step S23.
  • In step S23, the 2n groups are paired two by two to obtain n group pairs. Thereafter, the processing proceeds to step S24.
  • Specifically, the 2n groups may be sorted (or numbered) in advance, and each two adjacent groups paired.
  • In step S24, the n group pairs are each compared: the summed values of the two groups in a pair are compared, and one binarized image feature bit is generated according to the comparison result. Thereafter, the processing proceeds to step S25.
  • Specifically, in the example where the groups have been sorted (or numbered) in advance, within a group pair, if the summed value of the earlier-ranked group is greater than that of the later-ranked group, a binarized image feature bit of value 1 is generated; otherwise, a bit of value 0 is generated. Note that the way the binarized image feature bits are generated is not limited; for example, a bit of value 1 may instead be generated when the summed value of the earlier-ranked group is smaller than that of the later-ranked group.
  • In step S25, the n binarized image feature bits of the n group pairs are assembled into the binarized image feature of the frame image, of length n.
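  • The random-projection binarization of steps S21 through S25 can be sketched as follows. The fixed random seed stands in for the patent's requirement that the groups be preset or generated the same way for every video object; binarize and elems_per_group are names introduced here for illustration:

      import numpy as np

      def binarize(feature, n, elems_per_group=8, seed=0):
          """feature: a 1-D image feature (e.g. the pooled vector);
          returns an n-bit 0/1 array, the binarized image feature."""
          feature = np.asarray(feature, dtype=float)
          rng = np.random.default_rng(seed)
          # S21: form 2n groups of element indices; different groups may share
          # elements, so random draws with replacement are acceptable here.
          groups = rng.integers(0, feature.size, size=(2 * n, elems_per_group))
          # S22: sum the elements of each group.
          sums = feature[groups].sum(axis=1)
          # S23-S24: pair adjacent groups; emit 1 when the earlier-ranked
          # group's sum is the larger, otherwise 0.
          bits = (sums[0::2] > sums[1::2]).astype(np.uint8)
          # S25: the n bits form the length-n binarized image feature.
          return bits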
  • FIG. 4 is a schematic flowchart of a process of extracting image features of a specific frame image by using the video feature extraction method of the present disclosure.
  • the steps of extracting a specific example of image features of a frame image provided by an embodiment of the present disclosure are as follows:
  • Step S31: for a 243*243 frame image (243 pixels long, 243 pixels wide) obtained by sampling the video object, each pixel has three channels, red, green and blue; in FIG. 4, I, II and III identify the red, green and blue channels respectively.
  • A first matrix is defined according to the frame image: each point of the first matrix corresponds to the pixel at the same position in the frame image, and the value of each point is determined from the brightness values of the red, green and blue channels of that pixel, yielding a 243*243 first matrix whose points take 3-dimensional vector values.
  • Step S32: a 13*13 matrix block (the matrix block may also be called a pooling window) is swept across the first matrix;
  • the matrix block moves 3 points at a time along the length or width direction, sweeping over all points in turn, and the maximum, minimum and average of each dimension of the points covered by the block (in effect, the brightnesses of the three color channels) are computed, yielding a 9-dimensional vector at each position;
  • after the entire first matrix has been processed, an 81*81 second matrix is obtained whose points take 9-dimensional vector values.
  • Step S33: step S32 is repeated: a 10*10 matrix block is swept across the second matrix with a stride of 3 points, yielding a 27*27 third matrix whose points take 27-dimensional vector values; a 6*6 matrix block is swept across the third matrix with a stride of 2 points, yielding a 9*9 fourth matrix whose points take 81-dimensional vector values; and so on, until a 1*1 single-point matrix is obtained. The point contained in this single-point matrix takes the value of a 729-dimensional vector, which is defined as the pooled vector.
  • In step S34, the pooled vector is binarized by the random projection method to obtain the binarized image feature of the frame image.
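  • As a quick consistency check on this example (an illustration, not code from the patent), each pooling level replaces a point's vector with its per-dimension maximum, minimum and average, tripling the vector length; the five levels that shrink the matrix 243 -> 81 -> 27 -> 9 -> 3 -> 1 therefore take the 3-dimensional pixel vectors to a 3 * 3^5 = 729-dimensional pooled vector:

      sizes = [243, 81, 27, 9, 3, 1]  # matrix side length at each level
      dim = 3                         # per-point vector length at level 0
      for _ in sizes[1:]:
          dim *= 3                    # max/min/average triple the dimension
      print(dim)                      # 729, the pooled vector's length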
  • FIG. 5 is a schematic flowchart of an embodiment of a video feature library construction method according to the present disclosure.
  • a video feature library construction method of an example of the present disclosure mainly includes the following steps:
  • Step S41 extracting video features of the video object according to the steps of the video feature extraction method of the foregoing example of the present disclosure. Thereafter, the processing proceeds to step S42.
  • step S42 the video features of the video object are stored in the video feature library.
  • Note that the video features in a given video feature library should all be obtained by the same feature extraction method; that is, during the video feature extraction of step S41, frame extraction is performed in the same manner in step S11, the frame images undergo the multiple types of pooling step by step in the same manner in step S12, and the image features are assembled into video features in the same manner in step S13.
  • In addition, the video feature library may be updated at any time as time goes on.
  • FIG. 6 is a schematic structural block diagram of an embodiment of a video feature extraction apparatus according to the present disclosure.
  • the video feature extraction apparatus 100 of the example of the present disclosure mainly includes:
  • the frame extraction module 110 is configured to perform frame extraction on the video object to obtain one or more frame images;
  • the image feature determining module 120 is configured to perform multiple types of pooling step by step on each frame image to obtain the image features of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling;
  • the video feature determining module 130 is configured to determine a video feature according to the plurality of image features corresponding to the one or more frame images.
  • the image feature determining module 120 is further configured to perform multi-type pooling on the frame image step by step according to the specific steps shown in the foregoing embodiment of the video feature extraction method of the present disclosure.
  • Further, the video feature extraction apparatus 100 of the example of the present disclosure also includes a binarization module (not shown), configured to binarize the image features according to the specific steps shown in the foregoing embodiments of the video feature extraction method of the present disclosure.
  • In this case, the video feature determining module 130 is configured to determine the video feature according to the binarized image features.
  • FIG. 7 is a schematic structural diagram of an embodiment of a video feature library construction apparatus according to the present disclosure.
  • the video feature library construction apparatus 200 of the example of the present disclosure mainly includes:
  • The video feature extraction module 201 includes the frame extraction module 110, image feature determining module 120 and video feature determining module 130 of the foregoing video feature extraction apparatus of the example of the present disclosure, may include the binarization module, and is configured to extract the video features of a video object according to the steps of the video feature extraction method of the foregoing example of the present disclosure.
  • the video feature storage module 202 is configured to store the video feature into the video feature library.
  • the video feature library 203 is configured to store video features of respective video objects.
  • FIG. 8 is a hardware block diagram illustrating a video feature extraction hardware device in accordance with an embodiment of the present disclosure.
  • a video feature extraction hardware device 300 in accordance with an embodiment of the present disclosure includes a memory 301 and a processor 302.
  • the components in video feature extraction hardware device 300 are interconnected by a bus system and/or other form of connection mechanism (not shown).
  • the memory 301 is for storing non-transitory computer readable instructions.
  • memory 301 can include one or more computer program products, which can include various forms of computer readable storage media, such as volatile memory and/or nonvolatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache or the like.
  • the nonvolatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, or the like.
  • the processor 302 can be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and can control other components in the video feature extraction hardware device 300 to perform the desired functions.
  • the processor 302 is configured to execute the computer readable instructions stored in the memory 301 such that the video feature extraction hardware device 300 performs the video feature extraction method of the foregoing embodiments of the present disclosure. All or part of the steps.
  • FIG. 9 is a schematic diagram illustrating a computer readable storage medium in accordance with an embodiment of the present disclosure.
  • The computer readable storage medium 400 according to an embodiment of the present disclosure has non-transitory computer readable instructions 401 stored thereon.
  • When the non-transitory computer readable instructions 401 are run by a processor, all or some of the steps of the video feature extraction methods of the foregoing embodiments of the present disclosure are performed.
  • FIG. 10 is a schematic diagram showing a hardware structure of a terminal device according to an embodiment of the present disclosure.
  • The terminal device may be implemented in various forms. The terminal device in the present disclosure may include, but is not limited to, mobile terminal devices such as mobile phones, smartphones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), navigation devices, in-vehicle terminal devices, in-vehicle display terminals and in-vehicle electronic rearview mirrors, as well as fixed terminal devices such as digital TVs, desktop computers and the like.
  • The terminal device 1100 may include a wireless communication unit 1110, an A/V (audio/video) input unit 1120, a user input unit 1130, a sensing unit 1140, an output unit 1150, a memory 1160, an interface unit 1170, a controller 1180, a power supply unit 1190 and the like.
  • Figure 10 illustrates a terminal device having various components, but it should be understood that not all illustrated components are required to be implemented. More or fewer components can be implemented instead.
  • the wireless communication unit 1110 allows radio communication between the terminal device 1100 and a wireless communication system or network.
  • the A/V input unit 1120 is for receiving an audio or video signal.
  • the user input unit 1130 can generate key input data according to a command input by the user to control various operations of the terminal device.
  • the sensing unit 1140 detects the current state of the terminal device 1100, the location of the terminal device 1100, the presence or absence of a user's touch input to the terminal device 1100, the orientation of the terminal device 1100, the acceleration or deceleration movement and direction of the terminal device 1100, and the like, and A command or signal for controlling the operation of the terminal device 1100 is generated.
  • the interface unit 1170 serves as an interface through which at least one external device can connect with the terminal device 1100.
  • Output unit 1150 is configured to provide an output signal in a visual, audio, and/or tactile manner.
  • the memory 1160 may store a software program or the like that performs processing and control operations performed by the controller 1180, or may temporarily store data that has been output or is to be output.
  • Memory 1160 can include at least one type of storage medium.
  • the terminal device 1100 can cooperate with a network storage device that performs a storage function of the memory 1160 through a network connection.
  • Controller 1180 typically controls the overall operation of the terminal device. Additionally, the controller 1180 can include a multimedia module for reproducing or playing back multimedia data.
  • the controller 1180 can perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as a character or an image.
  • the power supply unit 1190 receives external power or internal power under the control of the controller 1180 and provides appropriate power required to operate the various components and components.
  • Various embodiments of the video feature extraction methods proposed by the present disclosure may be implemented in a computer readable medium using, for example, computer software, hardware, or any combination thereof.
  • For hardware implementation, the various embodiments of the video feature extraction method proposed by the present disclosure may be implemented by using at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor or an electronic unit designed to perform the functions described herein; in some cases, the various embodiments of the video feature extraction method proposed by the present disclosure may be implemented in the controller 1180.
  • various implementations of the video feature extraction methods proposed by the present disclosure can be implemented with separate software modules that allow for the execution of at least one function or operation.
  • the software code can be implemented by a software application (or program) written in any suitable programming language, which can be stored in memory 1160 and executed by controller 1180.
  • The video feature extraction method, apparatus, hardware device, computer readable storage medium and terminal device according to the embodiments of the present disclosure, by performing multiple types of pooling step by step on the frame images obtained by frame extraction from a video to generate video features, can greatly improve the accuracy and efficiency of video feature extraction and can improve the quality and robustness of the resulting video features, so that video comparison, video retrieval, video deduplication and video content monitoring based on the video features obtained by the video feature extraction method of the present disclosure have higher accuracy, higher efficiency and better robustness.
  • The word "exemplary" does not mean that the described example is preferred or better than other examples.

Abstract

The present disclosure relates to a video feature extraction method and apparatus. The method includes: performing frame extraction on a video object to obtain one or more frame images; performing multiple types of pooling on each frame image step by step to obtain an image feature of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and determining a video feature according to the image features of the one or more frame images.

Description

Video feature extraction method and apparatus
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 201810271774.6, filed on March 29, 2018, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of video processing technologies, and in particular to a video feature extraction method and apparatus.
BACKGROUND
In today's multimedia information society, users upload massive numbers of videos to video platforms every day. Most of these videos are normal, valuable videos, but there are also some problem videos, which mainly include: videos that duplicate existing videos in the platform's video database, videos that duplicate videos in the copyright database (for example, videos for which royalties must be paid), and videos whose display is inappropriate or prohibited. The large volume of user-uploaded videos therefore needs to be compared and filtered quickly. The core technique for improving the speed and accuracy of video comparison is reasonable extraction of video-frame features and judgment of their similarity.
To improve comparison speed and accuracy, a video feature that can characterize a video needs to be generated for it, the aim being to judge how similar two videos are by comparing their video features. The video feature extraction method and the quality of the video features determine the efficiency and accuracy of video comparison.
SUMMARY
An object of the present disclosure is to provide a new video feature extraction method and apparatus.
The object of the present disclosure is achieved by the following technical solution. The video feature extraction method proposed according to the present disclosure includes the following steps: performing frame extraction on a video object to obtain one or more frame images; performing multiple types of pooling on each of the frame images step by step to obtain an image feature of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and determining a video feature according to the image features of the one or more frame images.
The object of the present disclosure may be further achieved by the following technical measures.
In the foregoing video feature extraction method, performing multiple types of pooling on each of the frame images step by step includes: performing the multiple types of pooling step by step based on multiple color channels of the frame image.
In the foregoing video feature extraction method, performing multiple types of pooling on each of the frame images step by step to obtain the image feature of the frame image includes: determining a matrix according to the frame image, and using the multiple types of pooling to generate progressively smaller matrices step by step until the matrix is reduced to one containing only a single point, the image feature being determined from that single-point matrix.
In the foregoing video feature extraction method, performing multiple types of pooling on each of the frame images step by step to obtain the image feature of the frame image includes the following steps: (a) determining, according to one frame image, a first matrix having a first matrix dimension and a second matrix dimension, where a point in the first matrix corresponds to a pixel in the frame image, the value of a point in the first matrix is a first vector, and the first vector is a 3-dimensional vector representing the brightness of three color channels of the corresponding pixel; (b) setting a plurality of first blocks on the first matrix, each first block containing multiple first vectors, where the number of first blocks along the first matrix dimension is less than the number of points the first matrix contains along the first matrix dimension, and the number of first blocks along the second matrix dimension is less than the number of points the first matrix contains along the second matrix dimension, and, for each first block, computing the maximum, minimum and average of each dimension of the multiple first vectors contained in the block to obtain a 9-dimensional second vector; (c) determining a second matrix according to the second vectors corresponding to the first blocks, where a point in the second matrix corresponds to a first block and the value of a point in the second matrix is the corresponding second vector; (d) repeating steps (b) and (c) until the first matrix has been reduced to a single point whose value is a 3^N-dimensional vector, where N is a positive integer, and determining the 3^N-dimensional vector as the image feature of the frame image.
In the foregoing video feature extraction method, determining the video feature according to the image features of the one or more frame images includes: binarizing the image features to obtain binarized image features; and determining the video feature according to the binarized image features of the one or more frame images.
In the foregoing video feature extraction method, binarizing an image feature to obtain a binarized image feature includes the following steps: generating multiple groups according to the image feature, each group containing multiple elements of the image feature; summing the multiple elements in each group to obtain a summed value for each group; pairing the groups two by two to obtain multiple group pairs; for each group pair, comparing the magnitudes of the summed values of the two groups in the pair and generating one binarized image feature bit according to the comparison result; and determining the binarized image feature of the frame image according to the image feature bits of the multiple group pairs.
The object of the present disclosure is also achieved by the following technical solution. The video feature library construction method proposed according to the present disclosure includes the following steps: extracting the video feature of a video object according to any one of the foregoing video feature extraction methods; and storing the video feature into a video feature library.
The object of the present disclosure is also achieved by the following technical solution. The video feature extraction apparatus proposed according to the present disclosure includes: a frame extraction module configured to perform frame extraction on a video object to obtain one or more frame images; an image feature determination module configured to perform multiple types of pooling on each frame image step by step to obtain an image feature of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and a video feature determination module configured to determine a video feature according to the image features of the one or more frame images.
The object of the present disclosure may be further achieved by the following technical measures.
The foregoing video feature extraction apparatus further includes a module for performing the steps of any one of the foregoing video feature extraction methods.
The object of the present disclosure is also achieved by the following technical solution. The video feature library construction apparatus proposed according to the present disclosure includes: a video feature extraction module configured to extract the video feature of a video object according to any one of the foregoing video feature extraction methods; a video feature storage module configured to store the video feature into a video feature library; and the video feature library, configured to store the video feature.
The object of the present disclosure is also achieved by the following technical solution. A video feature extraction hardware device proposed according to the present disclosure includes: a memory configured to store non-transitory computer readable instructions; and a processor configured to run the computer readable instructions such that, when executed by the processor, they implement any one of the foregoing video feature extraction methods.
The object of the present disclosure is also achieved by the following technical solution. A computer readable storage medium proposed according to the present disclosure stores non-transitory computer readable instructions that, when executed by a computer, cause the computer to perform any one of the foregoing video feature extraction methods.
The object of the present disclosure is also achieved by the following technical solution. A terminal device proposed according to the present disclosure includes any one of the foregoing video feature extraction apparatuses.
The above description is only an overview of the technical solutions of the present disclosure. In order that the technical means of the present disclosure may be understood more clearly and implemented in accordance with the contents of the specification, and to make the above and other objects, features and advantages of the present disclosure more apparent and comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flow diagram of a video feature extraction method according to an embodiment of the present disclosure.
FIG. 2 is a schematic flow diagram of step-by-step multi-type pooling provided by an embodiment of the present disclosure.
FIG. 3 is a schematic flow diagram of binarizing image features by the random projection method provided by an embodiment of the present disclosure.
FIG. 4 is a schematic flow diagram of a specific example of extracting the image features of a frame image using the method of the present disclosure.
FIG. 5 is a schematic flow diagram of a video feature library construction method according to an embodiment of the present disclosure.
FIG. 6 is a structural block diagram of a video feature extraction apparatus according to an embodiment of the present disclosure.
FIG. 7 is a structural block diagram of a video feature library construction apparatus according to an embodiment of the present disclosure.
FIG. 8 is a hardware block diagram of a video feature extraction hardware device according to an embodiment of the present disclosure.
FIG. 9 is a schematic diagram of a computer readable storage medium according to an embodiment of the present disclosure.
FIG. 10 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To further explain the technical means adopted by the present disclosure to achieve the intended objects and their effects, specific implementations, structures, features and effects of the video feature extraction method and apparatus proposed according to the present disclosure are described in detail below with reference to the accompanying drawings and preferred embodiments.
FIG. 1 is a schematic flow diagram of an embodiment of the video feature extraction method of the present disclosure. Referring to FIG. 1, the video feature extraction method of the example of the present disclosure mainly includes the following steps:
Step S11: frame extraction is performed on the video object to obtain one or more frame images. Note that the type of the video object is not limited; it may be a video signal or a video file. Thereafter, the processing proceeds to step S12.
Step S12: multiple types of pooling are performed step by step on each frame image to obtain the image features of the frame image. Pooling is a dimensionality reduction method from the field of convolutional neural networks, and the multiple types of pooling referred to here include maximum pooling, minimum pooling and average pooling. Thereafter, the processing proceeds to step S13.
Specifically, the multiple types of pooling may be performed step by step based on multiple color channels of the frame image, so that the image features are obtained from the multiple color channels of the frame image.
Step S13: the video feature of the video object is determined according to the image features corresponding to the one or more frame images. Specifically, the image features may be combined in the chronological order of the frame images to obtain the video feature.
The video feature extraction method proposed by the present disclosure, by performing multiple types of pooling step by step on the frame images obtained by frame extraction from the video to generate video features, can greatly improve the accuracy and efficiency of video feature extraction and can improve the quality and robustness of the resulting video features.
In an embodiment of the present disclosure, performing multiple types of pooling on a frame image step by step includes: determining a matrix according to the frame image, and using the multiple types of pooling to generate progressively smaller matrices step by step until the matrix is reduced to one containing only a single point (the "points" of a matrix may also be called its "elements"); the image feature of the frame image is then determined from that single-point matrix.
FIG. 2 is a schematic flow diagram of the step-by-step multi-type pooling provided by an embodiment of the video feature extraction method of the present disclosure. Specifically, referring to FIG. 2, the step-by-step multi-type pooling in step S12 provided by an embodiment of the video feature extraction method of the present disclosure includes the following steps:
Step (a): a first matrix having a first matrix dimension and a second matrix dimension (in other words, having a length direction and a width direction) is determined according to a frame image. Suppose the frame image is x pixels long and y pixels wide, where x and y are positive integers. A point in the first matrix (points in a matrix may also be called elements, but to distinguish them from the elements of a vector, the elements of a matrix are referred to as "points" below) corresponds to a pixel in the frame image, so the first matrix has length x in the first matrix dimension and length y in the second matrix dimension (i.e., it is an x*y matrix); here, the length of a matrix in the first/second matrix dimension denotes the number of points the matrix contains along that dimension. The value of each point in the first matrix is a 3-dimensional vector, defined as the first vector, which represents the brightness of the three color channels of the corresponding pixel in the frame image. Note that when the color mode of the video object is the red-green-blue mode (RGB mode), the red, green and blue channels may be used; however, these three channels are not mandatory and may, for example, be chosen according to the color mode used by the video object, and the number of selected channels need not even be three; for instance, two of the three RGB channels may be selected. Thereafter, the processing proceeds to step (b).
Step (b): a plurality of first blocks are set on the first matrix (each block is in effect a pooling window, so a first block may also be called a first pooling window); say x1*y1 first blocks are set, where x1 and y1 are positive integers and each first block contains multiple points of the first matrix (that is, multiple first vectors). The number of first blocks along the first matrix dimension is less than the length of the first matrix in the first matrix dimension (that is, less than the number of points the first matrix contains along that dimension), and the number of first blocks along the second matrix dimension is less than the length of the first matrix in the second matrix dimension (that is, less than the number of points the first matrix contains along that dimension); in other words, x1 < x and y1 < y. For each first block, the maximum, minimum and average of each dimension of the multiple first vectors contained in the block are computed, yielding a 9-dimensional vector corresponding to the block, defined as the second vector. Note that the first blocks may partially overlap one another (that is, contain the same points) or not overlap at all. Thereafter, the processing proceeds to step (c).
Specifically, when setting the first blocks, the first matrix dimension of the first matrix may be divided uniformly into x1 segments of equal length, with adjacent segments containing some of the same points (partial overlap); in the same way, the second matrix dimension of the first matrix is divided into y1 segments, and combining the x1 segments with the y1 segments yields the x1*y1 first blocks of the first matrix.
Note that when every first block has the same size and the same spacing (two adjacent first blocks may overlap), the foregoing process of setting multiple first blocks on the first matrix and computing the second vector of each block is in fact equivalent to scanning (or sweeping) a single pooling window across the entire first matrix at a fixed stride and, at each scan position, computing the second vector of the region covered by the window.
Step (c): a second matrix is determined according to the x1*y1 first blocks and the second vector corresponding to each first block. A point in the second matrix corresponds to a first block; when x1*y1 first blocks are set, the second matrix has length x1 in the first matrix dimension and length y1 in the second matrix dimension (i.e., it is an x1*y1 matrix). The value of each point in the second matrix is the second vector of the corresponding first block. Thereafter, the processing proceeds to step (d).
Note that when determining the second matrix, the points of the second matrix must correspond to the first blocks in a definite order. As a specific example, the points of the second matrix may be arranged according to the positions of the respective first blocks within the first matrix.
Step (d): steps (b) and (c) are repeated: from the second matrix, which contains x1*y1 points each valued as a 9-dimensional vector, a third matrix is obtained containing x2*y2 points each valued as a 27-dimensional vector (where x2 is a positive integer less than x1 and y2 is a positive integer less than y1); from that third matrix, a fourth matrix is obtained containing x3*y3 points each valued as an 81-dimensional vector (where x3 is a positive integer less than x2 and y3 is a positive integer less than y2); and so on, until the first matrix (that is, the frame image) has been reduced to a 1*1 N-th matrix (in effect, the matrix has been reduced to a single point), where N is a positive integer. The N-th matrix contains only one point, whose value is a 3^N-dimensional vector; this 3^N-dimensional vector is determined as the image feature of the frame image.
Note that in step (d), each time blocks are set, they should be set in a manner appropriate to the size of the current matrix, so as to accommodate the step-by-step reduction of the matrix's first and second matrix dimensions.
Embodiments of the present disclosure may further include the following steps: binarizing the determined image feature to obtain a binarized image feature, which is a bit string of 0s and 1s; and determining the video feature according to the resulting binarized image features.
Binarizing the image features compresses the storage of the video features and accelerates the similarity computation of video comparison; in addition, binarization also benefits the index-library recall stage of video comparison.
Specifically, the random projection method may be used to convert an image feature into binarized form; this method is particularly suitable for binarizing image features in vector form. FIG. 3 is a schematic block diagram of binarizing image features by the random projection method provided by an embodiment of the video feature extraction method of the present disclosure. Referring to FIG. 3, the process of binarizing image features by the random projection method in the example of the present disclosure mainly includes the following steps:
Step S21: to generate a binarized image feature of length n, 2n groups are generated according to the image feature, each group containing multiple elements of the image feature (that is, each group contains the values of multiple dimensions of the image feature), where n is a positive integer. Thereafter, the processing proceeds to step S22.
Note that which elements a group contains is arbitrary, and two different groups may contain some of the same elements. However, to facilitate video comparison, the specific elements contained in each group may be preset, or the groups may be generated in the same way for multiple video objects.
In this example, every group contains the same number of elements; it should be noted, however, that the groups may in fact contain different numbers of elements.
Step S22: the multiple elements in each group are summed to obtain a summed value for each group. Thereafter, the processing proceeds to step S23.
Step S23: the 2n groups are paired two by two to obtain n group pairs. Thereafter, the processing proceeds to step S24. Specifically, the 2n groups may be sorted (or numbered) in advance, and each two adjacent groups paired.
Step S24: the n group pairs are each compared: the summed values of the two groups in each pair are compared, and one binarized image feature bit is generated according to the comparison result. Thereafter, the processing proceeds to step S25.
Specifically, in the example where the groups have been sorted (or numbered) in advance, within a group pair, if the summed value of the earlier-ranked group is greater than that of the later-ranked group, a binarized image feature bit of value 1 is generated; otherwise, a bit of value 0 is generated. Note that the way the binarized image feature bits are generated is not limited; for example, a bit of value 1 may instead be generated when the summed value of the earlier-ranked group is smaller than that of the later-ranked group.
Step S25: the n binarized image feature bits of the n group pairs are assembled into the binarized image feature of the frame image, of length n.
FIG. 4 is a schematic flow diagram of a specific process of extracting the image features of a frame image by the video feature extraction method of the present disclosure. Referring to FIG. 4, the steps of a specific example of extracting the image features of a frame image provided by an embodiment of the present disclosure are as follows:
Step S31: for a 243*243 frame image (243 pixels long, 243 pixels wide) sampled from the video object, each pixel has three channels, red, green and blue; in FIG. 4, I, II and III identify the red, green and blue channels respectively. A first matrix is defined according to the frame image: each point of the first matrix corresponds to the pixel at the same position in the frame image, and the value of each point is defined from the brightness values of the red, green and blue channels of that pixel, yielding a 243*243 first matrix whose points take 3-dimensional vector values.
Step S32: a 13*13 matrix block (the matrix block may also be called a pooling window) is swept across the first matrix. The maximum, minimum and average of each dimension of the 13*13 points covered by the block (in effect, the brightnesses of the three color channels) are taken, yielding a 9-dimensional vector. The block moves 3 points at a time along the length or width direction of the first matrix, sweeping over all points in turn and computing the per-dimension maximum, minimum and average of the points it covers. After the entire first matrix has been processed, an 81*81 second matrix is obtained whose points take 9-dimensional vector values.
Step S33: step S32 is repeated: a 10*10 matrix block is swept across the second matrix with a stride of 3 points, yielding a 27*27 third matrix whose points take 27-dimensional vector values; a 6*6 matrix block is swept across the third matrix with a stride of 2 points, yielding a 9*9 fourth matrix whose points take 81-dimensional vector values; and so on, until a 1*1 single-point matrix is obtained. The point contained in this single-point matrix takes the value of a 729-dimensional vector, which is defined as the pooled vector.
Step S34: the pooled vector is binarized by the random projection method to obtain the binarized image feature of the frame image.
FIG. 5 is a schematic flow diagram of an embodiment of the video feature library construction method of the present disclosure. Referring to FIG. 5, the video feature library construction method of the example of the present disclosure mainly includes the following steps:
Step S41: the video feature of the video object is extracted according to the steps of the foregoing video feature extraction method of the example of the present disclosure. Thereafter, the processing proceeds to step S42.
Step S42: the video feature of the video object is stored into the video feature library.
Note that the video features in a given video feature library should be obtained by the same feature extraction method; that is, during the video feature extraction of step S41, frame extraction is performed in the same manner in step S11, the frame images undergo the multiple types of pooling step by step in the same manner in step S12, and the image features are assembled into video features in the same manner in step S13. In addition, the video feature library may be updated at any time as time goes on.
FIG. 6 is a schematic structural block diagram of an embodiment of the video feature extraction apparatus of the present disclosure. Referring to FIG. 6, the video feature extraction apparatus 100 of the example of the present disclosure mainly includes:
a frame extraction module 110, configured to perform frame extraction on the video object to obtain one or more frame images;
an image feature determination module 120, configured to perform multiple types of pooling step by step on each frame image to obtain the image features of the frame image, where the multiple types of pooling include maximum pooling, minimum pooling and average pooling; and
a video feature determination module 130, configured to determine the video feature according to the image features corresponding to the one or more frame images.
Specifically, the image feature determination module 120 is further configured to perform the multi-type pooling on the frame images step by step according to the specific steps shown in the foregoing embodiments of the video feature extraction method of the present disclosure.
Further, the video feature extraction apparatus 100 of the example of the present disclosure also includes a binarization module (not shown), configured to binarize the image features according to the specific steps shown in the foregoing embodiments of the video feature extraction method of the present disclosure. In this case, the video feature determination module 130 is configured to determine the video feature according to the binarized image features.
FIG. 7 is a schematic structural diagram of an embodiment of the video feature library construction apparatus of the present disclosure. Referring to FIG. 7, the video feature library construction apparatus 200 of the example of the present disclosure mainly includes:
a video feature extraction module 201, which includes the frame extraction module 110, image feature determination module 120 and video feature determination module 130 of the foregoing video feature extraction apparatus of the example of the present disclosure, may include the binarization module, and is configured to extract the video feature of the video object according to the steps of the foregoing video feature extraction method of the example of the present disclosure;
a video feature storage module 202, configured to store the video feature into the video feature library; and
a video feature library 203, configured to store the video features of the respective video objects.
FIG. 8 is a hardware block diagram illustrating a video feature extraction hardware device according to an embodiment of the present disclosure. As shown in FIG. 8, the video feature extraction hardware device 300 according to the embodiment of the present disclosure includes a memory 301 and a processor 302. The components of the video feature extraction hardware device 300 are interconnected by a bus system and/or other forms of connection mechanism (not shown).
The memory 301 is configured to store non-transitory computer readable instructions. Specifically, the memory 301 may include one or more computer program products, which may include various forms of computer readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory; the non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory and the like.
The processor 302 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components of the video feature extraction hardware device 300 to perform desired functions. In an embodiment of the present disclosure, the processor 302 is configured to run the computer readable instructions stored in the memory 301 so that the video feature extraction hardware device 300 performs all or some of the steps of the video feature extraction methods of the foregoing embodiments of the present disclosure.
FIG. 9 is a schematic diagram illustrating a computer readable storage medium according to an embodiment of the present disclosure. As shown in FIG. 9, the computer readable storage medium 400 according to the embodiment of the present disclosure has non-transitory computer readable instructions 401 stored thereon. When the non-transitory computer readable instructions 401 are run by a processor, all or some of the steps of the video feature extraction methods of the foregoing embodiments of the present disclosure are performed.
FIG. 10 is a schematic diagram illustrating the hardware structure of a terminal device according to an embodiment of the present disclosure. The terminal device may be implemented in various forms; the terminal device in the present disclosure may include, but is not limited to, mobile terminal devices such as mobile phones, smartphones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), navigation devices, in-vehicle terminal devices, in-vehicle display terminals and in-vehicle electronic rearview mirrors, as well as fixed terminal devices such as digital TVs and desktop computers.
As shown in FIG. 10, the terminal device 1100 may include a wireless communication unit 1110, an A/V (audio/video) input unit 1120, a user input unit 1130, a sensing unit 1140, an output unit 1150, a memory 1160, an interface unit 1170, a controller 1180, a power supply unit 1190 and the like. FIG. 10 shows a terminal device with various components, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
The wireless communication unit 1110 allows radio communication between the terminal device 1100 and a wireless communication system or network. The A/V input unit 1120 is configured to receive audio or video signals. The user input unit 1130 may generate key input data according to commands input by the user to control various operations of the terminal device. The sensing unit 1140 detects the current state of the terminal device 1100, the position of the terminal device 1100, the presence or absence of the user's touch input to the terminal device 1100, the orientation of the terminal device 1100, the acceleration or deceleration and direction of movement of the terminal device 1100 and the like, and generates commands or signals for controlling the operation of the terminal device 1100. The interface unit 1170 serves as an interface through which at least one external device can connect to the terminal device 1100. The output unit 1150 is configured to provide output signals in a visual, audio and/or tactile manner. The memory 1160 may store software programs for the processing and control operations performed by the controller 1180, or may temporarily store data that has been output or is to be output. The memory 1160 may include at least one type of storage medium. Moreover, the terminal device 1100 may cooperate with a network storage device that performs the storage function of the memory 1160 through a network connection. The controller 1180 generally controls the overall operation of the terminal device. In addition, the controller 1180 may include a multimedia module for reproducing or playing back multimedia data. The controller 1180 may perform pattern recognition processing to recognize handwriting input or picture-drawing input performed on the touch screen as characters or images. The power supply unit 1190 receives external or internal power under the control of the controller 1180 and supplies the appropriate power required to operate the respective elements and components.
The various implementations of the video feature extraction method proposed by the present disclosure may be implemented in a computer readable medium using, for example, computer software, hardware or any combination thereof. For hardware implementation, the various implementations of the video feature extraction method proposed by the present disclosure may be implemented by using at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor or an electronic unit designed to perform the functions described herein; in some cases, such implementations may be realized in the controller 1180. For software implementation, the various implementations of the video feature extraction method proposed by the present disclosure may be implemented with separate software modules that allow at least one function or operation to be performed. The software code may be implemented as a software application (or program) written in any suitable programming language, and may be stored in the memory 1160 and executed by the controller 1180.
In summary, the video feature extraction method, apparatus, hardware device, computer readable storage medium and terminal device according to the embodiments of the present disclosure, by performing multiple types of pooling step by step on the frame images obtained by frame extraction from a video to generate video features, can greatly improve the accuracy and efficiency of video feature extraction and improve the quality and robustness of the resulting video features, so that video comparison, video retrieval, video deduplication and video content monitoring based on the video features obtained by the video feature extraction method of the present disclosure achieve higher accuracy, higher efficiency and better robustness.
The basic principles of the present disclosure have been described above in connection with specific embodiments. However, it should be pointed out that the advantages, strengths, effects and the like mentioned in the present disclosure are merely examples and not limitations, and must not be regarded as necessarily possessed by every embodiment of the present disclosure. In addition, the specific details disclosed above serve only as examples and as aids to understanding, not as limitations; the above details do not restrict the present disclosure to being implemented using those specific details.
The block diagrams of devices, apparatuses, equipment and systems referred to in the present disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements and configurations must be made in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including", "comprising" and "having" are open-ended words meaning "including but not limited to", and may be used interchangeably with it. The words "or" and "and" as used herein mean the word "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The word "such as" as used herein means the phrase "such as but not limited to" and may be used interchangeably with it.
In addition, as used herein, "or" used in an enumeration of items beginning with "at least one" indicates a disjunctive enumeration, so that, for example, an enumeration of "at least one of A, B or C" means A or B or C, or AB or AC or BC, or ABC (that is, A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
It should also be pointed out that, in the systems and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations shall be regarded as equivalent solutions of the present disclosure.
Various changes, substitutions and alterations to the techniques described herein may be made without departing from the teachings of the technology defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of events, means, methods and acts described above. Processes, machines, manufacture, compositions of events, means, methods or acts that currently exist or are later to be developed and that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims include within their scope such processes, machines, manufacture, compositions of events, means, methods or acts.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above description has been given for the purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions and sub-combinations thereof.

Claims (13)

  1. A video feature extraction method, the method comprising:
    performing frame extraction on a video object to obtain one or more frame images;
    performing multiple types of pooling on each of the frame images step by step to obtain an image feature of the frame image, wherein the multiple types of pooling comprise maximum pooling, minimum pooling and average pooling; and
    determining a video feature according to the image features of the one or more frame images.
  2. The video feature extraction method according to claim 1, wherein performing multiple types of pooling on each of the frame images step by step comprises:
    performing the multiple types of pooling step by step based on multiple color channels of the frame image.
  3. The video feature extraction method according to claim 1, wherein performing multiple types of pooling on each of the frame images step by step to obtain the image feature of the frame image comprises:
    determining a matrix according to the frame image, and using the multiple types of pooling to generate smaller matrices step by step until the matrix is reduced to one containing only a single point, the image feature being determined according to the matrix containing only a single point.
  4. The video feature extraction method according to claim 3, wherein determining a matrix according to the frame image, using the multiple types of pooling to generate smaller matrices step by step until the matrix is reduced to one containing only a single point, and determining the image feature according to the matrix containing only a single point comprises the following steps:
    (a) determining, according to one of the frame images, a first matrix having a first matrix dimension and a second matrix dimension, wherein a point in the first matrix corresponds to a pixel in the frame image, the value of a point in the first matrix is a first vector, and the first vector is a 3-dimensional vector representing the brightness of three color channels of the corresponding pixel;
    (b) setting a plurality of first blocks on the first matrix, each of the first blocks containing a plurality of the first vectors, wherein the number of the first blocks along the first matrix dimension is less than the number of points the first matrix contains along the first matrix dimension, and the number of the first blocks along the second matrix dimension is less than the number of points the first matrix contains along the second matrix dimension; and, for each of the first blocks, computing the maximum, minimum and average of each dimension of the plurality of first vectors contained in the first block to obtain a 9-dimensional second vector;
    (c) determining a second matrix according to the second vectors corresponding to the plurality of first blocks, wherein a point in the second matrix corresponds to a first block and the value of a point in the second matrix is the second vector; and
    (d) repeating steps (b) and (c) until the first matrix has been reduced to a single point whose value is a 3^N-dimensional vector, wherein N is a positive integer; and determining the 3^N-dimensional vector as the image feature of the frame image.
  5. The video feature extraction method according to claim 1, wherein determining a video feature according to the image features of the one or more frame images comprises:
    binarizing the image features to obtain binarized image features; and
    determining the video feature according to the binarized image features of the one or more frame images.
  6. The video feature extraction method according to claim 5, wherein binarizing the image feature to obtain a binarized image feature comprises the following steps:
    generating a plurality of groups according to the image feature, each of the groups containing a plurality of elements of the image feature;
    summing the plurality of elements in each of the groups to obtain a summed value for each of the groups;
    pairing the plurality of groups two by two to obtain a plurality of group pairs;
    for each of the group pairs, comparing the magnitudes of the summed values of the two groups in the pair, and generating one binarized image feature bit according to the comparison result; and
    determining the binarized image feature of the frame image according to the image feature bits of the plurality of group pairs.
  7. A video feature library construction method, the method comprising:
    extracting a video feature of a video object according to the video feature extraction method of any one of claims 1 to 6; and
    storing the video feature into a video feature library.
  8. A video feature extraction apparatus, the apparatus comprising:
    a frame extraction module configured to perform frame extraction on a video object to obtain one or more frame images;
    an image feature determination module configured to perform multiple types of pooling on each of the frame images step by step to obtain an image feature of the frame image, wherein the multiple types of pooling comprise maximum pooling, minimum pooling and average pooling; and
    a video feature determination module configured to determine a video feature according to the image features of the one or more frame images.
  9. The video feature extraction apparatus according to claim 8, further comprising a module for performing the steps of any one of claims 2 to 6.
  10. A video feature library construction apparatus, the apparatus comprising:
    a video feature extraction module configured to extract a video feature of a video object according to the video feature extraction method of any one of claims 1 to 6;
    a video feature storage module configured to store the video feature into a video feature library; and
    a video feature library configured to store the video feature.
  11. A video feature extraction hardware device, comprising:
    a memory configured to store non-transitory computer readable instructions; and
    a processor configured to run the computer readable instructions such that, when executed by the processor, they implement the video feature extraction method according to any one of claims 1 to 6.
  12. A computer readable storage medium storing non-transitory computer readable instructions that, when executed by a computer, cause the computer to perform the video feature extraction method according to any one of claims 1 to 6.
  13. A terminal device, comprising the video feature extraction apparatus according to claim 8 or 9.
PCT/CN2018/125496 2018-03-29 2018-12-29 Video feature extraction method and apparatus WO2019184520A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/971,760 US11455802B2 (en) 2018-03-29 2018-12-29 Video feature extraction method and device
JP2020545849A JP6982194B2 (ja) 2018-03-29 2018-12-29 ビデオ特徴の抽出方法および装置
SG11202008272RA SG11202008272RA (en) 2018-03-29 2018-12-29 Video feature extraction method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810271774.6A 2018-03-29 2018-03-29 Video feature extraction method and apparatus
CN201810271774.6 2018-03-29

Publications (1)

Publication Number Publication Date
WO2019184520A1 (zh)

Family

ID=68062443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/125496 WO2019184520A1 (zh) 2018-03-29 2018-12-29 一种视频特征提取方法及装置

Country Status (5)

Country Link
US (1) US11455802B2 (zh)
JP (1) JP6982194B2 (zh)
CN (1) CN110321759B (zh)
SG (1) SG11202008272RA (zh)
WO (1) WO2019184520A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807769B * 2019-10-30 2021-12-14 腾讯科技(深圳)有限公司 Image display control method and apparatus
CN111369472B * 2020-03-12 2021-04-23 北京字节跳动网络技术有限公司 Image dehazing method and apparatus, electronic device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005697A2 (en) * 2001-07-06 2003-01-16 Scopus Network Technologies Ltd. System and method for the application of a statistical multiplexing algorithm for video encoding
CN106295605A * 2016-08-18 2017-01-04 宁波傲视智绘光电科技有限公司 Traffic light detection and recognition method
CN106649663A * 2016-12-14 2017-05-10 大连理工大学 Video copy detection method based on compact video representation
CN107169415A * 2017-04-13 2017-09-15 西安电子科技大学 Human action recognition method based on convolutional neural network feature coding
CN107491748A * 2017-08-09 2017-12-19 电子科技大学 Video-based target vehicle extraction method

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396621B2 (en) * 2012-03-23 2016-07-19 International Business Machines Corporation Systems and methods for false alarm reduction during event detection
JP6211407B2 * 2013-12-06 2017-10-11 株式会社デンソーアイティーラボラトリ Image search system, image search device, search server device, image search method, and image search program
US9432702B2 (en) * 2014-07-07 2016-08-30 TCL Research America Inc. System and method for video program recognition
KR20170128454A (ko) 2015-03-11 2017-11-22 지멘스 악티엔게젤샤프트 세포 이미지들 및 비디오들의 디컨볼루셔널 네트워크 기반 분류를 위한 시스템들 및 방법들
US10068138B2 (en) * 2015-09-17 2018-09-04 Canon Kabushiki Kaisha Devices, systems, and methods for generating a temporal-adaptive representation for video-event classification
CN105574215B 2016-03-04 2019-11-12 哈尔滨工业大学深圳研究生院 Instance-level image search method based on multi-layer feature representation
JP6525912B2 * 2016-03-23 2019-06-05 富士フイルム株式会社 Image classification apparatus, method and program
US10390082B2 (en) * 2016-04-01 2019-08-20 Oath Inc. Computerized system and method for automatically detecting and rendering highlights from streaming videos
BR102016007265B1 (pt) * 2016-04-01 2022-11-16 Samsung Eletrônica da Amazônia Ltda. Método multimodal e em tempo real para filtragem de conteúdo sensível
US10803318B1 (en) * 2016-05-18 2020-10-13 Educational Testing Service Automated scoring of video clips using extracted physiological features
US10681391B2 (en) * 2016-07-13 2020-06-09 Oath Inc. Computerized system and method for automatic highlight detection from live streaming media and rendering within a specialized media player
JP6612196B2 * 2016-07-27 2019-11-27 日本システムウエア株式会社 Rock mass strength determination device, rock mass strength determination method, and rock mass strength determination program
CN107092960A 2017-04-17 2017-08-25 中国民航大学 Improved parallel-channel convolutional neural network training method
CN107247949B 2017-08-02 2020-06-19 智慧眼科技股份有限公司 Face recognition method and apparatus based on deep learning, and electronic device
CN107564009B 2017-08-30 2021-02-05 电子科技大学 Outdoor scene multi-target segmentation method based on deep convolutional neural network
CN107844766A 2017-10-31 2018-03-27 北京小米移动软件有限公司 Method, apparatus and device for obtaining the blur degree of a face image
US10552671B2 (en) * 2017-11-22 2020-02-04 King Fahd University Of Petroleum And Minerals Multi-kernel fuzzy local Gabor feature extraction method for automatic gait recognition
CN110324660B * 2018-03-29 2021-01-19 北京字节跳动网络技术有限公司 Method and device for determining duplicate videos
US20200258616A1 (en) * 2019-02-07 2020-08-13 The Regents Of The University Of Michigan Automated identification and grading of intraoperative quality

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003005697A2 (en) * 2001-07-06 2003-01-16 Scopus Network Technologies Ltd. System and method for the application of a statistical multiplexing algorithm for video encoding
CN106295605A * 2016-08-18 2017-01-04 宁波傲视智绘光电科技有限公司 Traffic light detection and recognition method
CN106649663A * 2016-12-14 2017-05-10 大连理工大学 Video copy detection method based on compact video representation
CN107169415A * 2017-04-13 2017-09-15 西安电子科技大学 Human action recognition method based on convolutional neural network feature coding
CN107491748A * 2017-08-09 2017-12-19 电子科技大学 Video-based target vehicle extraction method

Also Published As

Publication number Publication date
JP2021504855A (ja) 2021-02-15
US20210089785A1 (en) 2021-03-25
JP6982194B2 (ja) 2021-12-17
US11455802B2 (en) 2022-09-27
CN110321759A (zh) 2019-10-11
CN110321759B (zh) 2020-07-07
SG11202008272RA (en) 2020-09-29

Similar Documents

Publication Publication Date Title
WO2019184522A1 Method and device for determining duplicate videos
US10990827B2 Imported video analysis device and method
WO2021237570A1 Image review method and apparatus, device, and storage medium
EP4030749A1 Image photographing method and apparatus
WO2017107855A1 Picture search method and apparatus
CN111666442B Image retrieval method and apparatus, and computer device
WO2019184520A1 Video feature extraction method and apparatus
WO2019184517A1 Audio fingerprint extraction method and device
CN111274446A Video processing method and related apparatus
US11593582B2 Method and device for comparing media features
CN111428740A Method and apparatus for detecting recaptured network photos, computer device and storage medium
US11874869B2 Media retrieval method and apparatus
CN108549702B Method for cleaning the picture library of a mobile terminal, and mobile terminal
WO2019184521A1 Video feature extraction method and apparatus
CN115082999A Group photo person analysis method and apparatus, computer device and storage medium
CN112100412B Picture retrieval method and apparatus, computer device and storage medium
CN115269494A Data archiving method and device
CN112036501A Convolutional neural network-based picture similarity detection method and related device
CN110941589A Picture export method and apparatus, electronic device and readable storage medium
CN110717362B Method for establishing a feature tree structure of a digital image and image object recognition method
TWI684919B Method for establishing a feature tree structure of a digital image and image object recognition method
US10108636B2 Data deduplication method
CN117238017A Face recognition method and apparatus, computer device and storage medium
Ma et al. VMambaCC: A Visual State Space Model for Crowd Counting
US9075847B2 Methods, apparatus and system for identifying a document

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 18911660; country of ref document: EP; kind code of ref document: A1)
ENP Entry into the national phase (ref document number: 2020545849; country of ref document: JP; kind code of ref document: A)
NENP Non-entry into the national phase (ref country code: DE)
122 Ep: PCT application non-entry in European phase (ref document number: 18911660; country of ref document: EP; kind code of ref document: A1)