WO2024109138A1 - Video encoding method, device and storage medium - Google Patents

Video encoding method, device and storage medium

Info

Publication number
WO2024109138A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame image
video
bit rate
target
training
Prior art date
Application number
PCT/CN2023/109319
Other languages
English (en)
French (fr)
Inventor
曲建峰
Original Assignee
行吟信息科技(武汉)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 行吟信息科技(武汉)有限公司 filed Critical 行吟信息科技(武汉)有限公司
Publication of WO2024109138A1 publication Critical patent/WO2024109138A1/zh

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103 Selection of coding mode or of prediction mode
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/137 Motion inside a coding unit, e.g. average field, frame or block difference
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion

Definitions

  • the present application relates to the field of computer technology, and in particular to a video encoding method, device and storage medium.
  • Bit rate control is a method of controlling the size of a video file and the quality of the video image by determining how many bits are allocated to each frame of the video.
  • the bit rate control method commonly used in video encoding is the constant rate factor (CRF), which keeps the image quality of each frame of the video constant while the bit rate (the number of bits transmitted per unit time during data transmission) varies.
  • CRF mainly selects an image quality parameter and an image resolution (also known as the selected original code point) according to the average bit rate and average image quality of the video, and encodes the video accordingly.
  • an embodiment of the present application provides a video encoding method, including:
  • if the classification result indicates that, when the image quality of each frame image reaches the preset image quality, there exists a bit rate less than the initial bit rate of each frame image, determining the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image; wherein the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; the target bit rate of each frame image is less than the initial bit rate of each frame image;
  • the target video is encoded according to the target bit rate of each frame image to obtain an encoded target video.
  • an embodiment of the present application provides a video encoding device, the video encoding device comprising a processing unit and an encoding unit, wherein:
  • the processing unit is used to perform feature extraction processing on each frame image in the target video to obtain texture features of each frame image and motion features of each frame image;
  • the processing unit is further used to determine the initial bit rate of each frame image based on the preset image quality of the target video and the preset image resolution of the target video;
  • the processing unit is further used to classify the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result; wherein the classification result is used to indicate whether there is a bit rate less than an initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality;
  • the processing unit is further configured to determine, if the classification result indicates that when the image quality of each frame image reaches the preset image quality, a bit rate less than an initial bit rate of each frame image exists, a target bit rate of each frame image according to a texture feature of each frame image and a motion feature of each frame image; wherein the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; and the target bit rate of each frame image is less than the initial bit rate of each frame image;
  • the encoding unit is further used to encode the target video according to the target bit rate of each frame image to obtain an encoded target video.
  • an embodiment of the present application provides an electronic device, the electronic device comprising an input interface and an output interface, and further comprising:
  • a processor adapted to implement one or more instructions; and
  • a computer storage medium storing one or more instructions, wherein the one or more instructions are suitable for being loaded by the processor and executing the above-mentioned video encoding method.
  • an embodiment of the present application provides a computer storage medium, in which computer program instructions are stored.
  • when the computer program instructions are executed by a processor, they are used to execute the above-mentioned video encoding method.
  • an embodiment of the present application provides a computer program product or a computer program, wherein the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; a processor of an electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions.
  • when the computer instructions are executed by the processor, they are used to execute the above-mentioned video encoding method.
  • the texture features and motion features of each frame image are first used to determine whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and to determine whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can be further used to obtain the target bit rate of each frame image that can ensure that the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the embodiment of the present application will further use the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
  • FIG1 is a schematic diagram of the structure of a video encoding system provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of a flow chart of a video encoding method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the relationship between image quality and bit rate provided in an embodiment of the present application.
  • FIG4 is another schematic diagram of the relationship between image quality and bit rate provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a flow chart of another video encoding method provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a video encoding process provided by an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of a video encoding device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • AI: artificial intelligence
  • the so-called artificial intelligence technology refers to the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science; it mainly produces a new intelligent machine that can respond in a similar way to human intelligence by understanding the essence of intelligence, so that the intelligent machine has multiple functions such as perception, reasoning and decision-making.
  • AI technology is a comprehensive discipline, which mainly includes several major directions such as computer vision technology (Computer Vision, CV), speech processing technology, natural language processing technology and machine learning (Machine Learning, ML)/deep learning.
  • machine learning is a multi-disciplinary cross-discipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of AI and the fundamental way to make computer devices intelligent.
  • Deep learning is a technology that uses deep neural network systems to perform machine learning; machine learning/deep learning can usually include artificial neural networks, reinforcement learning (RL), supervised learning, unsupervised learning, contrastive learning and other technologies; the so-called supervised learning refers to the processing method of using training samples with known categories (with labeled categories) for model training, and unsupervised learning refers to the processing method of using training samples with unknown categories (not labeled) for model training.
  • RL: reinforcement learning
  • a video is a continuous sequence of images, consisting of consecutive image frames.
  • Video encoding is a way of converting files in the original video format into files in another video format through compression technology. Due to the visual persistence effect of the human eye, when the frame sequence is played at a certain rate, what we see is a video with continuous action. The similarity between continuous images is very high, so in order to facilitate storage and transmission, the original video can be encoded and compressed to remove the redundancy of the spatial and temporal dimensions in the video.
  • bit rate refers to the number of data bits transmitted per unit time.
  • the bit rate of each frame in the video is related to the data volume of the video. Due to different video compression ratios, the video data volume will also be different. Generally speaking, the greater the video compression ratio, the smaller the video data volume, and the more convenient it is to transmit; but the greater the video compression ratio, the worse the video image quality. Therefore, the image quality and data volume of the video can be balanced by controlling the bit rate of each frame in the video.
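As a concrete illustration of this relationship (an illustrative sketch, not part of the patent), the data volume of an encoded stream can be approximated from its average bit rate and duration:

```python
def encoded_size_bytes(bit_rate_bps: float, duration_s: float) -> float:
    """Approximate data volume of an encoded stream: bits per second
    times seconds, divided by 8 to get bytes. Container overhead and
    audio tracks are ignored in this sketch."""
    return bit_rate_bps * duration_s / 8.0

# A 10-second clip encoded at an average of 2 Mbps occupies roughly 2.5 MB.
clip_size = encoded_size_bytes(2_000_000, 10)
```

Halving the average bit rate thus halves the data volume, which is why finding the lowest rate that still meets the preset image quality directly benefits storage and transmission.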
  • the embodiment of the present application provides a video encoding scheme, which classifies the target video according to the texture features and motion features of each frame image in the target video to determine whether there is a bit rate lower than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality.
  • the target bit rate of each frame image can be obtained according to the texture characteristics and motion characteristics of each frame image, so as to encode the target video according to the target bit rate of each frame image to obtain the encoded target video.
  • the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; the target bit rate of each frame image is less than the initial bit rate of each frame image.
  • this scheme first determines, through the texture features and motion features of each frame image, whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, i.e., whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can be further used to obtain a target bit rate for each frame image that ensures the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the scheme then uses the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image and encodes the video at that rate, thereby reducing the data volume of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
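The per-frame rate decision described above can be sketched as follows; `classify` and `predict_target_rate` are hypothetical stand-ins for the classification and rate-estimation steps, not implementations from the patent:

```python
def plan_frame_bit_rate(texture, motion, initial_rate, classify, predict_target_rate):
    """Sketch of the scheme's decision flow: only when the classifier says a
    rate below `initial_rate` can still reach the preset quality is a lower
    target rate predicted; otherwise the initial rate is kept."""
    if classify(texture, motion, initial_rate):   # headroom below the initial rate exists
        # the target rate must not exceed the initial rate, so clamp defensively
        return min(predict_target_rate(texture, motion), initial_rate)
    return initial_rate                           # no headroom: encode at the initial rate
```

For example, with a classifier that reports headroom and a predictor that returns 60% of the initial rate, the frame is encoded at the lower rate; with a classifier that reports none, the initial rate is kept.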
  • the initial bit rate of each frame image is obtained based on the preset image quality of the target video and the preset image resolution of the target video.
  • the preset image quality refers to the image quality of the target video preset by humans or electronic devices.
  • the electronic device may refer to the terminal device or server hereinafter.
  • the electronic device may analyze the transmission requirements and/or storage requirements of the target video and set a reasonable preset image quality. It may also be that the electronic device directly sets a preset image quality that can meet the viewing requirements of the human eye.
  • the transmission requirement of the target video is that the transmission time is less than 5 seconds
  • the storage requirement is that the storage space is less than 800kB.
  • the electronic device can analyze and obtain the preset image quality that meets the conditions according to the original data volume of the target video, the transmission time is less than 5 seconds, and the storage space is less than 800kB.
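The kind of analysis described here can be sketched as follows, assuming the transmission budget is expressed as a link speed times a maximum transfer time; the function and its parameters are illustrative, not taken from the patent:

```python
def max_average_bit_rate(duration_s: float, storage_cap_bytes: float,
                         link_bps: float, max_transfer_s: float) -> float:
    """Tightest average bit rate (bps) that satisfies both a storage cap
    and a transfer-time budget for a clip of the given duration."""
    # total encoded size is bounded by storage and by what the link can move in time
    size_cap_bits = min(storage_cap_bytes * 8, link_bps * max_transfer_s)
    return size_cap_bits / duration_s

# A 10 s clip with an 800 kB storage cap on a 4 Mbps link with a 5 s
# transfer budget: min(6.4e6, 2.0e7) bits / 10 s = 640 kbps.
rate_cap = max_average_bit_rate(10, 800_000, 4_000_000, 5)
```

A preset image quality would then be chosen so that the resulting average bit rate stays under this cap.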
  • CRF: constant rate factor
  • the value range of CRF is 0 to 51, where 0 is a lossless mode, i.e., the image quality of the encoded video is controlled to be the original image quality of the video; and the larger the value of CRF, the worse the image quality of the encoded video. Therefore, the preset image quality can specifically be the image quality indicated by the preset CRF value.
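As a hedged illustration of how a preset CRF value might be handed to an encoder, the following builds an ffmpeg command line with a range check; the use of ffmpeg and libx264 here is an assumption for illustration, not something the patent specifies:

```python
def crf_encode_command(src: str, dst: str, crf: int = 23) -> list:
    """Build (but do not run) an ffmpeg command encoding `src` at a constant
    rate factor. ffmpeg/libx264 are assumed purely for illustration."""
    if not 0 <= crf <= 51:
        raise ValueError("CRF must lie in [0, 51]; 0 is lossless")
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]

cmd = crf_encode_command("input.mp4", "output.mp4", crf=18)
```

Lower CRF values trade larger files for better image quality, matching the text's note that a larger CRF means worse quality.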
  • the preset image resolution refers to the image resolution of the target video preset by humans or electronic devices.
  • the electronic device can obtain the preset image resolution after analyzing the display requirements of the target video.
  • the maximum display resolution at which the electronic device can display the target video is 720P
  • the display requirement of the target video is an image resolution less than or equal to 720P
  • the electronic device can set the preset image resolution to 720P, 540P and 360P.
  • the image resolution of a video refers to the number of pixels in each frame of the video.
  • the resolution of 720P is 1280*720 pixels (P stands for progressive scanning; the number before the P indicates the number of rows of pixels in the frame)
  • the resolution of 1080P is 1920*1080 pixels
  • the resolution of 2K is 2560*1440 pixels.
  • the same video can have one or more preset image resolutions.
  • texture features are visual features that reflect homogeneous phenomena in an image; they reflect the slowly changing or periodically changing surface structure and organization of objects.
  • texture features can be represented by the grayscale distribution of pixels and their surrounding spatial domains. According to the grayscale range, they can be divided into local texture features and global texture features.
  • Motion features include the motion trajectory and motion vector of the moving object in each frame image.
  • the video encoding system shown in FIG1 may include multiple terminal devices 101 and multiple servers 102, wherein a communication connection is established between any terminal device and any server.
  • the terminal device 101 may include any one or more of a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart car-mounted device, and a smart wearable device.
  • APPs: applications
  • the server 102 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
  • the terminal device 101 and the server 102 may be directly or indirectly connected in communication via wired or wireless communication, and this application does not limit this.
  • the above-mentioned video encoding method can be executed only by the terminal device 101 in the video encoding system shown in FIG. 1, and the specific execution process is as follows: the terminal device 101 can first perform feature extraction processing on each frame image in the target video to obtain the texture features of each frame image and the motion features of each frame image; at the same time, the terminal device 101 will also determine the initial bit rate of each frame image according to the preset image quality of the target video and the preset image resolution of the target video.
  • the terminal device 101 will classify the target video according to the texture features of each frame image and the motion features of each frame image, and obtain a classification result for indicating whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality. If the classification result indicates that there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, the terminal device 101 will determine the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image. Afterwards, the terminal device 101 can encode the target video according to the target bit rate of each frame image to obtain the encoded target video. Optionally, the terminal device 101 may also transmit the encoded target video to the server 102 , so that the server 102 may transmit the encoded target video to other terminal devices.
  • the above-mentioned video encoding method can also be executed only by the server 102 in the video encoding system shown in FIG1 .
  • the specific execution process can refer to the specific execution process of the above-mentioned terminal device 101 during video encoding, which will not be repeated here.
  • the terminal device 101 can also transmit the encoded target video to the server 102, and then the server 102 can decode the encoded target video to obtain a decoded video; finally, the server 102 can display the decoded video or transmit the decoded video to other terminal devices.
  • the video encoding method can be run in a video encoding system, and the video encoding system can include a terminal device and a server.
  • the video encoding method can be jointly performed by the terminal device 101 and the server 102 included in the video encoding system shown in FIG. 1, and the specific execution process is: the terminal device 101 collects the preset image quality of the target video and the preset image resolution of the target video, and then uploads the collected preset image quality and preset image resolution of the target video to the server 102.
  • the server 102 will first perform feature extraction processing on each frame image in the target video to obtain the texture features of each frame image and the motion features of each frame image; at the same time, the server 102 will also determine the initial bit rate of each frame image through the preset image quality of the target video and the preset image resolution of the target video. Afterwards, the server 102 can classify the target video according to the texture features of each frame image and the motion features of each frame image, and obtain a classification result for indicating whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality.
  • the server 102 will determine the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image. Finally, the server 102 can encode the target video according to the target bit rate of each frame image to obtain the encoded target video.
  • the server 102 may also transmit the encoded target video to the terminal device 101, so that the terminal device 101 can decode the encoded target video and then display the decoded target video.
  • the terminal device 101 may first display a video selection interface, wherein the video selection interface includes multiple video identifiers, and different video identifiers indicate different videos.
  • the video selection interface may also include multiple video thumbnails, wherein one video thumbnail corresponds to a video identifier.
  • the video thumbnail is equivalent to the cover of the video; the video identifier may be a character or character segment that can identify the video, such as a serial number or a video name, which is not limited here.
  • the terminal device 101 can respond to the selection operation of multiple video identifiers in the video selection interface to obtain the target video identifier.
  • the terminal device 101 collects the preset image quality and the preset image resolution.
  • the specific way for the terminal device 101 to collect the preset image quality and the preset image resolution can be: the terminal device 101 displays the video format editing interface, wherein the video format editing interface includes a quality setting component and a resolution setting component; the terminal device 101 responds to the editing operation of the quality setting component to obtain the preset image quality; and the terminal device 101 responds to the editing operation of the resolution setting component to obtain the preset resolution.
  • the terminal device 101 generates a video acquisition request based on the target video identifier, the preset image quality and the preset resolution.
  • the terminal device 101 sends the video acquisition request to the server 102.
  • the server 102 can parse the video acquisition request to obtain the target video identifier, the preset image quality and the preset resolution.
  • the server 102 first searches for the target video whose video identifier is the target video identifier in the preset video library; then, the server 102 performs feature extraction processing on each frame image in the found target video to obtain the texture features of each frame image and the motion features of each frame image.
  • the server 102 also determines the initial bit rate of each frame image by parsing the preset image quality and the preset image resolution.
  • the server 102 can classify the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result indicating whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality. If the classification result indicates that there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, the server 102 will determine the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image. Finally, the server 102 may encode the target video according to the target bit rate of each frame image to obtain an encoded target video.
  • After obtaining the encoded target video, the server 102 sends it to the terminal device 101. After receiving the encoded target video, the terminal device 101 can decode it and display the decoded target video.
  • an embodiment of the present application provides a video encoding method.
  • FIG. 2 a flow chart of a video encoding method provided in an embodiment of the present application is shown.
  • the video encoding method shown in FIG. 2 can be executed by a server or a terminal device.
  • the video encoding method shown in FIG. 2 may include steps S201-S205:
  • the target video refers to the video to be encoded.
  • the video M is selected as the video to be encoded, so as to facilitate the storage and transmission of the video M
  • the video M is the target video.
  • each frame of the target video refers to each frame of the multiple frames of images that constitute the target video.
  • texture features refer to features that reflect the surface texture of an object in an image.
  • texture features in a video can also be referred to as spatial complexity.
  • texture features may include grayscale variation features, the gray-level co-occurrence matrix (GLCM), and other features used to characterize image texture.
  • GLCM: gray-level co-occurrence matrix
  • NCC: normalized cross-correlation
  • the specific method of obtaining the texture feature of each frame image can be: obtaining the grayscale value of each pixel in each frame image of the target video; constructing the grayscale co-occurrence matrix of each frame image according to the grayscale value of each pixel in each frame image; wherein the grayscale co-occurrence matrix of each frame image is used to characterize: the grayscale ratio between each pixel in each frame image and the pixel within a preset range from each pixel.
  • the preset range can be set manually or by the system, which is not limited here.
  • the preset range can be 5 pixels (i.e., the pixel within the preset range from a certain pixel is the pixel within 5 pixels from a certain pixel), 10 pixels (i.e., the pixel within the preset range from a certain pixel is the pixel within 10 pixels from a certain pixel), etc.
  • further, for each grayscale ratio in the grayscale co-occurrence matrix of each frame image, the number of occurrences of that grayscale ratio in the matrix is counted, yielding the grayscale change feature of each frame image.
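A minimal sketch of the co-occurrence construction and counting step described above, assuming (as a simplification of the "preset range") that only the immediate horizontal neighbour of each pixel is considered; the text's "grayscale ratio" corresponds here to an ordered gray-level pair:

```python
from collections import Counter

def glcm(gray, offset=(0, 1)):
    """Count co-occurring gray-level pairs for one pixel offset.
    `gray` is a 2-D list of gray levels; the returned Counter maps each
    (level_at_pixel, level_at_neighbour) pair to its occurrence count,
    which is the counting step that yields the grayscale change feature."""
    dr, dc = offset
    rows, cols = len(gray), len(gray[0])
    pairs = Counter()
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                pairs[(gray[r][c], gray[nr][nc])] += 1
    return pairs

img = [[0, 0, 1],
       [1, 2, 2],
       [2, 2, 2]]
counts = glcm(img)   # the flat region of 2s dominates: counts[(2, 2)] == 3
```

A frame with large uniform regions concentrates its counts in few pairs (low spatial complexity), while a highly textured frame spreads them widely.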
  • texture features may also include one or more of grayscale variation features, the grayscale co-occurrence matrix, Tamura texture features, wavelet-transform features, and other features used to characterize image texture, which are not limited here.
  • Tamura texture features refer to the expression of texture features proposed by Tamura et al.
  • the six components of Tamura texture features correspond to the six attributes of texture from a psychological perspective, namely coarseness, contrast, directionality, line-likeness, regularity, and roughness. It should be noted that the technical means of extracting the Tamura texture features and wavelet transform of each frame image in the video are commonly used by those skilled in the art, and will not be repeated here.
  • motion features refer to features that reflect the degree of motion of the image content of each frame of the video.
  • motion features can also be called temporal complexity.
  • motion features can include the perceptually weighted peak signal-to-noise ratio (XPSNR, from "A Low-Complexity Extension of the Perceptually Weighted Peak Signal-to-Noise Ratio for High-Resolution Video Quality Assessment"), and the displacement between each pixel block in each frame image and the corresponding pixel block in the target image of that frame image.
  • the target image of each frame image refers to an image whose difference between the image sequence number in the target video and the image sequence number of each frame image in the target video is less than or equal to a preset threshold.
  • the preset threshold can be set manually or by the system, which is not limited here.
  • the preset threshold can be 10, 6, 2, etc.
  • for example, if the preset threshold is 2 and the image sequence number of image A in the target video is 10, the target images of image A include the images whose sequence numbers in the target video are 8, 9, 11, and 12.
  • the displacement between each pixel block in each frame image and the corresponding pixel block in the target image of each frame image is also called motion estimation.
  • the specific method of performing feature extraction processing on each frame image in the target video to obtain the motion feature of each frame image can be: for each pixel block in each frame image, search for a target pixel block matching the pixel block in the target image of each frame image (i.e., the corresponding pixel block in the target image of each frame image); wherein any pixel block includes at least one pixel point, and the pixel block can also be called a macroblock of the image.
  • the specific method of searching for the target pixel block matching a pixel block can be: for each pixel block in each frame image, determine at least one overlapping pixel block of the pixel block in the target image of each frame image; based on the grayscale value of each pixel point in the pixel block and the grayscale value of each pixel point in each overlapping pixel block, obtain the pixel block similarity between the pixel block and each overlapping pixel block; and determine the overlapping pixel block with the largest pixel block similarity as the target pixel block.
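The block-matching search above can be sketched as an exhaustive search over candidate positions, scored by a pixel-block similarity. This sketch uses the negative sum of absolute differences (SAD) as the similarity; the embodiment leaves the measure open (NCC is another option), and the search radius and names here are illustrative.

```python
import numpy as np

def match_block(block, target, top, left, search=2):
    """Find the displacement (motion vector) of `block`, originally at
    position (top, left), inside the `target` frame by exhaustive search
    within +/- `search` pixels."""
    bh, bw = block.shape
    best, best_dxy = -np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # skip candidates that fall outside the target frame
            if y < 0 or x < 0 or y + bh > target.shape[0] or x + bw > target.shape[1]:
                continue
            cand = target[y:y + bh, x:x + bw]
            # similarity = negative SAD; larger is more similar
            sim = -np.abs(block.astype(int) - cand.astype(int)).sum()
            if sim > best:
                best, best_dxy = sim, (dy, dx)
    return best_dxy
```

Running this per pixel block yields the per-block displacements that the embodiment uses as a motion feature (motion estimation).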
  • the weighted peak signal-to-noise ratio is usually used to reflect the error between the original image and its compressed version.
  • the technical means for extracting the weighted peak signal-to-noise ratio are commonly used by those skilled in the art and will not be described in detail here.
  • the preset image quality refers to the image quality preset by humans or electronic devices.
  • the image quality can be the distortion of the image, or the clarity of the image perceived by the human eye, etc. In other words, there are many evaluation criteria for image quality, so they are not repeated here.
  • various kinds of distortion may occur in the process of video acquisition, compression, transmission, and storage, and any of them may reduce the image quality perceived by the human eye. Therefore, a minimum image quality that the target video must at least reach can be preset to ensure the overall image quality of the target video.
  • the preset image quality can be a specific value or a range of values, which is not limited here.
  • the preset image resolution refers to the image resolution of the target video preset by humans or electronic devices.
  • the image resolution of a video refers to the number of pixels included in each frame of the video.
  • the preset image resolution in the embodiment of the present application can be one or more.
  • the preset image quality may be expressed as a constant rate factor (CRF) preset by humans or the system to reflect the image quality of the video.
  • the specific method of determining the initial bit rate of each frame image can be: according to the preset image quality of the target video and the preset image resolution of the target video, determine the fixed quantization value (Constant Quantizer, QP) of each frame image in the target video; then, based on the fixed quantization value of each frame image and the data amount of each frame image, obtain the initial bit rate of each frame image.
  • QP: Constant Quantizer (fixed quantization value)
  • the classification result refers to an indication of whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality.
  • in other words, the classification result indicates whether the bit rate of each frame image can be reduced below its initial bit rate while keeping the image quality of each frame image at or above the preset image quality.
  • the classification result can be 1 or 0, and 1 is pre-set to indicate that when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image; and 0 indicates that when the image quality of each frame image reaches the preset image quality, there is no bit rate less than the initial bit rate of each frame image.
  • the texture features of each frame image can reflect the content complexity of each frame image (also known as spatial complexity), and the motion features can reflect the motion complexity of each frame image (also known as temporal complexity), and the purpose of encoding and compressing the video is to remove the redundancy of the spatial and temporal dimensions in the video. Therefore, the redundant information in each frame image can be determined through the texture features and motion features of each frame image, which is conducive to the subsequent determination of whether there is room for reducing the bit rate of each frame image.
  • Figure 3 shows a schematic diagram of the relationship between image quality and bit rate.
  • the horizontal axis in Figure 3 is the bit rate, in kbps; the vertical axis is the Visual Multimethod Assessment Fusion (VMAF) score used to evaluate the image quality of the video.
  • VMAF: Visual Multimethod Assessment Fusion
  • the preset image quality of the video M can be set to the image quality indicated when the VMAF is 94, and the preset image resolution of the video M is 540P, 720P and 1080P.
  • the light gray curve in Figure 3 indicates the actual measured correspondence between the bit rate and image quality of a certain frame of the video M when the image resolution is 540P; the dark gray curve indicates the actual measured correspondence between the bit rate and image quality of a certain frame of the video M when the image resolution is 720P; the black curve indicates the actual measured correspondence between the bit rate and image quality of a certain frame of the video M when the image resolution is 1080P.
  • the CRF bit rate control method can be used to determine the original code point as shown in Figure 3 according to the average bit rate and average image quality of the video M.
  • the bit rate of the original code point is the initial bit rate mentioned above, and the image resolution of the original code point is 720P.
  • however, when VMAF is 94, the bit rate corresponding to the black (1080P) curve is smaller than the bit rate of the original code point. Therefore, the bit rate of the original code point is not the minimum bit rate, i.e., the original code point is not the optimal code point; the bit rate of the optimal code point should be 750 kbps, with an image resolution of 1080P.
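The Figure 3 discussion amounts to: among the per-resolution rate-quality curves, pick the smallest bit rate whose quality reaches the preset VMAF. A hedged sketch follows; the sample curves are made-up numbers for illustration, not the measurements behind Figure 3.

```python
import numpy as np

def optimal_code_point(curves, target_vmaf):
    """curves maps resolution -> (bit rates in kbps, VMAF scores), both
    ascending. Returns (resolution, bit rate) of the cheapest point that
    reaches target_vmaf, or None if no curve reaches it."""
    best = None
    for res, (rates, vmafs) in curves.items():
        if vmafs[-1] < target_vmaf:
            continue  # this resolution never reaches the preset quality
        # interpolate the bit rate needed to reach target_vmaf on this curve
        rate = float(np.interp(target_vmaf, vmafs, rates))
        if best is None or rate < best[1]:
            best = (res, rate)
    return best

curves = {
    "540p":  ([250, 500, 1000, 2000], [70, 80, 88, 91]),
    "720p":  ([250, 500, 1000, 2000], [65, 78, 90, 94]),
    "1080p": ([250, 500, 1000, 2000], [60, 75, 92, 96]),
}
```

With these illustrative curves, the 1080p curve reaches VMAF 94 at a lower bit rate than 720p, mirroring how the original code point in Figure 3 turns out not to be optimal.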
  • Figure 4 shows another schematic diagram of the relationship between image quality and bit rate.
  • the horizontal axis in Figure 4 is the bit rate, in kbps; the vertical axis is VMAF.
  • the preset image quality of video M can be set to the image quality indicated when VMAF is 73, and the preset image resolution of video M is 540P, 720P and 1080P.
  • the light gray curve in Figure 4 represents the actual measured correspondence between the bit rate and image quality of a frame image in video M when the image resolution is 540P;
  • the dark gray curve represents the actual measured correspondence between the bit rate and image quality of a frame image in video M when the image resolution is 720P;
  • the black curve represents the actual measured correspondence between the bit rate and image quality of a frame image in video M when the image resolution is 1080P.
  • the CRF bit rate control method can be used to determine the original code point shown in Figure 4 based on the average bit rate and average image quality of video M.
  • the bit rate of the original code point is the initial bit rate mentioned above; its image resolution is 720P and its VMAF is 73.
  • the original code point determined by the bit rate control method may be the optimal code point or not; and the classification result indicates whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, which is equivalent to indicating whether the original code point is the optimal code point.
  • the target video is encoded according to the initial bit rate of each frame image to obtain the encoded target video.
  • the target video is classified according to the texture features of each frame image and the motion features of each frame image; the specific method for obtaining the classification result may be: calling the classification model, classifying the texture features of each frame image and the motion features of each frame image, and obtaining the classification result.
  • the training process of the classification model is a supervised training process, so the initial classification model can be trained through multiple training videos and the classification labels of each training video to obtain the classification model.
  • the classification label is used to indicate whether there is a bit rate less than the initial bit rate of each frame image in each training video when the image quality of each frame image in each training video reaches the preset image quality.
  • the initial bit rate of each frame image in each training video is obtained based on the preset image quality of each training video and the preset image resolution of each training video.
  • any of a variety of boosting algorithms (machine learning algorithms that can be used to reduce bias in supervised learning), such as adaptive boosting (AdaBoost), gradient boosting decision trees (Gradient Boosting, GBDT), or XGBoost (eXtreme Gradient Boosting, a decision-tree-based ensemble machine learning algorithm using a gradient boosting framework), can be used to supervise the training of the neural network model (i.e., the above-mentioned initial classification model) to obtain the classification model.
  • the neural network model can be one of a variety of models with classification capabilities such as a recurrent neural network model (Recurrent Neural Network, RNN), a convolutional neural network model (Convolutional Neural Network, CNN), and a decision tree model.
  • each frame image may have one or more texture features
  • each frame image may have one or more motion features
  • the target video is classified according to the texture features of each frame image and the motion features of each frame image; the specific method for obtaining the classification result may also be: splicing the texture features of each frame image to obtain the target texture features of each frame image; splicing the motion features of each frame image to obtain the target motion features of each frame image; and finally, classifying the target video according to the target texture features and target motion features of each frame image to obtain the classification result.
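The splicing step above is plain concatenation of the per-frame feature vectors into a single model input. A small sketch, in which the individual feature values are placeholders, not real extracted features:

```python
import numpy as np

# Hypothetical per-frame features: several texture descriptors and several
# motion descriptors are spliced (concatenated) into target feature vectors.
texture_feats = [np.array([0.1, 0.2]),    # e.g. a grayscale-change statistic
                 np.array([0.3])]         # e.g. a GLCM-derived statistic
motion_feats  = [np.array([0.7]),         # e.g. an XPSNR-style score
                 np.array([1.0, -2.0])]   # e.g. a block displacement (dy, dx)

target_texture = np.concatenate(texture_feats)   # target texture feature
target_motion  = np.concatenate(motion_feats)    # target motion feature
frame_vector   = np.concatenate([target_texture, target_motion])  # model input
```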
  • the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; at the same time, the target bit rate of each frame image is less than the initial bit rate of each frame image.
  • the bit rate of the original code point refers to the initial bit rate of the image
  • the bit rate of the optimal code point refers to the bit rate of the image when the image quality reaches the preset image quality, and the bit rate of the optimal code point is less than the bit rate of the original code point. Therefore, in the specific implementation, the target bit rate of each frame image is equivalent to the bit rate of the optimal code point; that is, the determined target bit rate of each frame image is equivalent to the bit rate of the optimal code point determined for each frame image.
  • the specific method of determining the target bit rate of each frame image can be: calling the regression prediction model, performing regression prediction processing on the texture features of each frame image and the motion features of each frame image, and obtaining the target bit rate of each frame image.
  • the training process of the regression prediction model is a supervised training process, so the initial regression prediction model can be trained through multiple training videos and the training bit rates of each frame image in each training video to obtain the regression prediction model.
  • the training bit rate refers to the bit rate of each frame image of each training video when the image quality of each frame image of each training video reaches the training image quality; the training bit rate is less than the initial bit rate of each frame image of each training video.
  • the specific meaning of the training image quality can be referred to the preset image quality, which will not be repeated here.
  • any of a variety of boosting algorithms (machine learning algorithms that can be used to reduce bias in supervised learning), such as adaptive boosting (AdaBoost), gradient boosting decision trees (Gradient Boosting, GBDT), or XGBoost (eXtreme Gradient Boosting, a decision-tree-based ensemble machine learning algorithm using a gradient boosting framework), can be used to supervise the training of the neural network model (i.e., the above-mentioned initial regression prediction model) to obtain the regression prediction model.
  • the neural network model can be one of a variety of models with regression prediction capabilities such as deep neural network models (Deep Neural Networks, DNN), convolutional neural network models (Convolutional Neural Network, CNN), and decision tree models.
  • each frame image may have one or more texture features
  • each frame image may have one or more motion features
  • the specific method of determining the target bit rate of each frame image may also be: splicing the texture features of each frame image to obtain the target texture feature of each frame image; splicing the motion features of each frame image to obtain the target motion feature of each frame image; and finally, performing regression prediction on the target video according to the target texture features and target motion features of each frame image to obtain the target bit rate of each frame image.
  • the specific method of encoding the target video according to the target bit rate of each frame image to obtain the encoded target video may be: according to the target bit rate of each frame image and the data amount of each frame image, the corresponding image in the target video is encoded to obtain the encoded target video.
  • in step S204, the target video is encoded according to the initial bit rate of each frame image; the specific method for obtaining the encoded target video can be: encoding the corresponding image in the target video according to the initial bit rate of each frame image and the data amount of each frame image, to obtain the encoded target video.
  • the texture features and motion features of each frame image are first used to determine whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and to determine whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can be further used to obtain the target bit rate of each frame image that can ensure that the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the embodiment of the present application will further use the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
  • the embodiment of the present application provides another video encoding method.
  • FIG. 5 a flow chart of another video encoding method provided by the embodiment of the present application is shown.
  • the video encoding method shown in FIG. 5 can be executed by the server or terminal device shown in FIG. 1.
  • the video encoding method shown in FIG. 5 may include the following steps:
  • S502 Determine an initial bit rate of each frame of the image based on a preset image quality of the target video and a preset image resolution of the target video.
  • steps S501-S502 may refer to the specific implementation of steps S201-S202, which will not be described in detail here.
  • the decision result of each decision classifier is used to indicate whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality.
  • the classification model includes multiple decision classifiers.
  • each decision classifier in the classification model is a small neural network model for classification.
  • the training process of the classification model may specifically be as follows: first, obtain multiple training video sets; wherein any training video set includes at least one training video, the target decision results of each training video in the at least one training video, and the texture features and motion features of each frame image of each training video in the at least one training video. Then, call multiple decision classifiers in the initial classification model, perform decision processing on the texture features and motion features of each frame image of each training video in the multiple training video sets, and obtain the training decision results of each training video in the multiple training video sets. Finally, according to the training decision results of each training video in the multiple training video sets and the target decision results of each training video in the multiple training video sets, the initial classification model is trained to obtain the classification model.
  • multiple decision classifiers in the initial classification model are called to perform decision processing on each training video in the multiple training video sets.
  • the specific process of performing decision processing based on the texture features and motion features of each frame image of each training video, to obtain the training decision results of each training video in the multiple training video sets, can be: traversing each decision classifier in the initial classification model, and determining, from the multiple training video sets, the target training video set corresponding to the currently traversed target decision classifier based on the training decision results of each training video in the training video set corresponding to the previously traversed decision classifier.
  • the method for determining the target training video set corresponding to the currently traversed target decision classifier from multiple training video sets can be: when the target decision classifier is the first traversed decision classifier, first select any training video set from the multiple training video sets as the target training video set of the first traversed decision classifier; when the target decision classifier is not the first traversed decision classifier, based on the training videos whose training decision results in the target training video set of the last traversed decision classifier are different from the target decision results, and a training video set selected from the remaining training video sets of the multiple training video sets, the target training video set of the target decision classifier is obtained.
  • the decision weight of the target decision classifier can be determined according to the training decision results of each training video in the target training video set and the target decision result, so as to obtain the decision weight of each decision classifier in the initial classification model.
  • the decision weights of each decision classifier in the initial classification model are evenly distributed. For example, if the initial classification model includes 10 decision classifiers, the decision weights of each decision classifier in the initial classification model are all 0.1. Then, the classification accuracy of each traversed decision classifier is determined by whether the training decision results of each training video in the target training video set of each traversed decision classifier are the same as its target decision result; finally, the decision weights of each traversed decision classifier are obtained through the classification accuracy of each traversed decision classifier.
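One way to turn per-classifier accuracy into decision weights, in the spirit of the accuracy-based weighting described above, is the standard AdaBoost alpha followed by normalization so the weights sum to 1. The alpha formula is an assumption of this sketch; the embodiment does not give an explicit formula.

```python
import math

def decision_weights(accuracies):
    """Map classification accuracies (each > 0.5 assumed, as in a useful
    AdaBoost weak learner) to normalized decision weights that sum to 1."""
    # standard AdaBoost alpha: higher accuracy -> larger weight
    alphas = [0.5 * math.log(a / (1 - a)) for a in accuracies]
    total = sum(alphas)
    return [a / total for a in alphas]
```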
  • the decision weight of each decision classifier in the classification model is a value less than 1, and the decision weights of all decision classifiers in the classification model add up to 1.
  • the decision weights of each decision classifier in the classification model are obtained based on the training decision results of each training video in the target training video set of each decision classifier during training and the target decision results. This is not repeated here.
  • the specific method of obtaining the classification result can be: multiplying the decision weights of each decision classifier by the decision results of the corresponding decision classifier to obtain the weighted decision results of each decision classifier; finally, adding the weighted decision results of all decision classifiers in the classification model to obtain the classification result.
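The weighted combination above can be sketched as follows. The threshold rules standing in for the decision classifiers, the threshold of 0.5, and the weight values are all illustrative; in the embodiment each classifier is a small trained neural network.

```python
import numpy as np

def weighted_classification(features, classifiers, weights, threshold=0.5):
    """Each classifier returns 1 ("a lower bit rate exists") or 0; the
    weights are each < 1 and sum to 1. The weighted decision results are
    summed, and the sum is thresholded to give the final classification."""
    decisions = np.array([clf(features) for clf in classifiers])
    score = float(np.dot(weights, decisions))   # sum of weight * decision
    return (1 if score >= threshold else 0), score

# stand-in decision classifiers: simple per-feature threshold rules
classifiers = [lambda f: int(f[0] > 0.2),
               lambda f: int(f[1] > 0.5),
               lambda f: int(f[2] > 0.8)]
weights = np.array([0.5, 0.3, 0.2])             # learned weights, sum to 1
```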
  • the specific method of determining the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image can be: calling each regression predictor in the regression prediction model, predicting the texture features of each frame image and the motion features of each frame image, and obtaining the predicted bit rate of each regression predictor; based on the regression weight of each regression predictor and the predicted bit rate of each regression predictor, the target bit rate of each frame image is obtained.
  • the regression weights of each regression predictor can be the same.
  • the regression weights of each regression predictor can be 1.
  • the training process of the regression prediction model may specifically be as follows: first, obtain a training video, the training bit rate of each frame image of the training video, and the texture features and motion features of each frame image of the training video; wherein the training bit rate of each frame image of the training video refers to the bit rate of each frame image when the image quality of each frame image of the training video reaches the training image quality, and the training bit rate of each frame image is less than the initial bit rate of the corresponding frame image of the training video. Then, call each initial regression predictor in the initial regression prediction model to predict on the texture features and motion features of each frame image of the training video, and obtain the predicted bit rate of each initial regression predictor. Finally, according to the predicted bit rate of each initial regression predictor and the training bit rate of each frame image of the training video, train the initial regression prediction model to obtain the regression prediction model.
  • the process of calling each initial regression predictor in the initial regression prediction model, predicting the texture features and motion features of each frame image of the training video, and obtaining the predicted bit rate of each initial regression predictor can be: traversing each initial regression predictor in the initial regression prediction model, and calling the currently traversed initial regression predictor, predicting the texture features and motion features of each frame image of the training video, and obtaining the predicted bit rate of the currently traversed initial regression predictor.
  • the initial regression prediction model is trained according to the prediction bit rates of each initial regression predictor and the training bit rates of each frame image of the training video, and the process of obtaining the regression prediction model can be: based on the training bit rates of each frame image of the training video and the prediction bit rates of the traversed initial regression predictors, the residual bit rate of the currently traversed initial regression predictor is determined; according to the residual bit rate of the currently traversed initial regression predictor and the residual bit rates of the traversed initial regression predictors, the initial regression prediction model is trained to obtain the regression prediction model.
  • each initial regression predictor in the initial regression prediction model processes the training video and predicts a predicted bit rate; then, the residual bit rate of the currently traversed initial regression predictor is obtained by subtracting from the training bit rate of the training video the predicted bit rates of the previously traversed initial regression predictors and the predicted bit rate of the currently traversed initial regression predictor. The initial regression prediction model is then trained by updating each initial regression predictor in the direction of reducing its residual bit rate, thereby obtaining the regression prediction model.
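The residual training loop above can be sketched with very small stand-in predictors. Here each regression predictor is a one-split stump fitted to the residual bit rate left by the previous predictors; in the embodiment the predictors are trained models, so this is illustrative only.

```python
import numpy as np

def fit_stump(x, residual):
    """One regression predictor: split x at its median and predict the
    mean residual on each side of the split."""
    t = np.median(x)
    left = residual[x <= t].mean()
    right = residual[x > t].mean() if np.any(x > t) else left
    return (t, left, right)

def stump_predict(stump, x):
    t, left, right = stump
    return np.where(x <= t, left, right)

def fit_residual_boosting(x, train_bitrates, n_predictors=3):
    """Fit each predictor to the residual bit rate of the predictors
    traversed before it (the residual training described above)."""
    residual = train_bitrates.astype(float).copy()
    stumps = []
    for _ in range(n_predictors):
        s = fit_stump(x, residual)
        residual -= stump_predict(s, x)   # residual passed to the next one
        stumps.append(s)
    return stumps

def boosted_predict(stumps, x):
    # the final predicted bit rate sums the predictions of all predictors
    return sum(stump_predict(s, x) for s in stumps)
```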
  • the target bit rate and target image resolution of each frame image may also be determined according to the texture features of each frame image and the motion features of each frame image.
  • the target image resolution of each frame image in the video may be the same.
  • the target image resolution refers to the image resolution corresponding to the target bit rate of each frame image when the image quality of each frame image reaches the preset image quality.
  • the specific method for determining the target bit rate and target image resolution of each frame image can be: calling the regression prediction model, performing regression prediction processing on the texture features of each frame image and the motion features of each frame image, and obtaining the target bit rate and target image resolution of each frame image.
  • when the regression prediction model is to predict the target bit rate and target image resolution of each frame image, the initial regression prediction model can be trained through multiple training videos, as well as the training bit rate and training image resolution of each frame image of each training video, thereby obtaining the regression prediction model.
  • the training image resolution refers to the image resolution corresponding to the training bit rate of each frame image when the image quality of each frame image reaches the training image quality.
  • step S506 may refer to the specific implementation of step S205, which will not be described in detail here.
  • Figure 6 shows a schematic diagram of a video encoding process.
  • the specific process of decoding the target video can be: converting the original video format of the target video into a preset video format.
  • the preset video format can be YUV (a color encoding method; common YUV formats include YUV420p, YUV420sp, NV21, etc.), RGB (a color standard), Raw (an unprocessed and uncompressed video format), etc., which are not limited here.
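In practice the container-to-YUV conversion above is typically delegated to a tool such as ffmpeg. A hypothetical wrapper that builds the command line is sketched below; the flags are standard ffmpeg options, while the file names and the wrapper itself are placeholders, not part of the embodiment.

```python
def decode_to_yuv_cmd(src, dst):
    # -pix_fmt yuv420p requests the YUV420p preset video format mentioned
    # above; the .yuv output holds raw, uncompressed frames. -y overwrites
    # an existing output file.
    return ["ffmpeg", "-y", "-i", src, "-pix_fmt", "yuv420p", dst]
```

The returned list could be passed to `subprocess.run` to perform the actual decoding.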
  • texture features and motion features of each frame of the image can be extracted, and the texture features and motion features of each frame of the image can be input into a classification model to obtain a classification result. If the classification result indicates that the original code point is the optimal code point (that is, the classification result indicates that when the image quality of each frame of the image reaches the preset image quality, there is no bit rate less than the initial bit rate of each frame of the image), then according to the initial bit rate of each frame of the image, the target video is encoded to obtain an encoded target video.
  • the classification result indicates that the original code point is not the optimal code point (i.e., the classification result indicates that when the image quality of each frame image reaches the preset image quality, there is a bit rate that is less than the initial bit rate of each frame image)
  • the texture features and motion features of each frame image are input into the regression prediction model to obtain the target bit rate of each frame image.
  • the target video is encoded to obtain the encoded target video.
  • the texture features and motion features of each frame image are first used to determine whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and to determine whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can be further used to obtain the target bit rate of each frame image that can ensure that the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the embodiment of the present application will further use the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
  • the embodiment of the present application will obtain the final classification result based on the multiple decision results obtained by processing the texture features and motion features of each frame image according to each decision classifier in the classification model, which is conducive to improving the accuracy of the classification result.
  • the embodiment of the present application will also obtain the target bit rate of each frame image according to the multiple predicted bit rates of each frame image obtained by processing the texture features and motion features of each frame image according to each regression predictor in the regression prediction model, which is conducive to improving the accuracy of the predicted bit rate.
  • the present application also discloses a video encoding device.
  • the video encoding device can be a computer program (including program code) running in the above-mentioned computer device.
  • the video encoding device can execute the video encoding method shown in Figures 2 and 5. Referring to Figure 7, the video encoding device can at least include: a processing unit 701 and an encoding unit 702.
  • the processing unit 701 is used to perform feature extraction processing on each frame image in the target video to obtain texture features and motion features of each frame image;
  • the processing unit 701 is further used to determine the initial bit rate of each frame image based on the preset image quality of the target video and the preset image resolution of the target video;
  • the processing unit 701 is further used to classify the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result; wherein the classification result is used to indicate whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches a preset image quality;
  • the processing unit 701 is further used to determine the target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image if the classification result indicates that when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image; wherein the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; the target bit rate of each frame image is less than the initial bit rate of each frame image.
  • the encoding unit 702 is used to encode the target video according to the target bit rate of each frame image to obtain the encoded target video.
  • when the processing unit 701 classifies the target video according to the texture features of each frame image and the motion features of each frame image to obtain the classification result, it can also be used to perform:
  • each decision classifier in the classification model performing decision processing on the target video according to the texture features of each frame image and the motion features of each frame image, and obtaining the decision results of each decision classifier; wherein the decision results are used to indicate whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality;
  • the classification result is obtained according to the decision weights of each decision classifier and the decision results of each decision classifier.
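The weighted combination of per-classifier decisions can be sketched as a weighted vote. The threshold at half the total weight is a simplifying assumption; the text above does not fix a specific combination rule.

```python
def weighted_vote(decisions, weights):
    """Combine binary decision results (1 = a lower bit rate exists,
    0 = it does not) using the per-classifier decision weights."""
    score = sum(d * w for d, w in zip(decisions, weights))
    return 1 if score >= sum(weights) / 2 else 0
```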
  • the processing unit 701 may also be configured to execute:
  • obtaining multiple training video sets, where any training video set includes at least one training video, target decision results of each training video in the at least one training video, and texture features and motion features of each frame image of each training video in the at least one training video;
  • training the initial classification model to obtain a classification model.
  • when the processing unit 701 calls multiple decision classifiers in the initial classification model to perform decision processing on the texture features and motion features of each frame image of each training video in the multiple training video sets, so as to obtain the training decision results of each training video in the multiple training video sets, it can also be specifically used to execute:
  • the target decision classifier is called to perform decision processing on each training video in the target training video set according to the texture features and motion features of each frame image of each training video in the target training video set, so as to obtain the training decision results of each training video in the target training video set.
  • when the processing unit 701 determines the target bit rate of each frame image according to the texture feature of each frame image and the motion feature of each frame image, it can be specifically used to perform:
  • calling each regression predictor in the regression prediction model to perform prediction processing on the texture features of each frame image and the motion features of each frame image, so as to obtain the prediction bit rate of each regression predictor;
  • obtaining the target bit rate of each frame image according to the prediction bit rates of each regression predictor.
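One plausible way to fuse the per-predictor bit rates into a single target bit rate is a (weighted) average; the exact fusion rule is not specified in the text, so this sketch is only illustrative.

```python
def fuse_bitrates(predicted, weights=None):
    """Fuse the bit rates predicted by each regression predictor into one
    target bit rate; plain averaging when no weights are given."""
    if weights is None:
        weights = [1.0] * len(predicted)
    return sum(p * w for p, w in zip(predicted, weights)) / sum(weights)
```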
  • processing unit 701 may also be configured to execute:
  • calling each initial regression predictor in the initial regression prediction model to perform prediction processing on the texture features and motion features of each frame image of the training video, so as to obtain the prediction bit rate of each initial regression predictor;
  • training the initial regression prediction model according to the prediction bit rates of each initial regression predictor and the training bit rates of each frame image of the training video to obtain a regression prediction model.
  • when the processing unit 701 calls each initial regression predictor in the initial regression prediction model to perform prediction processing on the texture features and motion features of each frame image of the training video, so as to obtain the prediction bit rate of each initial regression predictor, it can be specifically used to perform:
  • when the processing unit 701 trains the initial regression prediction model according to the prediction bit rates of each initial regression predictor and the training bit rates of each frame image of the training video to obtain the regression prediction model, it can be specifically used to execute:
  • the initial regression prediction model is trained according to the residual bit rate of the currently traversed initial regression predictor and the residual bit rates of the traversed initial regression predictors to obtain the regression prediction model.
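The residual-based training described above resembles boosting: each predictor is fitted to the residual bit rate left by the previously traversed predictors. A minimal sketch under that assumption, with `fit_one` standing in for whatever base regressor is used:

```python
def fit_residual_predictors(features, targets, fit_one, rounds):
    """Train predictors sequentially: each new predictor is fitted to the
    residual bit rate left over by the already-traversed predictors, so the
    sum of the ensemble's outputs approaches the training bit rates."""
    predictors, residual = [], list(targets)
    for _ in range(rounds):
        p = fit_one(features, residual)                 # fit one regressor to the residual
        predictors.append(p)
        residual = [r - p(x) for r, x in zip(residual, features)]
    return predictors
```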
  • the texture features of each frame image include at least one of the following: grayscale change features, grayscale co-occurrence matrix;
  • the motion features of each frame image include at least one of the following: weighted peak signal-to-noise ratio, displacement between each pixel block in each frame image and the corresponding pixel block in the target image of each frame image; wherein, the target image of each frame image refers to an image whose difference between the image sequence number in the target video and the image sequence number of each frame image in the target video is less than or equal to a preset threshold.
  • when the processing unit 701 classifies the target video according to the texture features of each frame image and the motion features of each frame image to obtain the classification result, it can be specifically used to execute:
  • the texture features of each frame image are spliced to obtain the target texture features of each frame image;
  • the motion features of each frame image are spliced to obtain the target motion features of each frame image;
  • according to the target texture features of each frame image and the target motion features of each frame image, the target video is classified and processed to obtain a classification result.
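The splicing above amounts to concatenating the individual texture features into one target texture feature and the motion features into one target motion feature. A minimal sketch with illustrative feature names (the actual feature set and ordering are not fixed by the text):

```python
def build_frame_feature(gray_change, glcm_stats, xpsnr, block_disp):
    """Splice texture features into one target texture vector, motion
    features into one target motion vector, then concatenate both."""
    texture = list(gray_change) + list(glcm_stats)   # target texture feature
    motion = [xpsnr] + list(block_disp)              # target motion feature
    return texture + motion
```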
  • the various units in the video encoding device shown in FIG. 7 are divided based on logical functions. The above units may be separately or jointly merged into one or several other units, or one (or some) of the units may be further divided into multiple functionally smaller units. Either way, the same operations can be achieved without affecting the technical effects of the embodiments of the present application.
  • the above-mentioned video encoding device can also include other units. In practical applications, these functions can also be implemented with the assistance of other units, and can be implemented by the collaboration of multiple units.
  • a video encoding device as shown in FIG. 7 can be constructed by running a computer program (including program code) capable of executing each step of the method shown in FIG. 2 or FIG. 5 on a general-purpose computing device, such as a computer device including processing elements and storage elements such as a central processing unit (CPU), a random access storage medium (RAM), and a read-only storage medium (ROM); the video encoding method of the embodiment of the present application can thereby be implemented.
  • the computer program can be recorded on, for example, a computer storage medium, and loaded into the above-mentioned computer device through the computer storage medium, and run therein.
  • the texture features and motion features of each frame image are first used to determine whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, that is, whether the bit rate of each frame image can be further reduced.
  • the texture features and motion features of each frame image can be further used to obtain the target bit rate of each frame image that can ensure that the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the embodiment of the present application will further obtain the target bit rate less than the initial bit rate of each frame image through the texture features and motion features of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
  • the present application also provides an electronic device. See Figure 8, which is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • the electronic device shown in Figure 8 may include at least a processor 801, an input interface 802, an output interface 803, and a computer storage medium 804.
  • the processor 801, the input interface 802, the output interface 803, and the computer storage medium 804 may be connected via a bus or other means.
  • the computer storage medium 804 may reside in the memory of the electronic device.
  • the computer storage medium 804 is used to store a computer program.
  • the computer program includes program instructions.
  • the processor 801 is used to execute the program instructions stored in the computer storage medium 804.
  • the processor 801, or central processing unit (CPU), is the computing core and control core of the electronic device; it is suitable for implementing one or more instructions, and specifically for loading and executing one or more instructions to implement the above video encoding method process or the corresponding functions.
  • the embodiment of the present application also provides a computer storage medium (Memory), which is a memory device in an electronic device for storing programs and data. It is understandable that the computer storage medium here can include both the built-in storage medium in the terminal and the extended storage medium supported by the terminal.
  • the computer storage medium provides a storage space, which stores the operating system of the terminal.
  • one or more instructions suitable for being loaded and executed by the processor 801 are also stored in the storage space, and these instructions can be one or more computer programs (including program codes).
  • the computer storage medium here can be a high-speed random access memory (RAM), or a non-volatile memory, such as at least one disk storage; optionally, it can also be at least one computer storage medium located away from the aforementioned processor.
  • the processor 801 may load and execute one or more instructions stored in a computer storage medium to implement the corresponding steps of the method in the above-mentioned video encoding method embodiments of FIG. 2 and FIG. 5 .
  • the processor 801 may load and execute the following steps for one or more instructions in a computer storage medium:
  • the processor 801 performs feature extraction processing on each frame image in the target video to obtain texture features of each frame image and motion features of each frame image;
  • the processor 801 determines an initial bit rate of each frame image based on a preset image quality of the target video and a preset image resolution of the target video;
  • the processor 801 classifies the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result; wherein the classification result is used to indicate whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches a preset image quality;
  • if the classification result indicates that, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, the processor 801 determines the target bit rate of each frame image according to the texture feature of each frame image and the motion feature of each frame image; wherein the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality; the target bit rate of each frame image is less than the initial bit rate of each frame image;
  • the processor 801 encodes the target video according to the target bit rate of each frame image to obtain an encoded target video.
  • when the processor 801 classifies the target video according to the texture features of each frame image and the motion features of each frame image to obtain the classification result, it can be specifically used to execute:
  • each decision classifier in the classification model performing decision processing on the target video according to the texture features of each frame image and the motion features of each frame image, and obtaining the decision results of each decision classifier; wherein the decision results are used to indicate whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality;
  • the classification result is obtained according to the decision weights of each decision classifier and the decision results of each decision classifier.
  • the processor 801 may be further configured to execute:
  • any training video set includes at least one training video, target decision results of each training video in at least one training video, and texture features and motion features of each frame image of each training video in at least one training video;
  • the initial classification model is trained to obtain a classification model.
  • when the processor 801 calls multiple decision classifiers in the initial classification model to perform decision processing on the texture features and motion features of each frame image of each training video in the multiple training video sets, so as to obtain the training decision results of each training video in the multiple training video sets, it can be specifically used to execute:
  • the target decision classifier is called to perform decision processing on each training video in the target training video set according to the texture features and motion features of each frame image of each training video in the target training video set, so as to obtain the training decision results of each training video in the target training video set.
  • when the processor 801 determines the target bit rate of each frame image according to the texture feature of each frame image and the motion feature of each frame image, it can be specifically used to perform:
  • each regression predictor in the regression prediction model, performing prediction processing on the texture features of each frame image and the motion features of each frame image, and obtaining the prediction bit rate of each regression predictor;
  • a target bit rate of each frame image is obtained.
  • the processor 801 may also be configured to execute:
  • each initial regression predictor in the initial regression prediction model performing prediction processing on the texture features and motion features of each frame image of the training video, and obtaining the prediction bit rate of each initial regression predictor;
  • the initial regression prediction model is trained according to the prediction bit rates of each initial regression predictor and the training bit rates of each frame image of the training video to obtain a regression prediction model.
  • when the processor 801 calls each initial regression predictor in the initial regression prediction model to perform prediction processing on the texture features and motion features of each frame image of the training video, so as to obtain the prediction bit rate of each initial regression predictor, it can be specifically used to execute:
  • when the processor 801 trains the initial regression prediction model according to the prediction bit rates of each initial regression predictor and the training bit rates of each frame image of the training video to obtain the regression prediction model, it can be specifically used to execute:
  • the initial regression prediction model is trained according to the residual bit rate of the currently traversed initial regression predictor and the residual bit rates of the traversed initial regression predictors to obtain the regression prediction model.
  • the texture features of each frame image include at least one of the following: grayscale change features, grayscale co-occurrence matrix;
  • the motion features of each frame image include at least one of the following: weighted peak signal-to-noise ratio, displacement between each pixel block in each frame image and the corresponding pixel block in the target image of each frame image; wherein the target image of each frame image refers to an image whose difference between the image sequence number in the target video and the image sequence number of each frame image in the target video is less than or equal to a preset threshold;
  • when the processor 801 classifies the target video according to the texture features of each frame image and the motion features of each frame image to obtain the classification result, it can be specifically used to execute:
  • the texture features of each frame image are spliced to obtain the target texture features of each frame image;
  • the motion features of each frame image are spliced to obtain the target motion features of each frame image;
  • according to the target texture features of each frame image and the target motion features of each frame image, the target video is classified and processed to obtain a classification result.
  • the texture features and motion features of each frame image are first used to determine whether there is a bit rate less than the initial bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and to determine whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can be further used to obtain the target bit rate of each frame image that can ensure that the image quality reaches the preset image quality and is less than the initial bit rate of each frame image.
  • the embodiment of the present application will further use the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
  • the embodiment of the present application provides a computer program product or a computer program, which includes a computer instruction, and the computer instruction is stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the electronic device executes the method embodiment shown in Figures 2 and 5.
  • the computer-readable storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.
  • the video encoding method in the embodiments of the present application is mainly illustrated by taking the field of video transmission and storage as an example.
  • the video encoding method in the embodiments of the present application can also be applied to other scenes related to video encoding and decoding, such as video display, which is not limited here.


Abstract

Embodiments of the present application relate to the field of computer technology and disclose a video encoding method, apparatus, and storage medium. The method includes: classifying a target video according to the texture features and motion features of each frame image to obtain a classification result; if the classification result indicates that, when the image quality of each frame image reaches a preset image quality, there is a bit rate less than the initial bit rate of each frame image, determining a target bit rate of each frame image according to the texture features and motion features of each frame image; and finally, encoding the target video according to the target bit rate of each frame image to obtain an encoded target video. By adopting the embodiments of the present application, the amount of data of the encoded video can be reduced while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.

Description

Video encoding method, apparatus, and storage medium

Technical Field

This application relates to the field of computer technology, and in particular to a video encoding method, apparatus, and storage medium.

Background

Bit rate control is a method of controlling the size of a video file and the video image quality by deciding how many bits to allocate to each frame image. A commonly used bit rate control mode for video encoding is the Constant Rate Factor (CRF), in which the image quality of each frame image in the video is kept constant while the bit rate (the number of bits transmitted per unit time) is variable. CRF mainly selects an image quality parameter and an image resolution (also referred to as selecting the original code point) according to the average bit rate and average image quality of the video, so as to encode the video. With the video quality guaranteed, the smaller the bit rate, the smaller the amount of video data and the more convenient the transmission; however, in the multi-resolution case, the bit rate corresponding to the image resolution of the original code point at the video quality indicated by the image quality parameter is not necessarily the minimum bit rate. Therefore, how to decide the minimum bit rate of each frame image in a video while guaranteeing the image quality of the video, so as to encode the video, is a problem that currently needs to be solved.

Summary

Embodiments of the present application provide a video encoding method, apparatus, and storage medium, which can reduce the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches a preset image quality, which is beneficial to the storage and transmission of the video.
In one aspect, an embodiment of the present application provides a video encoding method, including:

performing feature extraction processing on each frame image in a target video to obtain texture features of each frame image and motion features of each frame image;

determining an initial bit rate of each frame image based on a preset image quality of the target video and a preset image resolution of the target video;

classifying the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result, where the classification result is used to indicate whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image;

if the classification result indicates that, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, determining a target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image, where the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and the target bit rate of each frame image is less than the initial bit rate of each frame image; and

encoding the target video according to the target bit rate of each frame image to obtain an encoded target video.
In one aspect, an embodiment of the present application provides a video encoding apparatus, including a processing unit and an encoding unit, where:

the processing unit is configured to perform feature extraction processing on each frame image in a target video to obtain texture features of each frame image and motion features of each frame image;

the processing unit is further configured to determine an initial bit rate of each frame image based on a preset image quality of the target video and a preset image resolution of the target video;

the processing unit is further configured to classify the target video according to the texture features of each frame image and the motion features of each frame image to obtain a classification result, where the classification result is used to indicate whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image;

the processing unit is further configured to, if the classification result indicates that, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, determine a target bit rate of each frame image according to the texture features of each frame image and the motion features of each frame image, where the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and the target bit rate of each frame image is less than the initial bit rate of each frame image; and

the encoding unit is further configured to encode the target video according to the target bit rate of each frame image to obtain an encoded target video.
In one aspect, an embodiment of the present application provides an electronic device, including an input interface and an output interface, and further including:

a processor adapted to implement one or more instructions; and

a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the above video encoding method.

In one aspect, an embodiment of the present application provides a computer storage medium storing computer program instructions that, when executed by a processor, are used to perform the above video encoding method.

In one aspect, an embodiment of the present application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium; a processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, so as to perform the above video encoding method.
In the embodiments of the present application, the texture features and motion features of each frame image are first used to determine whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, thereby determining whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can further be used to obtain a target bit rate of each frame image that ensures the image quality reaches the preset image quality and is less than the initial bit rate of each frame image. In other words, after determining that the bit rate of each frame image can be further reduced, the embodiments of the present application further use the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a video encoding system provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a video encoding method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of the relationship between image quality and bit rate provided by an embodiment of the present application;

FIG. 4 is another schematic diagram of the relationship between image quality and bit rate provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of another video encoding method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a video encoding process provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a video encoding apparatus provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
With the continuous development of Internet technology, artificial intelligence (AI) technology has also developed further. AI technology refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, AI is a comprehensive technology of computer science; it seeks to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence, endowing the machine with functions such as perception, reasoning, and decision-making. Correspondingly, AI technology is a comprehensive discipline that mainly includes computer vision (CV), speech processing, natural language processing, and machine learning (ML)/deep learning.

Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computer devices simulate or implement human learning behavior to acquire new knowledge or skills, and how to reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of AI and the fundamental way to make computer devices intelligent. Deep learning is a technology for machine learning using deep neural network systems. Machine learning/deep learning typically includes techniques such as artificial neural networks, reinforcement learning (RL), supervised learning, unsupervised learning, and contrastive learning. Supervised learning refers to model training using training samples whose categories are known (labeled), while unsupervised learning refers to model training using training samples whose categories are unknown (unlabeled).
In addition, a video is a continuous image sequence composed of consecutive image frames. Video encoding is a way of converting a file in an original video format into a file in another video format through compression technology. Due to the persistence of vision of the human eye, when a frame sequence is played at a certain rate, we see a video with continuous motion. Since consecutive images are highly similar, the original video can be encoded and compressed to remove the redundancy in the spatial and temporal dimensions, which facilitates storage and transmission.

In addition, bit rate refers to the number of data bits transmitted per unit time, and the bit rate of each frame image in a video is related to the amount of video data. The amount of video data varies with the compression ratio. Generally, the larger the video compression ratio, the smaller the amount of video data and the more convenient the transmission; but the larger the compression ratio, the worse the image quality of the video. Therefore, the image quality and the data amount of a video can be balanced by controlling the bit rate of each frame image in the video.
Based on this, an embodiment of the present application provides a video encoding scheme. The scheme classifies a target video according to the texture features and motion features of each frame image in the target video, so as to determine whether, when the image quality of each frame image reaches a preset image quality, there is a bit rate less than the initial bit rate of each frame image. If it is determined that such a bit rate exists, the target bit rate of each frame image can be obtained according to the texture features and motion features of each frame image, so that the target video is encoded according to the target bit rate of each frame image to obtain an encoded target video. It should be noted that the target bit rate of each frame image refers to the bit rate of each frame image when the image quality of each frame image reaches the preset image quality, and the target bit rate of each frame image is less than the initial bit rate of each frame image.

Since the texture features of each frame image can reflect the content complexity of each frame image, and the motion features can reflect the motion complexity of each frame image, the redundant information in each frame image can be determined through the texture features and motion features of each frame image. Therefore, this scheme first uses the texture features and motion features of each frame image to determine whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image, thereby determining whether the bit rate of each frame image can be further reduced. After determining that there is still room for reducing the initial bit rate of each frame image, the texture features and motion features of each frame image can further be used to obtain a target bit rate of each frame image that ensures the image quality reaches the preset image quality and is less than the initial bit rate of each frame image. It can thus be seen that, after determining that the bit rate of each frame image can be further reduced, this scheme further uses the texture features and motion features of each frame image to obtain a target bit rate less than the initial bit rate of each frame image, so as to encode the video, thereby reducing the amount of data of the encoded video while ensuring that the image quality of each frame image in the video reaches the preset image quality, which is beneficial to the storage and transmission of the video.
The initial bit rate of each frame image is obtained based on the preset image quality of the target video and the preset image resolution of the target video. The preset image quality refers to the image quality of the target video set in advance manually or by an electronic device, where the electronic device may refer to the terminal device or server described below. Optionally, the electronic device may analyze the transmission requirements and/or storage requirements of the target video and then set a reasonable preset image quality, or it may directly set a preset image quality that can satisfy the viewing requirements of the human eye. For example, if the transmission requirement of the target video is a transmission time of less than 5 seconds and the storage requirement is a storage space of less than 800 kB, the electronic device can analyze the original data amount of the target video together with these conditions to obtain a preset image quality that satisfies them.

In specific implementations, when the bit rate control mode is the Constant Rate Factor (CRF), a constant rate factor reflecting the image quality of the video, i.e., the CRF, can be set. The CRF ranges from 0 to 51, where 0 is the lossless mode, that is, the image quality of the encoded video is controlled to be the original image quality of the video; the larger the CRF value, the worse the image quality of the encoded video. Therefore, the preset image quality may specifically be the image quality indicated by a preset CRF value.
Meanwhile, the preset image resolution refers to the image resolution of the target video set in advance manually or by an electronic device. Optionally, the electronic device may analyze the display requirements of the target video to obtain the preset image resolution. For example, if the highest display resolution of the electronic device on which the target video is to be displayed is 720P, the display requirement of the target video is an image resolution less than or equal to 720P, so the electronic device may set the preset image resolutions to 720P, 540P, and 360P.

Specifically, the image resolution of a video refers to the number of pixels included in each frame image of the video. For example, the resolution of 720P is 1280*720 pixels (P stands for Progressive, i.e., progressive scanning, where the number before P indicates how many rows of pixels the region has vertically), the resolution of 1080P is 1920*1080 pixels, and the resolution of 2K is 2560*1440 pixels. In addition, the same video may have one or more preset image resolutions.

Furthermore, texture features refer to visual features reflecting homogeneity phenomena in an image; they embody the slowly changing or periodically changing surface structure and arrangement attributes of an object's surface. Specifically, texture features can be represented by the grayscale distribution of pixels and their surrounding spatial neighborhoods, and can be divided into local texture features and global texture features according to the grayscale range. Motion features include the motion trajectories and motion vectors of moving objects in each frame image.
Based on the above video encoding method, an embodiment of the present application provides a video encoding system; see FIG. 1. The video encoding system shown in FIG. 1 may include multiple terminal devices 101 and multiple servers 102, where a communication connection is established between any terminal device and any server. The terminal device 101 may include any one or more of a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart vehicle device, and a smart wearable device. Various applications (APPs) can run in the terminal device 101, such as a multimedia playback client, a social client, a browser client, an information stream client, an education client, and so on. The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
In one embodiment, the above video encoding method may be executed only by the terminal device 101 in the video encoding system shown in FIG. 1. The specific execution process is as follows: the terminal device 101 first performs feature extraction processing on each frame image in the target video to obtain the texture features and motion features of each frame image; meanwhile, the terminal device 101 determines the initial bit rate of each frame image based on the preset image quality and the preset image resolution of the target video. Then, the terminal device 101 classifies the target video according to the texture features and motion features of each frame image to obtain a classification result indicating whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image. If the classification result indicates that such a bit rate exists, the terminal device 101 determines the target bit rate of each frame image according to the texture features and motion features of each frame image. Afterwards, the terminal device 101 encodes the target video according to the target bit rate of each frame image to obtain the encoded target video. Optionally, the terminal device 101 may also transmit the encoded target video to the server 102, so that the server 102 can transmit the encoded target video to other terminal devices.

Optionally, the above video encoding method may also be executed only by the server 102 in the video encoding system shown in FIG. 1; for the specific execution process, reference may be made to the specific execution process of the terminal device 101 during video encoding described above, which is not repeated here. Optionally, the terminal device 101 may also transmit the encoded target video to the server 102; the server 102 may then decode the encoded target video to obtain a decoded video; finally, the server 102 may display the decoded video or transmit it to other terminal devices.
In another embodiment, the above video encoding method may run in a video encoding system that includes a terminal device and a server. Specifically, the video encoding method may be jointly completed by the terminal device 101 and the server 102 included in the video encoding system shown in FIG. 1. The specific execution process is as follows: the terminal device 101 collects the preset image quality and preset image resolution of the target video, and then uploads the collected preset image quality and preset image resolution to the server 102. The server 102 first performs feature extraction processing on each frame image in the target video to obtain the texture features and motion features of each frame image; meanwhile, the server 102 determines the initial bit rate of each frame image based on the preset image quality and preset image resolution of the target video. Then, the server 102 classifies the target video according to the texture features and motion features of each frame image to obtain a classification result indicating whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image. If the classification result indicates that such a bit rate exists, the server 102 determines the target bit rate of each frame image according to the texture features and motion features of each frame image. Finally, the server 102 encodes the target video according to the target bit rate of each frame image to obtain the encoded target video.

Optionally, the server 102 may also transmit the encoded target video to the terminal device 101, so that the terminal device 101 can decode the encoded target video and then display the decoded target video.
Optionally, the terminal device 101 may first display a video selection interface, where the video selection interface includes multiple video identifiers, and different video identifiers indicate different videos. Optionally, to help the user of the terminal device 101 determine which video to watch, the video selection interface may also include multiple video thumbnails, where one video thumbnail corresponds to one video identifier. Specifically, a video thumbnail is equivalent to the cover of a video; a video identifier may be a serial number, a video name, or any other character or character segment that can identify a video, which is not limited here.

Then, the terminal device 101 may respond to a selection operation on the multiple video identifiers in the video selection interface to obtain a target video identifier. The terminal device 101 collects the preset image quality and the preset image resolution. Specifically, the terminal device 101 may collect the preset image quality and preset image resolution as follows: the terminal device 101 displays a video format editing interface, which includes a quality setting component and a resolution setting component; the terminal device 101 responds to an editing operation on the quality setting component to obtain the preset image quality, and responds to an editing operation on the resolution setting component to obtain the preset resolution. The terminal device 101 generates a video acquisition request according to the target video identifier, the preset image quality, and the preset resolution. Finally, the terminal device 101 sends the video acquisition request to the server 102.

After receiving the video acquisition request, the server 102 may parse the video acquisition request to obtain the target video identifier, the preset image quality, and the preset resolution. The server 102 first searches a preset video library for the target video whose video identifier is the target video identifier; then, the server 102 performs feature extraction processing on each frame image in the found target video to obtain the texture features and motion features of each frame image. Meanwhile, the server 102 determines the initial bit rate of each frame image based on the parsed preset image quality and preset image resolution. Afterwards, the server 102 classifies the target video according to the texture features and motion features of each frame image to obtain a classification result indicating whether, when the image quality of each frame image reaches the preset image quality, there is a bit rate less than the initial bit rate of each frame image. If the classification result indicates that such a bit rate exists, the server 102 determines the target bit rate of each frame image according to the texture features and motion features of each frame image. Finally, the server 102 encodes the target video according to the target bit rate of each frame image to obtain the encoded target video.

After obtaining the encoded target video, the server 102 sends the encoded target video to the terminal device 101. After receiving the encoded target video, the terminal device 101 may decode the encoded target video and display the decoded target video.
Based on the above video encoding scheme and video encoding system, an embodiment of the present application provides a video encoding method. See FIG. 2, which is a schematic flowchart of a video encoding method provided by an embodiment of the present application. The video encoding method shown in FIG. 2 may be executed by a server or a terminal device, and may include steps S201-S205:

S201: Perform feature extraction processing on each frame image in a target video to obtain texture features of each frame image and motion features of each frame image.

In this embodiment of the present application, the target video refers to the video to be encoded. For example, when a video M is selected as the video to be encoded, so as to facilitate the storage and transmission of video M, video M is the target video. Meanwhile, since a video is composed of multiple frame images arranged in a certain image sequence, each frame image in the target video refers to each of the multiple frame images that make up the target video.
In addition, texture features refer to features reflecting the surface texture of objects in an image. Optionally, the texture features in a video may also be referred to as spatial complexity. Specifically, the texture features may include one or more features used to characterize image texture, such as grayscale change features and the gray-level co-occurrence matrix (GLCM). Specifically, the grayscale change feature refers to the normalized cross-correlation (NCC).

In specific implementations, when the texture feature is the gray-level co-occurrence matrix, the specific way of performing feature extraction processing on each frame image in the target video to obtain the texture features of each frame image may be: obtaining the grayscale value of each pixel in each frame image of the target video; and constructing the gray-level co-occurrence matrix of each frame image according to the grayscale values of the pixels in each frame image, where the gray-level co-occurrence matrix of each frame image is used to characterize the grayscale ratios between each pixel in each frame image and the pixels within a preset range of that pixel. It should be noted that the preset range may be set manually or by the system, which is not limited here. For example, the preset range may be 5 pixels (i.e., the pixels within the preset range of a certain pixel are those within 5 pixels of it), 10 pixels (i.e., the pixels within 10 pixels of it), and so on.

When the texture feature is the grayscale change feature, the specific way of performing feature extraction processing on each frame image in the target video to obtain the texture features of each frame image may be: obtaining the grayscale value of each pixel in each frame image of the target video; constructing the gray-level co-occurrence matrix of each frame image according to the grayscale values of the pixels in each frame image, where the gray-level co-occurrence matrix of each frame image is used to characterize the grayscale ratios between each pixel in each frame image and the pixels within a preset range of that pixel; and, for each grayscale ratio in the gray-level co-occurrence matrix of each frame image, counting the number of occurrences of that grayscale ratio in the matrix to obtain the grayscale change feature of each frame image.
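A simplified sketch of building such a co-occurrence count matrix, here restricted to horizontally adjacent pixels at a fixed offset (whereas the description allows any preset neighborhood range, and co-occurrence may be defined over ratios rather than raw pairs):

```python
def glcm_horizontal(gray, levels, offset=1):
    """Count horizontal co-occurrences of gray levels at the given offset;
    a simplified stand-in for the per-frame gray-level co-occurrence matrix."""
    m = [[0] * levels for _ in range(levels)]
    for row in gray:                      # gray: 2-D list of quantized gray levels
        for a, b in zip(row, row[offset:]):
            m[a][b] += 1                  # count the (a, b) gray-level pair
    return m
```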
Optionally, the texture features may also include one or more features used to characterize image texture, such as grayscale change features, the gray-level co-occurrence matrix, Tamura texture features, and wavelet transforms, which are not limited here. The Tamura texture features refer to the texture feature representation proposed by Tamura et al.; the six components of the Tamura texture features correspond to six psychological attributes of texture, namely coarseness, contrast, directionality, linelikeness, regularity, and roughness. It should be noted that the technical means of extracting the Tamura texture features and wavelet transforms of each frame image in a video are those customarily used by persons skilled in the art and are not described here.
In addition, motion features refer to features reflecting the degree of motion of the image content of each frame image in a video. Optionally, motion features may also be referred to as temporal complexity. Specifically, the motion features may include the weighted peak signal-to-noise ratio (XPSNR) and the displacement between each pixel block in each frame image and the corresponding pixel block in the target image of each frame image.

It should be noted that the target image of each frame image refers to an image whose image sequence number in the target video differs from the image sequence number of that frame image in the target video by no more than a preset threshold. Specifically, the preset threshold may be set manually or by the system, which is not limited here. For example, the preset threshold may be 10, 6, 2, and so on. When the preset threshold is 2 and the image sequence number of image A in the target video is 10, the target images of image A include the images whose sequence numbers in the target video are 8, 9, 11, and 12.
In specific implementations, the displacement between each pixel block in each frame image and the corresponding pixel block in the target image of each frame image is motion estimation. When the motion feature is this displacement, the specific way of performing feature extraction processing on each frame image in the target video to obtain the motion features of each frame image may be: for each pixel block in each frame image, searching the target image of that frame image for a target pixel block matching the pixel block (i.e., the corresponding pixel block in the target image), where any pixel block includes at least one pixel, and a pixel block may also be referred to as a macroblock of the image; and then, based on the position information of each pixel block in its frame image and the position information of the corresponding target pixel block in the target image, obtaining the displacement between each pixel block in each frame image and the corresponding pixel block in the target image.
可选地，上述针对各帧图像中的每一个像素块，在各帧图像的目标图像中搜索与所述像素块匹配的目标像素块的具体方式可以是：针对各帧图像中的每一个像素块，确定像素块在各帧图像的目标图像中的至少一个重叠像素块；基于像素块中各个像素点的灰度值，以及各个重叠像素块中各个像素点的灰度值，得到像素块与各个重叠像素块的像素块相似度；将像素块相似度最大的重叠像素块确定为目标像素块。
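上述基于灰度值计算像素块相似度并搜索目标像素块的过程，可用如下示意代码说明。其中采用穷举搜索，并以灰度差平方和(SSD)作为相似度度量(SSD越小相似度越大)，搜索方式与度量方式均为示例性假设。

```python
import numpy as np

def match_block(block, target):
    """在目标图像target中穷举搜索与block最相似的同尺寸像素块，
    以灰度差平方和（SSD）衡量相似度，返回最相似块的左上角位置。"""
    bh, bw = block.shape
    th, tw = target.shape
    best_ssd, best_pos = None, None
    for y in range(th - bh + 1):
        for x in range(tw - bw + 1):
            ssd = int(np.sum((target[y:y+bh, x:x+bw] - block) ** 2))
            if best_ssd is None or ssd < best_ssd:
                best_ssd, best_pos = ssd, (y, x)
    return best_pos

cur = np.zeros((4, 4), dtype=np.int64)
cur[1:3, 1:3] = 255                      # 当前帧中位于(1,1)的像素块
ref = np.zeros((4, 4), dtype=np.int64)
ref[2:4, 2:4] = 255                      # 目标帧中该块整体右下移一格
pos = match_block(cur[1:3, 1:3], ref)
motion = (pos[0] - 1, pos[1] - 1)        # 位移 = 目标位置 - 原位置
```

此例中像素块在目标图像中的匹配位置为(2,2)，对应位移为(1,1)，即上文所述的运动估计结果。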
另外,加权峰值信噪比通常用于反映原始图像与被压缩后的原始图像之间的误差情况。提取加权峰值信噪比的技术手段为本领域技术人员所惯用的技术手段,在此不赘述。
S202,基于目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定各帧图像的初始码率。
在本申请实施例中,预设图像质量指的是人为或电子设备预先设定的图像质量。其中,图像质量可以是图像的失真情况,也可以是人眼感知到的图像的清晰情况等。也就是说,图像质量有多种评价标准,故而在此不赘述。此外,视频在采集、压缩、传输和存储等过程中会发生各种各样的畸变或失真,任何的畸变或失真都可能导致人眼视觉感知图像质量的下降,因此可以预先设定目标视频中的图像质量至少需要达到的图像质量,从而保证目标视频整体的图像质量。需要说明的是,预设图像质量可以是一个具体的数值,也可以是一个数值范围,在此不限定。
另外,预设图像分辨率指的是通过人为或者电子设备预先设定的目标视频的图像分辨率。视频的图像分辨率指的是视频中各帧图像包括的像素点的数量。本申请实施例中,由于同一个视频在终端设备上显示时通常可以显示多种分辨率,以方便终端设备的使用对象根据当前网络情况,选择最合适的分辨率,因此,本申请实施例中的预设图像分辨率可以为一个或多个。
具体实现中，当码率控制方式是固定码率因子(Constant Rate Factor,CRF)时，预设图像质量指的是人为或者系统预先设定的用于反映视频中图像质量的固定码率因子，即CRF。预设图像分辨率可以是人为或者系统根据视频的播放需求，设定的多个图像分辨率。举例来说，可以是设定CRF=23，预设图像分辨率包括360P、720P、1080P。
可选地,基于目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定各帧图像的初始码率的具体方式可以是:根据目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定目标视频中各帧图像的固定量化值(即Constant Quantizer,QP);基于各帧图像的固定量化值,以及各帧图像的数据量,得到各帧图像的初始码率。其中,确定目标视频中各帧图像的固定量化值为本领域技术人员所惯用的技术手段,在此不赘述。
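上述"基于各帧图像的固定量化值下的数据量得到初始码率"的折算关系，可用如下示意代码说明。其中帧数据量、帧率等数值均为假设。

```python
def initial_bitrate_kbps(frame_bytes, fps):
    """由单帧在固定量化值下压缩后的数据量与帧率折算码率（kbps），仅为折算示例。"""
    return frame_bytes * 8 * fps / 1000

# 假设某帧在固定QP下压缩后约为4500字节，视频帧率为25fps
rate = initial_bitrate_kbps(4500, 25)
```

此例折算得到的初始码率为900kbps。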
S203,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果。
在本申请实施例中，分类结果用于指示在各帧图像的图像质量达到预设图像质量时，是否存在小于各帧图像的初始码率的码率。具体来说，分类结果指的是在保持各帧图像的图像质量达到最低的预设图像质量的情况下，各帧图像的码率在各帧图像的初始码率的基础上是否还可以更小。举例来说，分类结果可以是1或者0，同时预先设定1指示在各帧图像的图像质量达到预设图像质量时，存在小于各帧图像的初始码率的码率；以及0指示在各帧图像的图像质量达到预设图像质量时，不存在小于各帧图像的初始码率的码率。
由于各帧图像的纹理特征可以反映各帧图像的内容复杂度(也可称为空域复杂度),以及运动特征可以反映各帧图像的运动复杂度(也可称为时域复杂度),而对视频进行编码压缩,就是为了去除视频中空间、时间维度的冗余。因此,可以通过各帧图像的纹理特征和运动特征确定各帧图像中的冗余信息,有利于后续确定各帧图像的码率是否还有降低的空间。
具体实现中，请参见附图3，示出了一种图像质量和码率的关系示意图。图3中的横坐标为码率，单位为kbps；纵坐标为用于评价视频中图像质量的视频质量多方法评价融合参数(Video Multimethod Assessment Fusion,VMAF)。其中，VMAF越高说明编码后的视频的图像质量越好，码率越低则说明编码后的视频越便于传输和存储。此外，可以设定视频M的预设图像质量为VMAF为94时指示的图像质量，视频M的预设图像分辨率为540P、720P和1080P。其中，图3中浅灰色曲线表示图像分辨率为540P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系；深灰色曲线表示图像分辨率为720P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系；黑色曲线表示图像分辨率为1080P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系。
首先，可以采用CRF的码率控制方式，根据视频M的平均码率和平均图像质量，确定出如图3所示的原码点，在VMAF为94时，原码点的码率(即上述的初始码率)为900kbps，图像分辨率为720P。
但如图3所示，在VMAF为94时，黑色曲线对应的码率比原码点的码率更小，因此实际上在VMAF为94时，原码点的码率并不是最小的码率，也就是说原码点并非最优码点，而最优码点的码率应该为750kbps，图像分辨率为1080P。
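上述跨分辨率寻找最优码点的判断逻辑可以概括为：在每条码率-质量曲线上找到VMAF达到预设图像质量的最小码率，再在各分辨率之间取码率最小者。下面给出一个示意实现，其中曲线采样点为与图3示例同量级的假设数据。

```python
def best_point(curves, target_vmaf):
    """curves: {分辨率: [(码率kbps, VMAF), ...]}。
    先在每条码率-质量曲线上找到VMAF达标的最小码率，
    再在各分辨率之间取码率最小者作为最优码点。"""
    best = None
    for res, points in curves.items():
        reachable = [rate for rate, vmaf in points if vmaf >= target_vmaf]
        if reachable:
            rate = min(reachable)
            if best is None or rate < best[0]:
                best = (rate, res)
    return best

# 假设的三条曲线采样点，量级与图3示例一致
curves = {
    "540P": [(600, 88), (900, 92)],
    "720P": [(900, 94), (1200, 96)],
    "1080P": [(750, 94), (1500, 97)],
}
opt = best_point(curves, 94)
```

在该假设数据下，最优码点为1080P、750kbps，而非720P、900kbps的原码点。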
同时，请参见附图4，示出了另一种图像质量和码率的关系示意图。图4中的横坐标为码率，单位为kbps；纵坐标为VMAF。可以设定视频M的预设图像质量为VMAF为73时指示的图像质量，视频M的预设图像分辨率为540P、720P和1080P。其中，图4中浅灰色曲线表示图像分辨率为540P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系；深灰色曲线表示图像分辨率为720P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系；黑色曲线表示图像分辨率为1080P时，实际测量出视频M中某帧图像的码率与图像质量的对应关系。
首先，可以采用CRF的码率控制方式，根据视频M的平均码率和平均图像质量，确定出如图4所示的原码点，在VMAF为73时，原码点的码率(即上述的初始码率)为2600kbps，图像分辨率为720P。然后，如图4所示，在VMAF为73时，没有曲线对应的码率比2600kbps更小。因此在VMAF为73时，原码点的码率就是最小的码率，原码点就是最优码点。
由此可见,通过码率控制方式确定出的原码点可能是最优码点,也可能不是最优码点;而分类结果所指示的在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率,也就相当于用于指示原码点是否是最优码点。
那么,可选地,若分类结果指示在各帧图像的图像质量达到预设图像质量时,不存在小于各帧图像的初始码率的码率,则根据各帧图像的初始码率,对目标视频进行编码处理,得到编码后的目标视频。
可选地,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果的具体方式可以是:调用分类模型,对各帧图像的纹理特征和各帧图像的运动特征进行分类处理,得到分类结果。具体来说,分类模型的训练过程是一个有监督训练的过程,因此可以通过多个训练视频,以及各个训练视频的分类标签,对初始分类模型进行训练,从而得到分类模型。其中,分类标签用于指示在各个训练视频中各帧图像的图像质量达到预设图像质量时,是否存在小于各个训练视频中各帧图像的初始码率的码率。各个训练视频中各帧图像的初始码率是基于各个训练视频的预设图像质量,以及各个训练视频的预设图像分辨率得到的。
具体实现中，可以采用自适应提升算法(Adaptive Boosting,AdaBoost)、梯度提升决策树(Gradient Boosting Decision Tree,GBDT)、XGBoost(eXtreme Gradient Boosting，一种基于决策树并使用梯度提升框架的集成机器学习算法)等多种boosting算法(一类可以用来减小监督式学习中偏差的机器学习算法)中的任一种算法对神经网络模型(即上述的初始分类模型)进行有监督训练，得到分类模型。其中，神经网络模型具体可以是循环神经网络模型(Recurrent Neural Network,RNN)、卷积神经网络模型(Convolutional Neural Network,CNN)、决策树模型等具有分类能力的多种模型中的一种。
可选地,由于步骤S201中提及各帧图像的纹理特征可以有一个或多个,以及各帧图像的运动特征可以有一个或多个;因此,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果的具体方式还可以是:将各帧图像的各个纹理特征进行拼接,得到各帧图像的目标纹理特征;以及将各帧图像的各个运动特征进行拼接,得到各帧图像的目标运动特征;最后,根据各帧图像的目标纹理特征,以及各帧图像的目标运动特征,对目标视频进行分类处理,得到分类结果。
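上述将多个纹理特征与多个运动特征分别拼接为目标特征向量的步骤，可以概括为如下示意代码。其中各特征的数值均为假设，拼接得到的目标纹理特征与目标运动特征即可作为分类模型的输入。

```python
import numpy as np

def concat_features(texture_feats, motion_feats):
    """将各帧图像的各个纹理特征、各个运动特征分别拼接成目标特征向量。"""
    return np.concatenate(texture_feats), np.concatenate(motion_feats)

texture = [np.array([0.1, 0.2]), np.array([0.3])]  # 假设：两种纹理特征
motion = [np.array([1.5]), np.array([0.7, 0.9])]   # 假设：两种运动特征
target_texture, target_motion = concat_features(texture, motion)
```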
S204,若分类结果指示在各帧图像的图像质量达到预设图像质量时,存在小于各帧图像的初始码率的码率,则根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率。
在本申请实施例中,各帧图像的目标码率指的是在所述各帧图像的图像质量达到预设图像质量时,各帧图像的码率;同时,各帧图像的目标码率小于各帧图像的初始码率。具体来说,参见图3和图4中的示例,原码点的码率指的是图像的初始码率,而最优码点的码率指的是在图像质量达到预设图像质量时图像的码率,且最优码点的码率小于原码点的码率。因此,在具体实现中,各帧图像的目标码率也就相当于是最优码点的码率;也就是说,确定出的各帧图像的目标码率,也就相当于确定出了各帧图像的最优码点的码率。
此外,根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率的具体方式可以是:调用回归预测模型,对各帧图像的纹理特征和各帧图像的运动特征进行回归预测处理,得到各帧图像的目标码率。具体来说,回归预测模型的训练过程是一个有监督训练的过程,因此可以通过多个训练视频,以及各个训练视频中各帧图像的训练码率,对初始回归预测模型进行训练,从而得到回归预测模型。其中,训练码率指的是在各个训练视频的各帧图像的图像质量达到训练图像质量时,各个训练视频的各帧图像的码率;训练码率小于各个训练视频的各帧图像的初始码率。训练图像质量可以参见预设图像质量的具体含义,在此不赘述。
具体实现中，可以采用自适应提升算法(Adaptive Boosting,AdaBoost)、梯度提升决策树(Gradient Boosting Decision Tree,GBDT)、XGBoost(eXtreme Gradient Boosting，一种基于决策树并使用梯度提升框架的集成机器学习算法)等多种boosting算法(一类可以用来减小监督式学习中偏差的机器学习算法)中的任一种算法对神经网络模型(即上述的初始回归预测模型)进行有监督训练，得到回归预测模型。其中，神经网络模型具体可以是深度神经网络模型(Deep Neural Networks,DNN)、卷积神经网络模型(Convolutional Neural Network,CNN)、决策树模型等具有回归预测能力的多种模型中的一种。
可选地，由于步骤S201中提及各帧图像的纹理特征可以有一个或多个，以及各帧图像的运动特征可以有一个或多个；因此，根据各帧图像的纹理特征和各帧图像的运动特征，确定各帧图像的目标码率的具体方式还可以是：将各帧图像的各个纹理特征进行拼接，得到各帧图像的目标纹理特征；以及将各帧图像的各个运动特征进行拼接，得到各帧图像的目标运动特征；最后，根据各帧图像的目标纹理特征，以及各帧图像的目标运动特征，对目标视频进行回归预测处理，得到各帧图像的目标码率。
S205,根据各帧图像的目标码率,对目标视频进行编码处理,得到编码后的目标视频。
在本申请实施例中,所述根据各帧图像的目标码率,对目标视频进行编码处理,得到编码后的目标视频的具体方式可以是:根据各帧图像的目标码率,以及各帧图像的数据量,对目标视频中的相应图像进行编码处理,得到编码后的目标视频。
同理，上述根据各帧图像的初始码率，对目标视频进行编码处理，得到编码后的目标视频的具体方式可以是：根据各帧图像的初始码率，以及各帧图像的数据量，对目标视频中的相应图像进行编码处理，得到编码后的目标视频。
本申请实施例中,会先通过各帧图像的纹理特征和运动特征,判断在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率的方式,判断出各帧图像的码率是否还可以进一步降低。在确定了各帧图像的初始码率还有降低的空间之后,可以进一步通过各帧图像的纹理特征和运动特征,得到可以保证图像质量达到预设图像质量且小于各帧图像的初始码率的各帧图像的目标码率。也就是说,本申请实施例在确定出各帧图像的码率还可以降低之后,会进一步通过各帧图像的纹理特征和运动特征,得到小于各帧图像的初始码率的目标码率,以便对视频进行编码,从而在保证视频中各帧图像的图像质量达到预设图像质量的情况下,减小编码后的视频的数据量,有利于视频的存储和传输。
基于上述视频编码方案以及视频编码系统,本申请实施例提供了另一种视频编码方法。参见图5,为本申请实施例提供的另一种视频编码方法的流程示意图。图5所示的视频编码方法可由图1所示的服务器或者终端设备执行。图5所示的视频编码方法可包括如下步骤:
S501,对目标视频中的各帧图像进行特征提取处理,得到各帧图像的纹理特征,以及各帧图像的运动特征。
S502,基于目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定各帧图像的初始码率。
其中,步骤S501-S502的具体实施方式可参见步骤S201-S202的具体实施方式,在此不赘述。
S503,调用分类模型中的各个决策分类器,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行决策处理,得到各个决策分类器的决策结果。
在本申请实施例中,各个决策分类器的决策结果用于指示在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率。分类模型包括多个决策分类器。具体实现中,分类模型中的每一个决策分类器都是一个小型的用于分类的神经网络模型。
可选地,分类模型的训练过程具体可以是:首先,获取多个训练视频集;其中,任一训练视频集包括至少一个训练视频,至少一个训练视频中各个训练视频的目标决策结果,至少一个训练视频中各个训练视频的各帧图像的纹理特征和运动特征。然后,调用初始分类模型中的多个决策分类器,对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到多个训练视频集中各个训练视频的训练决策结果。最后,根据多个训练视频集中各个训练视频的训练决策结果,以及多个训练视频集中各个训练视频的目标决策结果,对初始分类模型进行训练,得到分类模型。
具体实现中，调用初始分类模型中的多个决策分类器，对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理，得到多个训练视频集中各个训练视频的训练决策结果的具体过程可以是：遍历初始分类模型中的各个决策分类器，基于上一次遍历的决策分类器处理得到的上一次遍历的决策分类器对应的训练视频集中各个训练视频的训练决策结果，从多个训练视频集中确定当前遍历的目标决策分类器对应的目标训练视频集。然后，调用目标决策分类器，根据目标训练视频集中各个训练视频的各帧图像的纹理特征和运动特征，对目标训练视频集中的各个训练视频进行决策处理，得到目标训练视频集中的各个训练视频的训练决策结果。
具体来说,基于上一次遍历的决策分类器处理得到的上一次遍历的决策分类器对应的训练视频集中各个训练视频的训练决策结果,从多个训练视频集中确定当前遍历的目标决策分类器对应的目标训练视频集的方式可以是:当目标决策分类器是第一个遍历的决策分类器时,先从多个训练视频集中选取任一训练视频集作为第一个遍历的决策分类器的目标训练视频集;当目标决策分类器不是第一个遍历的决策分类器时,基于上一次遍历的决策分类器的目标训练视频集中训练决策结果与目标决策结果不相同的训练视频,以及多个训练视频集的剩余训练视频集中选取的一个训练视频集,得到目标决策分类器的目标训练视频集。
此外,还可以根据目标训练视频集中的各个训练视频的训练决策结果和目标决策结果,确定目标决策分类器的决策权重,以得到初始分类模型中的各个决策分类器的决策权重。
具体来说,在开始训练之前,初始分类模型中的各个决策分类器的决策权重是均匀分布的。举例来说,如果初始分类模型包括10个决策分类器,那么初始分类模型中的各个决策分类器的决策权重都是0.1。然后,会通过遍历过的各个决策分类器的目标训练视频集中的各个训练视频的训练决策结果是否与其目标决策结果相同的方式,确定遍历过的各个决策分类器的分类正确率;最后,通过遍历过的各个决策分类器的分类正确率,得到遍历过的各个决策分类器的决策权重。
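上述按分类正确率为各个决策分类器分配决策权重、并使权重之和为1的做法，可用如下示意代码说明。需要说明的是，按正确率归一化只是一种示例性加权方式；经典AdaBoost中通常使用0.5·ln((1-ε)/ε)形式的权重，此处为便于演示作了简化假设。

```python
def decision_weights(accuracies):
    """按各决策分类器的分类正确率分配决策权重，并归一化使权重之和为1。"""
    total = sum(accuracies)
    return [acc / total for acc in accuracies]

# 假设三个决策分类器在各自目标训练视频集上的分类正确率
weights = decision_weights([0.9, 0.6, 0.5])
```

正确率越高的决策分类器获得越大的决策权重。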
S504,根据各个决策分类器的决策权重,以及各个决策分类器的决策结果,得到分类结果。
在本申请实施例中,分类模型中各个决策分类器的决策权重都是一个小于1的数值,分类模型中所有决策分类器的决策权重相加为1。由步骤S503可知,分类模型中各个决策分类器的决策权重是根据各个决策分类器在训练时的目标训练视频集中的各个训练视频的训练决策结果和目标决策结果得到的。在此不赘述。
此外,根据各个决策分类器的决策权重,以及各个决策分类器的决策结果,得到分类结果的具体方式可以是:将各个决策分类器的决策权重与相应决策分类器的决策结果相乘,得到各个决策分类器的加权决策结果;最后将分类模型中所有决策分类器的加权决策结果相加,得到分类结果。
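上述将各决策结果与相应决策权重相乘、再求和得到最终分类结果的过程，可用如下示意代码说明。其中以0.5作为判定阈值仅为假设。

```python
def fuse_decisions(decisions, weights, threshold=0.5):
    """decisions为各决策分类器输出的0/1决策结果；
    将各决策结果与相应决策权重相乘后求和，再与阈值比较得到最终分类结果。"""
    score = sum(d * w for d, w in zip(decisions, weights))
    return 1 if score >= threshold else 0

result = fuse_decisions([1, 0, 1], [0.5, 0.2, 0.3])  # 加权和为0.8
```

此例中加权决策结果之和为0.8，超过阈值，最终分类结果为1。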
S505,若分类结果指示在各帧图像的图像质量达到预设图像质量时,存在小于各帧图像的初始码率的码率,则根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率。
本申请实施例中,根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率的具体方式可以是:调用回归预测模型中的各个回归预测器,对各帧图像的纹理特征和各帧图像的运动特征进行预测处理,得到各个回归预测器的预测码率;基于各个回归预测器的回归权重,以及各个回归预测器的预测码率,得到各帧图像的目标码率。具体来说,各个回归预测器的回归权重可以都相同,举例来说,各个回归预测器的回归权重可以都为1。
可选地，回归预测模型的训练过程具体可以是：首先，获取训练视频，训练视频的各帧图像的训练码率，以及训练视频的各帧图像的纹理特征和运动特征；其中，训练视频的各帧图像的训练码率指的是在训练视频的各帧图像的图像质量达到训练图像质量时，训练视频的各帧图像的码率；训练视频的各帧图像的训练码率小于训练视频的各帧图像的初始码率。然后，调用初始回归预测模型中的各个初始回归预测器，对训练视频的各帧图像的纹理特征和运动特征进行预测处理，得到各个初始回归预测器的预测码率。最后，根据各个初始回归预测器的预测码率，以及训练视频的各帧图像的训练码率，对初始回归预测模型进行训练，得到回归预测模型。
具体来说,调用初始回归预测模型中的各个初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到各个初始回归预测器的预测码率的过程可以是:遍历初始回归预测模型中的各个初始回归预测器,并调用当前遍历的初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到当前遍历的初始回归预测器的预测码率。
那么,根据各个初始回归预测器的预测码率,以及训练视频的各帧图像的训练码率,对初始回归预测模型进行训练,得到回归预测模型的过程可以是:基于训练视频的各帧图像的训练码率,以及遍历过的初始回归预测器的预测码率,确定当前遍历的初始回归预测器的残差码率;根据当前遍历的初始回归预测器的残差码率,以及遍历过的初始回归预测器的残差码率,对初始回归预测模型进行训练,得到回归预测模型。
也就是说,初始回归预测模型中的各个初始回归预测器都会对训练视频进行处理,预测得到一个预测码率;然后,训练视频的训练码率减去遍历过的初始回归预测器的预测码率以及当前遍历的初始回归预测器的预测码率,也就得到了当前遍历的初始回归预测器的残差码率。然后,通过按照减小每个初始回归预测器的残差码率的方向,对每个初始回归预测器进行训练的方式,对初始回归预测模型进行训练,从而得到回归预测模型。
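上述"每个回归预测器学习此前所有预测之后剩余残差"的训练思路，可用如下示意代码说明。其中learn函数代表"用当前残差训练一个预测器并返回其预测值"，此处以"每个预测器逼近当前残差的一半"作理想化假设。

```python
def fit_residual_boosting(train_rate, n_predictors, learn):
    """残差式训练示意：第i个预测器学习的是训练码率减去前序预测之和的残差。"""
    predictions, residual = [], train_rate
    for _ in range(n_predictors):
        pred = learn(residual)     # 当前遍历的初始回归预测器的预测码率
        predictions.append(pred)
        residual -= pred           # 当前遍历的初始回归预测器的残差码率
    return predictions, residual

# 理想化假设：每个预测器都能逼近当前残差的一半
predictions, residual = fit_residual_boosting(1200.0, 3, lambda r: 0.5 * r)
final_rate = sum(predictions)  # 各预测器预测码率之和即为最终预测
```

随着预测器数量增加，残差码率逐步减小，各预测器预测码率之和逐步逼近训练码率。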
可选地,除了可以根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率之外,还可以根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率和目标图像分辨率。其中,视频中各帧图像的目标图像分辨率可以是相同的。目标图像分辨率指的是在各帧图像的图像质量达到所述预设图像质量时,各帧图像的目标码率所对应的图像分辨率。
具体来说,根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率和目标图像分辨率的具体方式可以是:调用回归预测模型,对各帧图像的纹理特征和各帧图像的运动特征进行回归预测处理,得到各帧图像的目标码率和目标图像分辨率。
需要说明的是，由于此时回归预测模型可以预测各帧图像的目标码率和目标图像分辨率，因此可以通过多个训练视频，以及各个训练视频的训练码率和训练图像分辨率，对初始回归预测模型进行训练，从而得到回归预测模型。其中，训练图像分辨率指的是在各帧图像的图像质量达到训练图像质量时，各帧图像的训练码率所对应的图像分辨率。
S506,根据各帧图像的目标码率,对目标视频进行编码处理,得到编码后的目标视频。
其中,步骤S506的具体实施方式可参见步骤S205的具体实施方式,在此不赘述。
在实际应用中，请参见附图6，示出了一种视频编码过程的示意图。输入目标视频，并对目标视频进行解码，得到多帧预设视频格式的图像。其中，对目标视频进行解码的具体过程可以是：将目标视频的原始视频格式转换为预设视频格式。举例来说，预设视频格式可以是YUV(一种颜色编码方法，常用于各个视频处理组件中，常见的YUV格式有YUV420p、YUV420sp、NV21等)、RGB(一种颜色标准)、Raw(一种未经处理、也未经压缩的视频格式)等，在此不限定。
在得到多帧预设视频格式的图像之后,可以提取各帧图像的纹理特征和运动特征,并将各帧图像的纹理特征和运动特征输入至分类模型,得到分类结果。若分类结果指示原码点为最优码点(即分类结果指示在各帧图像的图像质量达到预设图像质量时,不存在小于各帧图像的初始码率的码率),则根据各帧图像的初始码率,对目标视频进行编码处理,得到编码后的目标视频。
若分类结果指示原码点不为最优码点(即分类结果指示在各帧图像的图像质量达到预设图像质量时,存在小于各帧图像的初始码率的码率),则将各帧图像的纹理特征和运动特征输入至回归预测模型,得到各帧图像的目标码率。最后根据各帧图像的目标码率,对目标视频进行编码处理,得到编码后的目标视频。
本申请实施例中,会先通过各帧图像的纹理特征和运动特征,判断在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率的方式,判断出各帧图像的码率是否还可以进一步降低。在确定了各帧图像的初始码率还有降低的空间之后,可以进一步通过各帧图像的纹理特征和运动特征,得到可以保证图像质量达到预设图像质量且小于各帧图像的初始码率的各帧图像的目标码率。也就是说,本申请实施例在确定出各帧图像的码率还可以降低之后,会进一步通过各帧图像的纹理特征和运动特征,得到小于各帧图像的初始码率的目标码率,以便对视频进行编码,从而在保证视频中各帧图像的图像质量达到预设图像质量的情况下,减小编码后的视频的数据量,有利于视频的存储和传输。
此外,本申请实施例会根据分类模型中各个决策分类器对各帧图像的纹理特征和运动特征进行处理得到的多个决策结果,得到最终的分类结果,有利于提高分类结果的准确性。同时,本申请实施例还会根据回归预测模型中各个回归预测器对各帧图像的纹理特征和运动特征进行处理得到的各帧图像的多个预测码率,得到各帧图像的目标码率,有利于提高预测得到的码率的准确性。
基于上述视频编码方法的相关描述，本申请还公开了一种视频编码装置。该视频编码装置可以是运行于上述所提及的计算机设备中的一个计算机程序(包括程序代码)。该视频编码装置可以执行如图2和图5所示的视频编码方法，请参见图7，该视频编码装置至少可以包括：处理单元701和编码单元702。
处理单元701,用于对目标视频中的各帧图像进行特征提取处理,得到各帧图像的纹理特征,以及各帧图像的运动特征;
处理单元701,还用于基于目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定各帧图像的初始码率;
处理单元701,还用于根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果;其中,分类结果用于指示在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率;
处理单元701,还用于若分类结果指示在各帧图像的图像质量达到预设图像质量时,存在小于各帧图像的初始码率的码率,则根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率;其中,各帧图像的目标码率指的是在各帧图像的图像质量达到预设图像质量时,各帧图像的码率;各帧图像的目标码率小于各帧图像的初始码率。
编码单元702，用于根据各帧图像的目标码率，对目标视频进行编码处理，得到编码后的目标视频。
在一种实施方式中,处理单元701在根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果时,还可以用于执行:
调用分类模型中的各个决策分类器,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行决策处理,得到各个决策分类器的决策结果;其中,决策结果用于指示在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率;
根据各个决策分类器的决策权重,以及各个决策分类器的决策结果,得到分类结果。
在又一种实施方式中,处理单元701还可用于执行:
获取多个训练视频集;其中,任一训练视频集包括至少一个训练视频,至少一个训练视频中各个训练视频的目标决策结果,至少一个训练视频中各个训练视频的各帧图像的纹理特征和运动特征;
调用初始分类模型中的多个决策分类器,对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到多个训练视频集中各个训练视频的训练决策结果;
根据多个训练视频集中各个训练视频的训练决策结果,以及多个训练视频集中各个训练视频的目标决策结果,对初始分类模型进行训练,得到分类模型。
在又一种实施方式中,处理单元701在调用初始分类模型中的多个决策分类器,对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到多个训练视频集中各个训练视频的训练决策结果时,具体还可用于执行:
遍历初始分类模型中的各个决策分类器,基于上一次遍历的决策分类器处理得到的上一次遍历的决策分类器对应的训练视频集中各个训练视频的训练决策结果,从多个训练视频集中确定当前遍历的目标决策分类器对应的目标训练视频集;
调用目标决策分类器,根据目标训练视频集中各个训练视频的各帧图像的纹理特征和运动特征,对目标训练视频集中的各个训练视频进行决策处理,得到目标训练视频集中的各个训练视频的训练决策结果。
在一种实施方式中,处理单元701在根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率时,具体可以用于执行:
调用回归预测模型中的各个回归预测器,对各帧图像的纹理特征和各帧图像的运动特征进行预测处理,得到各个回归预测器的预测码率;
基于各个回归预测器的回归权重,以及各个回归预测器的预测码率,得到各帧图像的目标码率。
在一种实施方式中,处理单元701,还可用于执行:
获取训练视频,训练视频的各帧图像的训练码率,以及训练视频的各帧图像的纹理特征和运动特征;其中,训练视频的各帧图像的训练码率指的是在训练视频的各帧图像的图像质量达到训练图像质量时,训练视频的各帧图像的码率,训练视频的各帧图像的训练码率小于训练视频的各帧图像的初始码率;
调用初始回归预测模型中的各个初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到各个初始回归预测器的预测码率;
根据各个初始回归预测器的预测码率,以及训练视频的各帧图像的训练码率,对初始回归预测模型进行训练,得到回归预测模型。
在又一种实施方式中，处理单元701在调用初始回归预测模型中的各个初始回归预测器，对训练视频的各帧图像的纹理特征和运动特征进行预测处理，得到各个初始回归预测器的预测码率时，具体可用于执行：
遍历初始回归预测模型中的各个初始回归预测器,并调用当前遍历的初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到当前遍历的初始回归预测器的预测码率;
处理单元701在根据各个初始回归预测器的预测码率,以及训练视频的各帧图像的训练码率,对初始回归预测模型进行训练,得到回归预测模型时,具体可用于执行:
基于训练视频的各帧图像的训练码率,以及遍历过的初始回归预测器的预测码率,确定当前遍历的初始回归预测器的残差码率;
根据当前遍历的初始回归预测器的残差码率,以及遍历过的初始回归预测器的残差码率,对初始回归预测模型进行训练,得到回归预测模型。
在又一种实施方式中,各帧图像的纹理特征包括如下至少一种:灰度变化特征,灰度共生矩阵;各帧图像的运动特征包括如下至少一种:加权峰值信噪比,各帧图像中的各个像素块与各帧图像的目标图像中相应像素块之间的位移;其中,各帧图像的目标图像指的是在目标视频中的图像序列数与各帧图像在目标视频中的图像序列数之间的差值小于或等于预设阈值的图像。
处理单元701在根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果时,具体可以用于执行:
将各帧图像的各个纹理特征进行拼接,得到各帧图像的目标纹理特征;
将各帧图像的各个运动特征进行拼接,得到各帧图像的目标运动特征;
根据各帧图像的目标纹理特征,以及各帧图像的目标运动特征,对目标视频进行分类处理,得到分类结果。
根据本申请的一个实施例,图2和图5所示的方法所涉及各个步骤可以是由图7所示的视频编码装置中的各个单元来执行的。例如,图2所示的步骤S201至步骤S204可由图7所示的视频编码装置中的处理单元701来执行;步骤S205可由图7所示的视频编码装置中的编码单元702来执行。再如,图5所示的步骤S501至步骤S505可由图7所示的视频编码装置中的处理单元701来执行;步骤S506可由图7所示的视频编码装置中的编码单元702来执行。
根据本申请的另一个实施例，图7所示的视频编码装置中的各个单元是基于逻辑功能划分的，上述各个单元可以分别或全部合并为一个或若干个另外的单元来构成，或者其中的某个(些)单元还可以再拆分为功能上更小的多个单元来构成，这可以实现同样的操作，而不影响本申请的实施例的技术效果的实现。在本申请的其它实施例中，上述视频编码装置也可以包括其它单元，在实际应用中，这些功能也可以由其它单元协助实现，并且可以由多个单元协作实现。
根据本申请的另一个实施例,可以通过在包括中央处理单元(CPU)、随机存取存储介质(RAM)、只读存储介质(ROM)等处理元件和存储元件的例如计算机设备的通用计算设备上,运行能够执行如图2或图5所示的方法所涉及的各步骤的计算机程序(包括程序代码),来构造如图7所示的视频编码装置,以及来实现本申请实施例的视频编码方法。计算机程序可以记载于例如计算机存储介质上,并通过计算机存储介质装载于上述计算机设备中,并在其中运行。
本申请实施例中，会先通过各帧图像的纹理特征和运动特征，判断在各帧图像的图像质量达到预设图像质量时，是否存在小于各帧图像的初始码率的码率的方式，判断出各帧图像的码率是否还可以进一步降低。在确定了各帧图像的初始码率还有降低的空间之后，可以进一步通过各帧图像的纹理特征和运动特征，得到可以保证图像质量达到预设图像质量且小于各帧图像的初始码率的各帧图像的目标码率。也就是说，本申请实施例在确定出各帧图像的码率还可以降低之后，会进一步通过各帧图像的纹理特征和运动特征，得到小于各帧图像的初始码率的目标码率，以便对视频进行编码，从而在保证视频中各帧图像的图像质量达到预设图像质量的情况下，减小编码后的视频的数据量，有利于视频的存储和传输。
基于上述的方法实施例以及装置实施例,本申请还提供了一种电子设备。参见图8,为本申请实施例提供的一种电子设备的结构示意图。图8所示的电子设备可至少包括处理器801、输入接口802、输出接口803以及计算机存储介质804。其中,处理器801、输入接口802、输出接口803以及计算机存储介质804可通过总线或其他方式连接。
计算机存储介质804可以存储在电子设备的存储器中,计算机存储介质804用于存储计算机程序,计算机程序包括程序指令,处理器801用于执行计算机存储介质804存储的程序指令。处理器801(或称CPU(Central Processing Unit,中央处理器))是电子设备的计算核心以及控制核心,其适于实现一条或多条指令,具体适于加载并执行一条或多条指令从而实现上述视频编码方法流程或相应功能。
本申请实施例还提供了一种计算机存储介质(Memory)，计算机存储介质是电子设备中的记忆设备，用于存放程序和数据。可以理解的是，此处的计算机存储介质既可以包括终端中的内置存储介质，当然也可以包括终端所支持的扩展存储介质。计算机存储介质提供存储空间，该存储空间存储了终端的操作系统。并且，在该存储空间中还存放了适于被处理器801加载并执行的一条或多条的指令，这些指令可以是一个或一个以上的计算机程序(包括程序代码)。需要说明的是，此处的计算机存储介质可以是高速随机存取存储器(random access memory,RAM)，也可以是非易失性存储器(non-volatile memory)，例如至少一个磁盘存储器；可选地，还可以是至少一个位于远离前述处理器的计算机存储介质。
在一个实施例中,可由处理器801加载并执行计算机存储介质中存放的一条或多条指令,以实现上述有关图2和图5的视频编码方法实施例中的方法的相应步骤,具体实现中,计算机存储介质中的一条或多条指令由处理器801加载并执行如下步骤:
处理器801对目标视频中的各帧图像进行特征提取处理,得到各帧图像的纹理特征,以及各帧图像的运动特征;
处理器801基于目标视频的预设图像质量,以及目标视频的预设图像分辨率,确定各帧图像的初始码率;
处理器801根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果;其中,分类结果用于指示在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率;
处理器801若分类结果指示在各帧图像的图像质量达到预设图像质量时,存在小于各帧图像的初始码率的码率,则根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率;其中,各帧图像的目标码率指的是在各帧图像的图像质量达到预设图像质量时,各帧图像的码率;各帧图像的目标码率小于各帧图像的初始码率;
处理器801根据各帧图像的目标码率,对目标视频进行编码处理,得到编码后的目标视频。
在一个实施例中,处理器801在根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果时,具体可用于执行:
调用分类模型中的各个决策分类器,根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行决策处理,得到各个决策分类器的决策结果;其中,决策结果用于指示在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率;
根据各个决策分类器的决策权重,以及各个决策分类器的决策结果,得到分类结果。
在一个实施例中,处理器801具体还可用于执行:
获取多个训练视频集;其中,任一训练视频集包括至少一个训练视频,至少一个训练视频中各个训练视频的目标决策结果,至少一个训练视频中各个训练视频的各帧图像的纹理特征和运动特征;
调用初始分类模型中的多个决策分类器,对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到多个训练视频集中各个训练视频的训练决策结果;
根据多个训练视频集中各个训练视频的训练决策结果,以及多个训练视频集中各个训练视频的目标决策结果,对初始分类模型进行训练,得到分类模型。
在一个实施例中,处理器801在调用初始分类模型中的多个决策分类器,对多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到多个训练视频集中各个训练视频的训练决策结果时,具体可以用于执行:
遍历初始分类模型中的各个决策分类器,基于上一次遍历的决策分类器处理得到的上一次遍历的决策分类器对应的训练视频集中各个训练视频的训练决策结果,从多个训练视频集中确定当前遍历的目标决策分类器对应的目标训练视频集;
调用目标决策分类器,根据目标训练视频集中各个训练视频的各帧图像的纹理特征和运动特征,对目标训练视频集中的各个训练视频进行决策处理,得到目标训练视频集中的各个训练视频的训练决策结果。
在一个实施例中,处理器801在根据各帧图像的纹理特征和各帧图像的运动特征,确定各帧图像的目标码率时,具体可用于执行:
调用回归预测模型中的各个回归预测器,对各帧图像的纹理特征和各帧图像的运动特征进行预测处理,得到各个回归预测器的预测码率;
基于各个回归预测器的回归权重,以及各个回归预测器的预测码率,得到各帧图像的目标码率。
在一个实施例中,处理器801还可用于执行:
获取训练视频,训练视频的各帧图像的训练码率,以及训练视频的各帧图像的纹理特征和运动特征;其中,训练视频的各帧图像的训练码率指的是在训练视频的各帧图像的图像质量达到训练图像质量时,训练视频的各帧图像的码率,训练视频的各帧图像的训练码率小于训练视频的各帧图像的初始码率;
调用初始回归预测模型中的各个初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到各个初始回归预测器的预测码率;
根据各个初始回归预测器的预测码率,以及训练视频的各帧图像的训练码率,对初始回归预测模型进行训练,得到回归预测模型。
在一个实施例中,处理器801在调用初始回归预测模型中的各个初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到各个初始回归预测器的预测码率时,具体可用于执行:
遍历初始回归预测模型中的各个初始回归预测器,并调用当前遍历的初始回归预测器,对训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到当前遍历的初始回归预测器的预测码率;
处理器801在根据各个初始回归预测器的预测码率,以及训练视频的各帧图像的训练码率,对初始回归预测模型进行训练,得到回归预测模型时,具体可用于执行:
基于训练视频的各帧图像的训练码率,以及遍历过的初始回归预测器的预测码率,确定当前遍历的初始回归预测器的残差码率;
根据当前遍历的初始回归预测器的残差码率,以及遍历过的初始回归预测器的残差码率,对初始回归预测模型进行训练,得到回归预测模型。
在一个实施例中,各帧图像的纹理特征包括如下至少一种:灰度变化特征,灰度共生矩阵;各帧图像的运动特征包括如下至少一种:加权峰值信噪比,各帧图像中的各个像素块与各帧图像的目标图像中相应像素块之间的位移;其中,各帧图像的目标图像指的是在目标视频中的图像序列数与各帧图像在目标视频中的图像序列数之间的差值小于或等于预设阈值的图像;
处理器801在根据各帧图像的纹理特征和各帧图像的运动特征,对目标视频进行分类处理,得到分类结果时,具体可用于执行:
将各帧图像的各个纹理特征进行拼接,得到各帧图像的目标纹理特征;
将各帧图像的各个运动特征进行拼接,得到各帧图像的目标运动特征;
根据各帧图像的目标纹理特征,以及各帧图像的目标运动特征,对目标视频进行分类处理,得到分类结果。
本申请实施例中,会先通过各帧图像的纹理特征和运动特征,判断在各帧图像的图像质量达到预设图像质量时,是否存在小于各帧图像的初始码率的码率的方式,判断出各帧图像的码率是否还可以进一步降低。在确定了各帧图像的初始码率还有降低的空间之后,可以进一步通过各帧图像的纹理特征和运动特征,得到可以保证图像质量达到预设图像质量且小于各帧图像的初始码率的各帧图像的目标码率。也就是说,本申请实施例在确定出各帧图像的码率还可以降低之后,会进一步通过各帧图像的纹理特征和运动特征,得到小于各帧图像的初始码率的目标码率,以便对视频进行编码,从而在保证视频中各帧图像的图像质量达到预设图像质量的情况下,减小编码后的视频的数据量,有利于视频的存储和传输。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该电子设备执行上述如图2和图5所示的方法实施例。其中,计算机可读存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。
本发明实施例中的视频编码方法主要以视频传输和存储领域进行举例说明,本发明实施例中的视频编码方法还可应用于视频显示等与视频编解码相关的场景,在此不限定。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。此外,本申请不对具体实施方式中的各个步骤的执行顺序作限定。

Claims (10)

  1. 一种视频编码方法,其特征在于,包括:
    对目标视频中的各帧图像进行特征提取处理,得到所述各帧图像的纹理特征,以及所述各帧图像的运动特征;
    基于所述目标视频的预设图像质量,以及所述目标视频的预设图像分辨率,确定所述各帧图像的初始码率;
    根据所述各帧图像的纹理特征和所述各帧图像的运动特征,对所述目标视频进行分类处理,得到分类结果;其中,所述分类结果用于指示在所述各帧图像的图像质量达到所述预设图像质量时,是否存在小于所述各帧图像的初始码率的码率;
    若所述分类结果指示在所述各帧图像的图像质量达到所述预设图像质量时,存在小于所述各帧图像的初始码率的码率,则根据所述各帧图像的纹理特征和所述各帧图像的运动特征,确定所述各帧图像的目标码率;其中,所述各帧图像的目标码率指的是在所述各帧图像的图像质量达到所述预设图像质量时,所述各帧图像的码率;所述各帧图像的目标码率小于所述各帧图像的初始码率;
    根据所述各帧图像的目标码率,对所述目标视频进行编码处理,得到编码后的目标视频。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述各帧图像的纹理特征和所述各帧图像的运动特征,对所述目标视频进行分类处理,得到分类结果,包括:
    调用分类模型中的各个决策分类器,根据所述各帧图像的纹理特征和所述各帧图像的运动特征,对所述目标视频进行决策处理,得到所述各个决策分类器的决策结果;其中,所述决策结果用于指示在所述各帧图像的图像质量达到所述预设图像质量时,是否存在小于所述各帧图像的初始码率的码率;
    根据所述各个决策分类器的决策权重,以及所述各个决策分类器的决策结果,得到所述分类结果。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    获取多个训练视频集;其中,任一训练视频集包括至少一个训练视频,所述至少一个训练视频中各个训练视频的目标决策结果,所述至少一个训练视频中各个训练视频的各帧图像的纹理特征和运动特征;
    调用初始分类模型中的多个决策分类器,对所述多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到所述多个训练视频集中各个训练视频的训练决策结果;
    根据所述多个训练视频集中各个训练视频的训练决策结果,以及所述多个训练视频集中各个训练视频的目标决策结果,对所述初始分类模型进行训练,得到所述分类模型。
  4. 根据权利要求3所述的方法,其特征在于,所述调用初始分类模型中的多个决策分类器,对所述多个训练视频集中各个训练视频的各帧图像的纹理特征和运动特征进行决策处理,得到所述多个训练视频集中各个训练视频的训练决策结果,包括:
    遍历初始分类模型中的各个决策分类器，基于上一次遍历的决策分类器处理得到的所述上一次遍历的决策分类器对应的训练视频集中各个训练视频的训练决策结果，从所述多个训练视频集中确定当前遍历的目标决策分类器对应的目标训练视频集；
    调用所述目标决策分类器,根据所述目标训练视频集中各个训练视频的各帧图像的纹理特征和运动特征,对所述目标训练视频集中的各个训练视频进行决策处理,得到所述目标训练视频集中的各个训练视频的训练决策结果。
  5. 根据权利要求1所述的方法,其特征在于,所述根据所述各帧图像的纹理特征和所述各帧图像的运动特征,确定所述各帧图像的目标码率,包括:
    调用回归预测模型中的各个回归预测器,对所述各帧图像的纹理特征和所述各帧图像的运动特征进行预测处理,得到所述各个回归预测器的预测码率;
    基于所述各个回归预测器的回归权重,以及所述各个回归预测器的预测码率,得到所述各帧图像的目标码率。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    获取训练视频,所述训练视频的各帧图像的训练码率,以及所述训练视频的各帧图像的纹理特征和运动特征;其中,所述训练视频的各帧图像的训练码率指的是在所述训练视频的各帧图像的图像质量达到训练图像质量时,所述训练视频的各帧图像的码率,所述训练视频的各帧图像的训练码率小于所述训练视频的各帧图像的初始码率;
    调用初始回归预测模型中的各个初始回归预测器,对所述训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到所述各个初始回归预测器的预测码率;
    根据所述各个初始回归预测器的预测码率,以及所述训练视频的各帧图像的训练码率,对所述初始回归预测模型进行训练,得到所述回归预测模型。
  7. 根据权利要求6所述的方法,其特征在于,所述调用初始回归预测模型中的各个初始回归预测器,对所述训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到所述各个初始回归预测器的预测码率,包括:
    遍历所述初始回归预测模型中的各个初始回归预测器,并调用当前遍历的初始回归预测器,对所述训练视频的各帧图像的纹理特征和运动特征进行预测处理,得到所述当前遍历的初始回归预测器的预测码率;
    所述根据所述各个初始回归预测器的预测码率,以及所述训练视频的各帧图像的训练码率,对所述初始回归预测模型进行训练,得到所述回归预测模型,包括:
    基于所述训练视频的各帧图像的训练码率,以及遍历过的初始回归预测器的预测码率,确定所述当前遍历的初始回归预测器的残差码率;
    根据当前遍历的初始回归预测器的残差码率,以及遍历过的初始回归预测器的残差码率,对所述初始回归预测模型进行训练,得到所述回归预测模型。
  8. 根据权利要求1所述的方法,其特征在于,所述各帧图像的纹理特征包括如下至少一种:灰度变化特征,灰度共生矩阵;所述各帧图像的运动特征包括如下至少一种:加权峰值信噪比,所述各帧图像中的各个像素块与所述各帧图像的目标图像中相应像素块之间的位移;其中,所述各帧图像的目标图像指的是在所述目标视频中的图像序列数与所述各帧图像在所述目标视频中的图像序列数之间的差值小于或等于预设阈值的图像;
    所述根据所述各帧图像的纹理特征和所述各帧图像的运动特征,对所述目标视频进行分类处理,得到分类结果,包括:
    将所述各帧图像的各个纹理特征进行拼接,得到所述各帧图像的目标纹理特征;
    将所述各帧图像的各个运动特征进行拼接,得到所述各帧图像的目标运动特征;
    根据所述各帧图像的目标纹理特征,以及所述各帧图像的目标运动特征,对所述目标视频进行分类处理,得到所述分类结果。
  9. 一种视频编码装置,其特征在于,所述视频编码装置包括处理单元和编码单元,其中:
    所述处理单元,用于对目标视频中的各帧图像进行特征提取处理,得到所述各帧图像的纹理特征,以及所述各帧图像的运动特征;
    所述处理单元,还用于基于所述目标视频的预设图像质量,以及所述目标视频的预设图像分辨率,确定所述各帧图像的初始码率;
    所述处理单元,还用于根据所述各帧图像的纹理特征和所述各帧图像的运动特征,对所述目标视频进行分类处理,得到分类结果;其中,所述分类结果用于指示在所述各帧图像的图像质量达到所述预设图像质量时,是否存在小于所述各帧图像的初始码率的码率;
    所述处理单元,还用于若所述分类结果指示在所述各帧图像的图像质量达到所述预设图像质量时,存在小于所述各帧图像的初始码率的码率,则根据所述各帧图像的纹理特征和所述各帧图像的运动特征,确定所述各帧图像的目标码率;其中,所述各帧图像的目标码率指的是在所述各帧图像的图像质量达到所述预设图像质量时,所述各帧图像的码率;所述各帧图像的目标码率小于所述各帧图像的初始码率;
    所述编码单元，用于根据所述各帧图像的目标码率，对所述目标视频进行编码处理，得到编码后的目标视频。
  10. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有一条或多条计算机程序,所述一条或多条计算机程序适于由处理器加载并执行如权利要求1-8任一项所述的视频编码方法。
PCT/CN2023/109319 2022-11-24 2023-07-26 视频编码方法、装置及存储介质 WO2024109138A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211514738.0A CN117750014A (zh) 2022-11-24 2022-11-24 视频编码方法、装置及存储介质
CN202211514738.0 2022-11-24

Publications (1)

Publication Number Publication Date
WO2024109138A1 true WO2024109138A1 (zh) 2024-05-30

Family

ID=90259776

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/109319 WO2024109138A1 (zh) 2022-11-24 2023-07-26 视频编码方法、装置及存储介质

Country Status (2)

Country Link
CN (1) CN117750014A (zh)
WO (1) WO2024109138A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974060A (zh) * 2013-01-31 2014-08-06 华为技术有限公司 视频质量调整方法和装置
CN105430422A (zh) * 2015-11-06 2016-03-23 济南草履虫电子科技有限公司 一种防止医学影像重建视频闪烁的方法
CN108495130A (zh) * 2017-03-21 2018-09-04 腾讯科技(深圳)有限公司 视频编码、解码方法和装置、终端、服务器和存储介质
CN111787318A (zh) * 2020-06-24 2020-10-16 浙江大华技术股份有限公司 一种视频码率控制方法、装置、设备以及存储装置
US20210360233A1 (en) * 2020-05-12 2021-11-18 Comcast Cable Communications, Llc Artificial intelligence based optimal bit rate prediction for video coding


Also Published As

Publication number Publication date
CN117750014A (zh) 2024-03-22
